Day57: Caching results

Posted by csiu on April 22, 2017 | with: 100daysofcode

Running the script takes 10 minutes. This is too slow, especially when I want to find similar documents for different query documents. Today I implement a caching system to store the preprocessed results (a minimal sketch of the idea follows the pipeline below).

  1. Get data
  2. Preprocess documents <-- bottleneck
  3. Count matrix
  4. SVD
  5. Compute distances
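
As a rough sketch of the idea (get_data, preprocess, and preprocess_file below are stand-ins, not the actual names in my script), caching means: if a saved copy of the preprocessed documents already exists, load it and skip step 2; otherwise do the slow work once and save the result for next time.

import os
import pandas as pd

preprocess_file = "preprocessed_documents.pkl"  # hypothetical cache path

def get_data():
    # Stand-in for step 1; the real script reads the document collection
    return pd.DataFrame({"text": ["raw document one", "raw document two"]})

def preprocess(df):
    # Stand-in for step 2, the 10-minute bottleneck in the real script
    return df.assign(text=df["text"].str.lower())

if os.path.exists(preprocess_file):
    # Cache hit: reuse the expensive preprocessing from a previous run
    df = pd.read_pickle(preprocess_file)
else:
    # Cache miss: do the slow work once, then save the result for next time
    df = preprocess(get_data())
    df.to_pickle(preprocess_file)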


Caching by saving intermediates

There are a couple of ways to save the intermediate data frame: write it out as a tab-separated CSV file, or pickle it.

# Option 1: tab-separated CSV
df.to_csv(preprocess_file, index=False, sep="\t")
df = pd.read_csv(preprocess_file, sep="\t")

# Option 2: pickle
df.to_pickle(preprocess_file + ".pkl")
df = pd.read_pickle(preprocess_file + ".pkl")


When I compared the two saved files, I found both were 34M in size according to ls -lh. However, when I tried to load and use the CSV file in Python, I got an error complaining about value types.

ValueError: np.nan is an invalid document, expected byte or unicode string.
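
My guess is that the CSV round trip introduces the NaN: pandas' read_csv turns empty fields back into NaN by default, and the vectorizer (the error message looks like scikit-learn's) refuses NaN documents, whereas pickle preserves the strings exactly. A quick sketch with made-up data:

import pandas as pd

# Made-up example: one document ends up as an empty string after preprocessing
df = pd.DataFrame({"id": [1, 2], "text": ["some preprocessed document", ""]})

df.to_csv("tmp.tsv", index=False, sep="\t")
df.to_pickle("tmp.pkl")

print(pd.read_csv("tmp.tsv", sep="\t")["text"].tolist())
# ['some preprocessed document', nan]  <- the empty string comes back as NaN
print(pd.read_pickle("tmp.pkl")["text"].tolist())
# ['some preprocessed document', '']   <- pickle preserves the value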

As a result, I used the pickled file in production (see script: ea60d03; diff).


How well did I do?

Originally, the script took 10 minutes to run. With the cache in place, it takes 20 seconds. That makes subsequent runs 30 times faster.

Much better.