Running the script takes 10 minutes, which is too slow, especially when I want to find the documents most similar to several different query documents. Today I implemented a caching system to store the preprocessed results. The pipeline of the script looks like this:
- Get data
- Preprocess documents <-- bottleneck
- Count matrix
- SVD
- Compute distances
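For context, the pipeline roughly corresponds to the sketch below. The function names and the scikit-learn calls are my assumptions for illustration, not the actual script:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_distances

# Get data: load_documents and preprocess are hypothetical placeholders.
raw_docs = load_documents()

# Preprocess documents <-- the slow step worth caching.
docs = [preprocess(doc) for doc in raw_docs]

# Build the count matrix.
counts = CountVectorizer().fit_transform(docs)

# Reduce dimensionality with SVD.
embeddings = TruncatedSVD(n_components=100).fit_transform(counts)

# Compute pairwise distances between documents.
distances = cosine_distances(embeddings)
```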
Caching by saving intermediates
There are a couple of ways to save the data:
- as a human-readable CSV file (`pandas.DataFrame.to_csv`), or

```python
df.to_csv(preprocess_file, index=False, sep="\t")
df = pd.read_csv(preprocess_file, sep="\t")
```
- as a binary pickle file (`pandas.DataFrame.to_pickle`)

```python
df.to_pickle(preprocess_file + ".pkl")
df = pd.read_pickle(preprocess_file + ".pkl")
```
Comparing the two saved files, I found that both are 34M in size according to `ls -lh`. However, when I try to load the CSV file and use it in Python, I get an error complaining about value types:

```
ValueError: np.nan is an invalid document, expected byte or unicode string.
```

The CSV round trip is the likely culprit: `read_csv` parses blank fields as `np.nan` by default, and the vectorizer downstream only accepts string documents.
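Here is a tiny reproduction of that behavior (my own illustration, not part of the script): an empty string written to CSV comes back as `np.nan`, while pickle round-trips it unchanged.

```python
import pandas as pd

df = pd.DataFrame({"doc_id": [1, 2], "text": ["some tokens", ""]})

# CSV round trip: the blank field comes back as NaN by default.
df.to_csv("demo.tsv", index=False, sep="\t")
print(pd.read_csv("demo.tsv", sep="\t")["text"].tolist())  # ['some tokens', nan]

# Pickle round trip: the empty string survives as-is.
df.to_pickle("demo.pkl")
print(pd.read_pickle("demo.pkl")["text"].tolist())  # ['some tokens', '']
```

Passing `keep_default_na=False` to `read_csv` would probably avoid the error, but pickle sidesteps the issue entirely.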
As a result, I used the pickled file in production (see script: ea60d03; diff).
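The caching itself is just a check-or-compute wrapper around the preprocessing step. A minimal sketch, assuming a cache file name and a `preprocess_documents` helper that are not the actual names in ea60d03:

```python
import os

import pandas as pd

PREPROCESS_FILE = "preprocessed.pkl"  # assumed cache location

def get_preprocessed(raw_docs):
    # Subsequent runs: load the cached intermediate instead of redoing
    # roughly ten minutes of preprocessing.
    if os.path.exists(PREPROCESS_FILE):
        return pd.read_pickle(PREPROCESS_FILE)
    # First run: preprocess, then save the result for next time.
    df = preprocess_documents(raw_docs)  # hypothetical preprocessing step
    df.to_pickle(PREPROCESS_FILE)
    return df
```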
How well did I do?
Originally, the script took 10 minutes to run. Now it takes 20 seconds. We have made the script 30 times faster (for subsequent runs).
Much better.