Running the script takes 10 minutes, which is too slow, especially when I want to find the documents most similar to several different query documents. Today I implemented a caching system to store the preprocessed results. The pipeline of the script looks like this:
- Get data
- Preprocess documents <-- bottleneck
- Count matrix
- SVD
- Compute distances
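For context, the pipeline roughly corresponds to the sketch below. The function names and the scikit-learn calls are my assumptions for illustration, not the actual script:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_distances

# Get data: load_documents and preprocess are hypothetical placeholders.
raw_docs = load_documents()

# Preprocess documents <-- the slow step worth caching.
docs = [preprocess(doc) for doc in raw_docs]

# Build the count matrix.
counts = CountVectorizer().fit_transform(docs)

# Reduce dimensionality with SVD.
embeddings = TruncatedSVD(n_components=100).fit_transform(counts)

# Compute pairwise distances between documents.
distances = cosine_distances(embeddings)
```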
Caching by saving intermediates
There are a couple of ways to save the data:
- as a human-readable CSV file (`pandas.DataFrame.to_csv`), or

```python
df.to_csv(preprocess_file, index=False, sep="\t")
df = pd.read_csv(preprocess_file, sep="\t")
```
- as a binary pickle file (`pandas.DataFrame.to_pickle`)

```python
df.to_pickle(preprocess_file + ".pkl")
df = pd.read_pickle(preprocess_file + ".pkl")
```
Comparing the two saved files, I found that both are 34M in size according to `ls -lh`. However, when I try to load the CSV file and use it in Python, I get an error complaining about value types:

```
ValueError: np.nan is an invalid document, expected byte or unicode string.
```

The CSV round trip is the likely culprit: `read_csv` parses blank fields as `np.nan` by default, and the vectorizer downstream only accepts string documents.
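Here is a tiny reproduction of that behavior (my own illustration, not part of the script): an empty string written to CSV comes back as `np.nan`, while pickle round-trips it unchanged.

```python
import pandas as pd

df = pd.DataFrame({"doc_id": [1, 2], "text": ["some tokens", ""]})

# CSV round trip: the blank field comes back as NaN by default.
df.to_csv("demo.tsv", index=False, sep="\t")
print(pd.read_csv("demo.tsv", sep="\t")["text"].tolist())  # ['some tokens', nan]

# Pickle round trip: the empty string survives as-is.
df.to_pickle("demo.pkl")
print(pd.read_pickle("demo.pkl")["text"].tolist())  # ['some tokens', '']
```

Passing `keep_default_na=False` to `read_csv` would probably avoid the error, but pickle sidesteps the issue entirely.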
As a result, I used the pickled file in production (see script: ea60d03; diff).
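The caching itself is just a check-or-compute wrapper around the preprocessing step. A minimal sketch, assuming a cache file name and a `preprocess_documents` helper that are not the actual names in ea60d03:

```python
import os

import pandas as pd

PREPROCESS_FILE = "preprocessed.pkl"  # assumed cache location

def get_preprocessed(raw_docs):
    # Subsequent runs: load the cached intermediate instead of redoing
    # roughly ten minutes of preprocessing.
    if os.path.exists(PREPROCESS_FILE):
        return pd.read_pickle(PREPROCESS_FILE)
    # First run: preprocess, then save the result for next time.
    df = preprocess_documents(raw_docs)  # hypothetical preprocessing step
    df.to_pickle(PREPROCESS_FILE)
    return df
```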
How well did I do?
Originally, the script took 10 minutes to run. Now it takes 20 seconds. We have made the script 30 times faster (for subsequent runs).
Much better.