When it comes to finding the most similar document, the work so far was spread across a number of Jupyter Notebooks.
Jupyter Notebooks are excellent for exploratory analysis and trying something new, but they are cumbersome for code reuse and robust applications.
In this post, I describe how I reimplemented the different steps (drawing chunks of code from the different Jupyter Notebooks) as a single Python script (see below).
Bringing together the individual components
The first step is to create a working script (f47ffa9) that we can then modify and improve.
- Getting the data and text preprocessing (Day 49; ipynb)
- Constructing the document-word count matrix (Day 50; ipynb)
- Factorizing by SVD (Day 53; ipynb)
- Minimizing document distances and identifying similar documents (Day 55; ipynb)
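The four steps above can be sketched end to end. This is a minimal illustration on a toy corpus, not the actual script; it assumes scikit-learn's `CountVectorizer` and `TruncatedSVD` as stand-ins for the preprocessing and SVD steps:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the real documents
docs = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "stock prices rose sharply today",
]

# Steps 1-2: preprocess the text and build the document-word count matrix
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# Step 3: factorize with truncated SVD to get low-dimensional document vectors
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(counts)

# Step 4: minimize document distance, i.e. find the document
# most similar (by cosine similarity) to document 0
sims = cosine_similarity(doc_vectors[:1], doc_vectors)[0]
most_similar = sims[1:].argmax() + 1  # skip self-similarity
print(most_similar)  # → 1 ("a cat lay on the rug")
```

Here document 1 wins because it shares vocabulary ("cat") with document 0, while document 2 uses entirely different words.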
Modularizing
Now that we have a working script, we can – as done in Day 28 – restructure and modularize the code for readability and ease of maintenance and reuse (f176a6a; diff).
I also like to wrap the code in an if-statement so that I can load the script as a module without inadvertently running the commands (d55acf9; diff).
```python
if __name__ == '__main__':
    # Define commands here
    # ...
```
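As a concrete sketch of why this matters (the `tokenize` helper here is hypothetical, not from the actual script): everything under the guard runs only when the file is executed directly, so importing the module gives access to its functions without triggering the script's commands.

```python
def tokenize(text):
    """Hypothetical helper: lowercase and split on whitespace."""
    return text.lower().split()

def main():
    print(tokenize("Finding the MOST similar document"))

if __name__ == '__main__':
    # Runs only on `python script.py`,
    # not on `import script` from another module.
    main()
```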
Varying parameters with a parser
Finally, I make use of a parser to be able to alter my parameters from the command line (91e1f97; diff).
```python
import argparse

## Define parser
parser = argparse.ArgumentParser(description="")
parser.add_argument('-s', '--num_singular_values', default=100, type=int,
                    help="Number of singular values to use from SVD")
## ... more args ...

## Get and do something with arguments
args = parser.parse_args()
print(args.num_singular_values)
```
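One handy property of argparse is that `parse_args` also accepts an explicit argument list, which makes it easy to check the parsing behaviour without touching the shell. A small sketch of that:

```python
import argparse

parser = argparse.ArgumentParser(description="")
parser.add_argument('-s', '--num_singular_values', default=100, type=int,
                    help="Number of singular values to use from SVD")

# With no flags, the default applies
args = parser.parse_args([])
print(args.num_singular_values)  # → 100

# The short flag (or --num_singular_values) overrides it
args = parser.parse_args(['-s', '50'])
print(args.num_singular_values)  # → 50
```

From the shell, the second case corresponds to running the script as `python script.py -s 50`.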
The script
> Can’t be exploring data and writing one-off scripts every day. Code needs to be made sense of and maintained. – csiu