Day 56: Finding similar documents

So far, the work on finding the most similar document has been spread across several Jupyter Notebooks.

Jupyter Notebooks are excellent for exploratory analysis and trying something new, but they are cumbersome for code reuse and for building robust applications.

In this post, I describe how I reimplemented the different steps (drawing chunks of code from the different Jupyter Notebooks) as a single Python script (see below).

Bringing together the individual components

The first step is to create a working script (f47ffa9) that chains together the steps from the earlier notebooks; after that, we can modify and improve it.

Getting the data and text preprocessing (Day 49; ipynb)
Constructing the document-word count matrix (Day 50; ipynb)
Factorizing by SVD (Day 53; ipynb)
Minimizing document distances and identifying similar documents (Day 55; ipynb)
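
The four steps above can be sketched end to end as follows. This is a minimal illustration, not the script from the commit: the toy corpus, the function names (tokenize, count_matrix, reduce_by_svd, most_similar), the regex tokenizer, and the choice of cosine distance are all assumptions made for the example, and it relies only on NumPy.

```python
import re

import numpy as np

# Toy corpus (hypothetical; the real data comes from the Day 49 notebook)
DOCS = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
    "markets and stock prices fell",
]


def tokenize(text):
    """Lowercase and split on non-letters (a stand-in for real preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def count_matrix(docs):
    """Build a document-by-word count matrix and its vocabulary."""
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({w for doc in tokens for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(tokens):
        for w in doc:
            counts[i, index[w]] += 1
    return counts, vocab


def reduce_by_svd(counts, k):
    """Project documents onto the top-k singular vectors."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]


def most_similar(doc_vectors, query):
    """Return the index of the document closest to `query` by cosine distance."""
    norms = np.linalg.norm(doc_vectors, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = doc_vectors / norms[:, None]
    distances = 1.0 - unit @ unit[query]
    distances[query] = np.inf  # exclude the query document itself
    return int(np.argmin(distances))


counts, vocab = count_matrix(DOCS)
vectors = reduce_by_svd(counts, k=2)
nearest = [most_similar(vectors, i) for i in range(len(DOCS))]
print(nearest)
```

In this toy corpus the two "cat" documents and the two "markets" documents share no words across groups, so each document's nearest neighbour should be the other document on the same topic.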

Modularizing

Now that we have a working script, we can – as done in Day 28 – restructure and modularize the code for readability and ease of maintenance and reuse (f176a6a; diff).

I also like to wrap the top-level commands in an if-statement so that I can import the script as a module without inadvertently running them (d55acf9; diff).

if __name__ == '__main__':
    # Define commands here
    # ...
    pass  # placeholder so the empty block is still valid Python
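
As a rough sketch of what this layout might look like once the steps are split into functions: the names below (load_documents, find_similar, main) are hypothetical placeholders rather than the ones used in the commits, and the bodies are stubs.

```python
def load_documents():
    """Placeholder for the data loading and preprocessing step."""
    return ["first document", "second document"]


def find_similar(documents):
    """Placeholder for the SVD and distance steps; pairs each document with the next."""
    return {i: (i + 1) % len(documents) for i in range(len(documents))}


def main():
    """Run the full pipeline; only called when the file is executed as a script."""
    documents = load_documents()
    for doc, match in find_similar(documents).items():
        print(f"document {doc} -> document {match}")


if __name__ == '__main__':
    main()
```

Importing this file from another script or notebook makes load_documents and find_similar available without triggering main(); running the file directly with python executes the whole pipeline.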

Varying parameters with a parser

Finally, I use an argument parser so that I can change parameters from the command line instead of editing the script (91e1f97; diff).

import argparse


## Define parser
parser = argparse.ArgumentParser(description="")
parser.add_argument('-s', '--num_singular_values', default=100, type=int,
                    help="Number of singular values to use from SVD")
## ... more args ...


## Get and do something with arguments
args = parser.parse_args()
print(args.num_singular_values)
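
One way to fit the parser into a modular script is to build it inside a function and pass it an explicit argument list, which makes the defaults easy to check without touching the real command line. In this sketch, build_parser and the description text are hypothetical; only the --num_singular_values option comes from the snippet above.

```python
import argparse


def build_parser():
    """Create the command-line parser (the description text is made up for illustration)."""
    parser = argparse.ArgumentParser(description="Find similar documents")
    parser.add_argument('-s', '--num_singular_values', default=100, type=int,
                        help="Number of singular values to use from SVD")
    return parser


# Passing a list to parse_args() stands in for the real command line
defaults = build_parser().parse_args([])
custom = build_parser().parse_args(['-s', '25'])

print(defaults.num_singular_values)  # 100
print(custom.num_singular_values)    # 25
```

Because type=int is set, the value arrives as an integer and can be passed straight to the SVD step.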

The script

Can’t be exploring data and writing one-off scripts every day. Code needs to be made sense of and maintained. – csiu