Day56: Finding the similar documents

Posted by csiu on April 21, 2017 | with: 100daysofcode, Machine Learning

The work of finding the most similar document was spread across a number of Jupyter Notebooks.

Jupyter Notebooks are excellent for exploratory analysis and trying something new, but they are cumbersome for code reuse and robust applications.

In this post, I describe how I reimplemented the different steps (drawing chunks of code from the different Jupyter Notebooks) into a single Python script (see below).


Bringing together the individual components

The first step is to create a working script (f47ffa9) out of the following components; then we can modify it and make it better.

  • Getting the data and text preprocessing (Day 49; ipynb)
  • Constructing the document-word count matrix (Day 50; ipynb)
  • Factorizing by SVD (Day 53; ipynb)
  • Minimizing document distances and identifying similar documents (Day 55; ipynb)
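To show how these four steps fit together in one script, here is a minimal sketch. It is not the code from the commits above: the toy documents, the function names, the use of NumPy, and the choice of cosine distance are all assumptions made for illustration.

```python
import re
from collections import Counter

import numpy as np


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_matrix(docs):
    """Build a documents x vocabulary matrix of raw word counts."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = np.array([[c[word] for word in vocab] for c in counts], dtype=float)
    return matrix, vocab


def reduce_by_svd(matrix, num_singular_values):
    """Project documents onto the top singular directions (U_k * S_k)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    k = min(num_singular_values, len(s))
    return u[:, :k] * s[:k]


def most_similar(doc_vectors, query_index):
    """Return the index of the document with the smallest cosine distance."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    unit = doc_vectors / np.where(norms == 0, 1.0, norms)
    distances = 1.0 - unit @ unit[query_index]
    distances[query_index] = np.inf  # exclude the query document itself
    return int(np.argmin(distances))


# Toy corpus (assumption): two documents about cats, two about markets.
docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
    "markets and stock prices fell today",
]
matrix, vocab = count_matrix(docs)
doc_vectors = reduce_by_svd(matrix, num_singular_values=2)
closest = most_similar(doc_vectors, query_index=0)
```

Each function maps to one bullet above, so a step can be swapped (for example, a different distance measure) without touching the others.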


Modularizing

Now that we have a working script, we can – as done in Day 28 – restructure and modularize the code for readability and ease of maintenance and reuse (f176a6a; diff).

I also like to wrap the code with an if-statement so that I can load the script as a module without inadvertently running the commands (d55acf9; diff).

if __name__ == '__main__':
    # Define commands here
    # ...
    pass  # placeholder so the block is valid Python
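As a slightly fuller sketch of the same idea, the commands can live in a `main()` function that only the guard calls. The function names and the step descriptions below are hypothetical, not taken from the actual script.

```python
def describe_pipeline():
    """Return the pipeline steps in the order the script would run them."""
    return [
        "get data and preprocess text",
        "build document-word count matrix",
        "factorize by SVD",
        "find the most similar documents",
    ]


def main():
    """Entry point: called only when the file is executed as a script."""
    for step_number, step in enumerate(describe_pipeline(), start=1):
        print(f"{step_number}. {step}")


if __name__ == '__main__':
    main()
```

Importing this file from another module or a notebook makes `describe_pipeline()` available without printing anything, because `__name__` is then the module name rather than `'__main__'`.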


Varying parameters with a parser

Finally, I use an argument parser so that I can change my parameters from the command line (91e1f97; diff).

import argparse


## Define parser
parser = argparse.ArgumentParser(description="")
parser.add_argument('-s', '--num_singular_values', default=100, type=int,
                    help="Number of singular values to use from SVD")
## ... more args ...


## Get and do something with arguments
args = parser.parse_args()
print(args.num_singular_values)
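One way to make the parser easier to try out from a notebook or a test is to build it in a function and pass the argument list explicitly. This is a sketch under that assumption: `build_parser`, `parse_args`, and the description string are hypothetical, and only the `-s`/`--num_singular_values` option comes from the snippet above.

```python
import argparse


def build_parser():
    """Create the parser; only -s/--num_singular_values is from the post."""
    parser = argparse.ArgumentParser(
        description="Find similar documents (hypothetical description)")
    parser.add_argument('-s', '--num_singular_values', default=100, type=int,
                        help="Number of singular values to use from SVD")
    return parser


def parse_args(argv=None):
    """Parse argv; with argv=None, argparse reads sys.argv[1:]."""
    return build_parser().parse_args(argv)


# An explicit list stands in for what would be typed on the command line.
default_args = parse_args([])
custom_args = parse_args(['-s', '25'])
long_form_args = parse_args(['--num_singular_values', '10'])
```

Because of `type=int`, the value arrives as an integer, and leaving the option out falls back to the default of 100.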


The script


Can’t be exploring data and writing one-off scripts every day. Code needs to be made sense of and maintained. – csiu