Day53: SVD in Scikit-learn

Posted by csiu on April 18, 2017 | with: 100daysofcode, Machine Learning

A couple of days ago, I was trying to factorize a document-word count matrix with SVD. However, my job kept crashing during the factorization step because the matrix was too big.

On Twitter, Jake suggested checking out Scikit-learn.

The Jupyter Notebook for this little project is found here.

SVD with Scikit-learn

In Scikit-learn, sklearn.decomposition.TruncatedSVD computes a truncated SVD: it keeps only the top n_components singular vectors and accepts sparse matrices directly.

Applying TruncatedSVD to our matrix with 100 components works!

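from sklearn.decomposition import TruncatedSVD

# X is the document-word count matrix
svd = TruncatedSVD(n_components=100, random_state=5)
X_reduced = svd.fit_transform(X)  # one row per row of X, 100 columns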

However, the $U$, $\Sigma$, and $V^T$ matrices are not returned directly (only the reduced matrix with 100 components).
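To make concrete what TruncatedSVD hands back, here is a minimal sketch on a toy sparse matrix. X_toy, its size, and its density are made up for illustration and only stand in for the real document-word matrix.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for the document-word matrix: 200 "documents" x 50 "words"
X_toy = sparse_random(200, 50, density=0.05, format="csr", random_state=5)

svd = TruncatedSVD(n_components=10, random_state=5)
X_reduced = svd.fit_transform(X_toy)

# fit_transform returns one dense matrix: one row per document, one column per component
print(X_reduced.shape)

# The fitted object keeps the right singular vectors, one row per component
print(svd.components_.shape)
```

The return value is a single reduced matrix, so the three factors still have to be obtained another way if they are needed separately.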

Getting $U$, $\Sigma$, and $V^T$

According to maxymoo on Stack Overflow (2015), TruncatedSVD is basically a wrapper around sklearn.utils.extmath.randomized_svd, which can be called directly to get the three matrices.

from sklearn.utils.extmath import randomized_svd

U, s, Vh = randomized_svd(X, n_components=100, n_iter=5, random_state=5)

extmath.randomized_svd: compute the k-truncated randomized SVD. This algorithm finds a (usually very good) approximate truncated singular value decomposition, using randomization to speed up the computations. It is particularly fast on large matrices from which you wish to extract only a small number of components.
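As a quick sanity check of how the three outputs fit together, here is a small sketch on a toy dense matrix. X_toy is an assumption for illustration: it is built to have rank 5, the same as the number of requested components, so the result should agree with NumPy's full SVD up to floating-point error.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(5)

# Hypothetical toy matrix: 50 x 30, built as a product so its rank is 5
X_toy = rng.rand(50, 5) @ rng.rand(5, 30)

k = 5
U, s, Vh = randomized_svd(X_toy, n_components=k, n_iter=5, random_state=5)

# U is (rows x k), s is a 1-D array of k singular values, Vh is (k x columns)
print(U.shape, s.shape, Vh.shape)

# Rebuild the rank-k approximation: U @ diag(s) @ Vh
X_approx = (U * s) @ Vh

# Compare against the top-k singular values from the full SVD
s_exact = np.linalg.svd(X_toy, compute_uv=False)[:k]
print(np.allclose(s, s_exact))
print(np.allclose(X_approx, X_toy))
```

Note that s comes back as a vector rather than a diagonal matrix, so `U * s` is a cheap way to scale the columns of U without building `np.diag(s)`.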

Future work

Now that I have $U$, $\Sigma$, and $V^T$, I can investigate questions such as: given a document, which document in the corpus is most similar to it?
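One possible way to approach the similarity question is sketched below. It is only a sketch under assumptions of mine: documents are rows of the matrix, each document is represented by its row of U scaled by the singular values, and similarity is measured with cosine similarity. The toy matrix X_toy is made up, with document 7 planted as a scaled copy of document 3.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(5)

# Hypothetical toy corpus: 30 documents x 60 words, with small random counts
X_toy = rng.poisson(0.5, size=(30, 60)).astype(float)
# Plant a near-duplicate: document 7 has the same word proportions as document 3
X_toy[7] = 2 * X_toy[3]

U, s, Vh = randomized_svd(X_toy, n_components=5, n_iter=5, random_state=5)

# Assumed document representation: rows of U scaled by the singular values
doc_vectors = U * s

query = 3
sims = cosine_similarity(doc_vectors[query:query + 1], doc_vectors).ravel()
sims[query] = -np.inf  # do not return the query document itself
most_similar = int(np.argmax(sims))

# Document 7 is expected here, since it points in the same direction as document 3
print(most_similar)
```

Cosine similarity ignores document length, which is why the doubled copy still counts as a perfect match; a different distance could be swapped in if length should matter.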