A couple of days ago, I was trying to factorize a document-word count matrix with SVD. However, during the factorization step my job kept crashing because the matrix was too big.
Attempting matrix factorization with SVD in #Python. Day51 of #100daysofcode. #datascience https://t.co/08IKBoFjKM pic.twitter.com/mO9Wz9lHqm
— Celia S. Siu (@celiassiu) April 17, 2017
On Twitter, Jake suggested checking out Scikit-learn.
@celiassiu Check out the irlba R library as well or scikit learn
— Jake Lever (@jakelever0) April 17, 2017
The Jupyter Notebook for this little project is found here.
SVD with Scikit-learn
In Scikit-learn, sklearn.decomposition.TruncatedSVD computes a truncated SVD. Applying TruncatedSVD to our matrix with 100 components works!
from sklearn.decomposition import TruncatedSVD
# X is the sparse document-word count matrix
svd = TruncatedSVD(n_components=100, random_state=5)
svd.fit(X)
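As a minimal end-to-end sketch, here is the same call on a tiny hypothetical corpus (the documents and the choice of CountVectorizer are stand-ins for the real document-word matrix):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the real document-word matrix
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
X = CountVectorizer().fit_transform(docs)  # sparse document-word counts

# Reduce each document to 2 latent components
svd = TruncatedSVD(n_components=2, random_state=5)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (3, 2): one 2-dimensional vector per document
```

fit_transform returns the documents projected onto the components, which is convenient when all you want is the reduced representation.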
However, the U, Σ, and Vᵀ matrices themselves are not returned, only the reduced matrix of 100 components.
Getting U, Σ, and Vᵀ
According to maxymoo on Stack Overflow (2015), TruncatedSVD
is a wrapper, and sklearn.utils.extmath.randomized_svd
can be used to call the SVD manually.
from sklearn.utils.extmath import randomized_svd
U, s, Vh = randomized_svd(X, n_components=100, n_iter=5, random_state=5)
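A quick sanity check of what randomized_svd hands back, on a small hypothetical matrix standing in for the document-word counts: U, s, and Vh have the expected truncated shapes, and multiplying them back gives the low-rank approximation of X.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Hypothetical small matrix standing in for the document-word counts
rng = np.random.RandomState(5)
X = rng.rand(20, 50)

U, s, Vh = randomized_svd(X, n_components=5, n_iter=5, random_state=5)
print(U.shape, s.shape, Vh.shape)  # (20, 5) (5,) (5, 50)

# The rank-5 approximation of X is U @ diag(s) @ Vh
X_approx = U @ np.diag(s) @ Vh
print(X_approx.shape)  # (20, 50)
```

Note that s is returned as a 1-D array of singular values, not as the diagonal matrix Σ itself.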
extmath.randomized_svd: compute the k-truncated randomized SVD. This algorithm finds the exact truncated singular values decomposition using randomization to speed up the computations. It is particularly fast on large matrices on which you wish to extract only a small number of components.
http://scikit-learn.org/stable/developers/utilities.html
Future work
Now that I have U, Σ, and Vᵀ, I can investigate questions such as: given a document, what is the most similar document to it in the corpus?
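One way this could look (a sketch, not the eventual implementation): treat each row of U scaled by s as a document vector in the latent space, then rank documents by cosine similarity. The random matrix here is a hypothetical stand-in for the real document-word counts.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the real document-word matrix
rng = np.random.RandomState(5)
X = rng.rand(10, 40)

U, s, Vh = randomized_svd(X, n_components=5, n_iter=5, random_state=5)

# Each row of U * s is a document in the latent space
doc_vectors = U * s
sims = cosine_similarity(doc_vectors)
np.fill_diagonal(sims, -1.0)  # exclude self-similarity

query = 0
most_similar = sims[query].argmax()
print(f"Document most similar to document {query}: {most_similar}")
```

Comparing in the latent space rather than on raw counts is the usual LSA trick: documents can be similar even when they share few exact words.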