A couple of days ago, I was trying to factorize a document-word count matrix with SVD. However, during the factorization step my job kept crashing because the matrix was too big.
Attempting matrix factorization with SVD in #Python. Day51 of #100daysofcode. #datascience https://t.co/08IKBoFjKM pic.twitter.com/mO9Wz9lHqm
— Celia S. Siu (@celiassiu) April 17, 2017
On Twitter, Jake suggested checking out Scikit-learn.
@celiassiu Check out the irlba R library as well or scikit learn
— Jake Lever (@jakelever0) April 17, 2017
The Jupyter Notebook for this little project is found here.
SVD with Scikit-learn
In Scikit-learn, sklearn.decomposition.TruncatedSVD computes a truncated SVD. Applying TruncatedSVD to our matrix with 100 components works!
from sklearn.decomposition import TruncatedSVD
# X is the sparse document-word count matrix
svd = TruncatedSVD(n_components=100, random_state=5)
svd.fit(X)
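As a minimal end-to-end sketch, here is the same call on a tiny hypothetical corpus (the documents and the choice of CountVectorizer are stand-ins for the real document-word matrix):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the real document-word matrix
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
X = CountVectorizer().fit_transform(docs)  # sparse document-word counts

# Reduce each document to 2 latent components
svd = TruncatedSVD(n_components=2, random_state=5)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (3, 2): one 2-dimensional vector per document
```

fit_transform returns the documents projected onto the components, which is convenient when all you want is the reduced representation.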
However, the U, Σ, and Vᵀ matrices themselves are not returned, only the reduced matrix of 100 components.
Getting U, Σ, and Vᵀ
According to maxymoo on Stack Overflow (2015), TruncatedSVD
is a wrapper, and sklearn.utils.extmath.randomized_svd
can be used to call the SVD manually.
from sklearn.utils.extmath import randomized_svd
U, s, Vh = randomized_svd(X, n_components=100, n_iter=5, random_state=5)
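A quick sanity check of what randomized_svd hands back, on a small hypothetical matrix standing in for the document-word counts: U, s, and Vh have the expected truncated shapes, and multiplying them back gives the low-rank approximation of X.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Hypothetical small matrix standing in for the document-word counts
rng = np.random.RandomState(5)
X = rng.rand(20, 50)

U, s, Vh = randomized_svd(X, n_components=5, n_iter=5, random_state=5)
print(U.shape, s.shape, Vh.shape)  # (20, 5) (5,) (5, 50)

# The rank-5 approximation of X is U @ diag(s) @ Vh
X_approx = U @ np.diag(s) @ Vh
print(X_approx.shape)  # (20, 50)
```

Note that s is returned as a 1-D array of singular values, not as the diagonal matrix Σ itself.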
extmath.randomized_svd: compute the k-truncated randomized SVD. This algorithm finds the exact truncated singular values decomposition using randomization to speed up the computations. It is particularly fast on large matrices on which you wish to extract only a small number of components.
http://scikit-learn.org/stable/developers/utilities.html
Future work
Now that I have U, Σ, and Vᵀ, I can investigate questions such as: given a document, what is the most similar document to it in the corpus?
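One way this could look (a sketch, not the eventual implementation): treat each row of U scaled by s as a document vector in the latent space, then rank documents by cosine similarity. The random matrix here is a hypothetical stand-in for the real document-word counts.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the real document-word matrix
rng = np.random.RandomState(5)
X = rng.rand(10, 40)

U, s, Vh = randomized_svd(X, n_components=5, n_iter=5, random_state=5)

# Each row of U * s is a document in the latent space
doc_vectors = U * s
sims = cosine_similarity(doc_vectors)
np.fill_diagonal(sims, -1.0)  # exclude self-similarity

query = 0
most_similar = sims[query].argmax()
print(f"Document most similar to document {query}: {most_similar}")
```

Comparing in the latent space rather than on raw counts is the usual LSA trick: documents can be similar even when they share few exact words.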