Day64: TF-IDF

Posted by csiu on April 29, 2017 | with: 100daysofcode, Kaggle
  by Celia Siu (in Portland)

Today I look at another representation of words in documents. Previously, I used raw counts; today I consider TF-IDF.

The associated modifications to the Python script are found here.

TF-IDF

TF-IDF is another way of representing words in documents. It stands for Term Frequency-Inverse Document Frequency and is intended to reflect how important a word is to a document in a corpus. A word that appears many times in a document is considered more relevant to that document; but if the word also appears frequently across all documents (words such as “the”, “if”, or “a”), then it might not be that relevant after all.

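To make the idea concrete, here is a minimal hand-rolled sketch of the classic TF-IDF weighting (raw term count times log of inverse document frequency). Note this is an illustration of the formula, not scikit-learn's exact implementation, which uses a smoothed IDF and L2 normalization:

```python
import math

def tf_idf(term, doc, corpus):
    # term frequency: raw count of the term in this document
    tf = doc.count(term)
    # document frequency: number of documents containing the term
    df = sum(1 for d in corpus if term in d)
    # classic (unsmoothed) inverse document frequency
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "and", "the", "dog"],
]

# "the" appears in every document, so idf = log(3/3) = 0 and its weight vanishes
print(tf_idf("the", corpus[0], corpus))  # 0.0
# "cat" appears in 2 of 3 documents, so it keeps a positive weight
print(tf_idf("cat", corpus[0], corpus))  # 1 * log(3/2) ≈ 0.405
```

This shows why common stopwords get down-weighted: their document frequency equals the corpus size, driving the IDF term to zero.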

TF-IDF in Python

In Python (i.e. scikit-learn), TF-IDF is computed with TfidfVectorizer (from raw text) or TfidfTransformer (from an existing count matrix).

from sklearn.feature_extraction.text import TfidfVectorizer

# Any list of raw text strings works here
list_of_documents = [
    "the cat sat on the mat",
    "the dog ran in the park",
]

vectorizer = TfidfVectorizer()
mat = vectorizer.fit_transform(list_of_documents)

# mat is a sparse matrix; convert to a dense array to print it
mat.toarray()