Today I look at another representation of words in documents. Previously, I used raw counts; today I consider TF-IDF.
The associated modifications to the Python script are found here.
TF-IDF
TF-IDF is another representation of words in documents. It stands for Term Frequency-Inverse Document Frequency and is intended to reflect how important a word is to a document in a corpus. A word that appears more often in a document is considered more relevant to that document; but if the word is also frequent across the corpus in general (words such as “the”, “if”, or “a”), then it is likely not that informative.
- Reference: Tf–idf term weighting
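To make the idea concrete, here is a minimal sketch of the computation on a tiny hypothetical corpus (the documents and helper functions below are my own illustration, not from the post). I use the smoothed IDF variant, log((1 + n) / (1 + df)) + 1, which matches scikit-learn's default; other libraries may use slightly different formulas.

```python
import math

# Hypothetical three-document corpus for illustration
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew away",
]

def tf(term, doc):
    """Term frequency: share of the document's words that are `term`."""
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    """Smoothed inverse document frequency (scikit-learn's default form)."""
    n = len(docs)
    df = sum(term in d.split() for d in docs)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" appears in every document, so its IDF is the minimum (exactly 1 here),
# while the rarer "bird" gets a higher IDF and thus a higher weight.
print(idf("the", docs))   # lowest
print(idf("cat", docs))
print(idf("bird", docs))  # highest
```

Note that this sketch skips the L2 normalization scikit-learn applies to each document vector by default, so the raw numbers will not match TfidfVectorizer's output exactly.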
TF-IDF in Python
In Python (e.g., with scikit-learn), TF-IDF is computed by TfidfVectorizer (from raw text) or TfidfTransformer (from a count matrix).
from sklearn.feature_extraction.text import TfidfVectorizer

# Example input; replace with your own corpus
list_of_documents = ["the cat sat on the mat", "the dog chased the cat"]

vectorizer = TfidfVectorizer()
mat = vectorizer.fit_transform(list_of_documents)

# Convert the sparse matrix to a dense array and print it
print(mat.toarray())
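If you already have a term-count matrix, the TfidfTransformer route mentioned above gives the same result; a sketch, using a CountVectorizer to build the counts first (the documents are a made-up example):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical documents for illustration
docs = ["the cat sat on the mat", "the dog chased the cat"]

# Step 1: raw term counts
counts = CountVectorizer().fit_transform(docs)

# Step 2: reweight the counts into TF-IDF values
tfidf = TfidfTransformer().fit_transform(counts)

print(tfidf.toarray())
```

By default each row of the result is L2-normalized, so every document vector has unit length; pass norm=None to TfidfTransformer to disable this.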