I eventually want to do text analysis with the Kickstarter data, but first I’ll need to do some data cleaning and text preprocessing.
The Jupyter Notebook for this little project is found here.
Select data for analysis
The data for this project has been loaded into a PostgreSQL database; its columns are listed in the Jupyter Notebook of Day44. From the data, I want to use the columns "name" and "blurb".
| | id | name | blurb |
|---|---|---|---|
| 0 | 1312331512 | Otherkin The Animated Series | We have a fully developed 2D animated series t... |
| 1 | 80827270 | Paradigm Spiral - The Animated Series | A sci-fi fantasy 2.5D anime styled series abou... |
| 2 | 737219121 | I'm Sticking With You. | A film created entirely out of paper, visual e... |
Both columns are text fields and I want to combine and treat them as a single “document”.
document - collection of text
In PostgreSQL, you can concatenate columns with a whitespace separator using the concat_ws function; note that the separator is its first argument.
SELECT id, concat_ws(' ', name, blurb) AS document FROM info
| | id | document |
|---|---|---|
| 0 | 1312331512 | Otherkin The Animated Series We have a fully developed 2D animated series t... |
| 1 | 80827270 | Paradigm Spiral - The Animated Series A sci-fi fantasy 2.5D anime styled series abou... |
| 2 | 737219121 | I'm Sticking With You. A film created entirely out of paper, visual e... |
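To get these results into Python for the preprocessing steps below, one option is to read the query into a pandas DataFrame. This is a minimal sketch; the connection string (database name, user, host) is a placeholder and not from the original notebook:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- swap in the actual database credentials
engine = create_engine("postgresql://user:password@localhost:5432/kickstarter")

# Combine name and blurb into a single "document" per project
query = "SELECT id, concat_ws(' ', name, blurb) AS document FROM info"
df = pd.read_sql(query, engine)

print(df.head(3))
```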
Text preprocessing
In text preprocessing, we do the following:
# Given some text
# (https://www.kickstarter.com/projects/44334963/otherkin-the-animated-series/comments)
'A sci-fi fantasy 2.5D anime styled series about some guys trying to save the world, probably...'
# 1. Convert text to lowercase
'a sci-fi fantasy 2.5d anime styled series about some guys trying to save the world, probably...'
# 2. Remove non-letters (such as digits and punctuation)
'a scifi fantasy d anime styled series about some guys trying to save the world probably'
# 3. Tokenize (i.e. break the text into meaningful elements such as words)
['a',
'scifi',
'fantasy',
'd',
'anime',
'styled',
'series',
'about',
'some',
'guys',
'trying',
'to',
'save',
'the',
'world',
'probably']
# 4. Remove stopwords
['scifi',
'fantasy',
'anime',
'styled',
'series',
'guys',
'trying',
'save',
'world',
'probably']
In Python’s nltk.corpus, there are 153 English stopwords which we can use to filter out words that carry little meaning.
stopwords - common words which do not carry significant meaning
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
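If the stopword list (and the WordNet data used later for lemmatizing) is not yet available locally, it can be fetched once with nltk.download. This is a small setup sketch; the resource names follow the standard NLTK distribution:

```python
import nltk

# One-time downloads of the NLTK resources used in this post
nltk.download("stopwords")  # stopword lists, including English
nltk.download("wordnet")    # lexical database backing WordNetLemmatizer

from nltk.corpus import stopwords
print(len(stopwords.words("english")))  # size of the English stopword list
```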
- Reference for removing stop words: Part 1: For Beginners - Bag of Words
- Reference for stemming and lemmatizing: How do I do word Stemming or Lemmatization?
Stemming vs Lemmatizing
At this point, there are various things we can do to the data including stemming and lemmatizing which would allow us to treat “greets”, “greet”, and “greeting” as the same word.
stemming - process of reducing inflected (or sometimes derived) words to their word stem, base or root form
lemmatizing - process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form
Applying the PorterStemmer and WordNetLemmatizer to the example, we find:
| | raw | stemming | lemmatizing |
|---|---|---|---|
| 0 | scifi | scifi | scifi |
| 1 | fantasy | fantasi | fantasy |
| 2 | anime | anim | anime |
| 3 | styled | style | styled |
| 4 | series | seri | series |
| 5 | guys | guy | guy |
| 6 | trying | tri | trying |
| 7 | save | save | save |
| 8 | world | world | world |
| 9 | probably | probabl | probably |
- Some words are untouched:
- scifi
- save
- world
- Some words are touched only in stemming:
- fantasy->fantasi
- anime->anim
- styled->style
- series->seri
- trying->tri
- probably->probabl
- Some words agree in their stemming and lemmatizing:
- guys->guy
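For reference, here is roughly how the comparison above can be produced with NLTK. This is a sketch that assumes the stopword-filtered token list from the earlier example:

```python
import pandas as pd
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = ['scifi', 'fantasy', 'anime', 'styled', 'series',
          'guys', 'trying', 'save', 'world', 'probably']

port = PorterStemmer()
wnl = WordNetLemmatizer()

# Stem and lemmatize each token and line the results up side by side
comparison = pd.DataFrame({
    'raw': tokens,
    'stemming': [port.stem(w) for w in tokens],
    'lemmatizing': [wnl.lemmatize(w) for w in tokens],
})
print(comparison)
```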
Putting it all together
```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer


def text_processing(text, method=None):
    # Lower case
    text = text.lower()
    # Remove non-letters &
    # tokenize
    words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z_ ]', '', text))
    # Remove stop words
    words = [w for w in words if w not in stopwords.words("english")]
    # Stemming vs lemmatizing vs do nothing
    if method == "stem":
        port = PorterStemmer()
        words = [port.stem(w) for w in words]
    elif method == "lemm":
        wnl = WordNetLemmatizer()
        words = [wnl.lemmatize(w) for w in words]
    return words
```
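As a quick check, the function can be applied to the example blurb from above. The outputs shown as comments are what the earlier steps produced (they match the tokens and stems in the tables in this post), not output copied from the original notebook:

```python
blurb = ('A sci-fi fantasy 2.5D anime styled series about some guys '
         'trying to save the world, probably...')

print(text_processing(blurb))
# ['scifi', 'fantasy', 'anime', 'styled', 'series',
#  'guys', 'trying', 'save', 'world', 'probably']

print(text_processing(blurb, method="stem"))
# ['scifi', 'fantasi', 'anim', 'style', 'seri',
#  'guy', 'tri', 'save', 'world', 'probabl']
```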