Day49: Text preprocessing

Posted by csiu on April 14, 2017 | with: 100daysofcode

I eventually want to do text analysis with the Kickstarter data, but first I’ll need to do some data cleaning and text preprocessing.

The Jupyter Notebook for this little project is found here.

Select data for analysis

The data for this project has been loaded into a PostgreSQL database; the columns of the database are described in the Jupyter Notebook of Day44. From the data, I want to use the "name" and "blurb" columns.

   id          name                                   blurb
0  1312331512  Otherkin The Animated Series           We have a fully developed 2D animated series t...
1  80827270    Paradigm Spiral - The Animated Series  A sci-fi fantasy 2.5D anime styled series abou...
2  737219121   I'm Sticking With You.                 A film created entirely out of paper, visual e...

Both columns are text fields, and I want to combine them and treat the result as a single "document".

document - collection of text

In PostgreSQL, you can concatenate columns with a separator using the concat_ws function. Its first argument is the separator (here, a single space).

SELECT id, concat_ws(' ', name, blurb) AS document FROM info
   id          document
0  1312331512  Otherkin The Animated Series We have a fully d...
1  80827270    Paradigm Spiral - The Animated Series A sci-fi...
2  737219121   I'm Sticking With You. A film created entirely...
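
As a rough sketch, the query result can be pulled into pandas like so (the connection details below are hypothetical; adjust them to your own setup):

import pandas as pd
import psycopg2

conn = psycopg2.connect(dbname="kickstarter")  # hypothetical database name
df = pd.read_sql("SELECT id, concat_ws(' ', name, blurb) AS document FROM info",
                 conn)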

Text preprocessing

In text preprocessing, we do the following:

# Given some text
# (https://www.kickstarter.com/projects/44334963/otherkin-the-animated-series/comments)
'A sci-fi fantasy 2.5D anime styled series about some guys trying to save the world, probably...'

# 1. Convert text to lowercase
'a sci-fi fantasy 2.5d anime styled series about some guys trying to save the world, probably...'

# 2. Remove non-letters (such as digits and punctuation)
'a scifi fantasy d anime styled series about some guys trying to save the world probably'

# 3. Tokenize (i.e. break text into meaningful elements such as words)
    ['a',
     'scifi',
     'fantasy',
     'd',
     'anime',
     'styled',
     'series',
     'about',
     'some',
     'guys',
     'trying',
     'to',
     'save',
     'the',
     'world',
     'probably']

# 4. Remove stopwords
    ['scifi',
     'fantasy',
     'anime',
     'styled',
     'series',
     'guys',
     'trying',
     'save',
     'world',
     'probably']
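
A minimal sketch of these four steps, assuming nltk is installed and the stopwords corpus has been downloaded (nltk.download('stopwords')):

import re
import nltk
from nltk.corpus import stopwords

text = ('A sci-fi fantasy 2.5D anime styled series about '
        'some guys trying to save the world, probably...')

text = text.lower()                    # 1. convert to lowercase
text = re.sub('[^a-z ]', '', text)     # 2. keep only letters and spaces
words = nltk.wordpunct_tokenize(text)  # 3. break text into words
words = [w for w in words
         if w not in stopwords.words("english")]  # 4. drop stopwords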

Python’s nltk.corpus provides 153 English stopwords, which we can use to filter out words that carry little significance.

stopwords - common words which do not carry important significance

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
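
To load the list yourself (assuming the stopwords corpus has been downloaded):

from nltk.corpus import stopwords

len(stopwords.words("english"))  # 153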

Stemming vs Lemmatizing

At this point, there are various things we can do to the data, including stemming and lemmatizing, which would allow us to treat "greets", "greet", and "greeting" as the same word.

stemming - process of reducing inflected (or sometimes derived) words to their word stem, base or root form

lemmatizing - process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form

Applying the PorterStemmer and WordNetLemmatizer to the example, we find:

   raw       stemming  lemmatizing
0  scifi     scifi     scifi
1  fantasy   fantasi   fantasy
2  anime     anim      anime
3  styled    style     styled
4  series    seri      series
5  guys      guy       guy
6  trying    tri       trying
7  save      save      save
8  world     world     world
9  probably  probabl   probably
  1. Some words are untouched:
    • scifi
    • save
    • world
  2. Some words are changed only by stemming:
    • fantasy->fantasi
    • anime->anim
    • styled->style
    • series->seri
    • trying->tri
    • probably->probabl
  3. Some words are changed the same way by both stemming and lemmatizing:
    • guys->guy
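
For reference, a minimal sketch that reproduces the comparison table above, assuming words holds the ten tokens from step 4 and the WordNet corpus has been downloaded (nltk.download('wordnet')):

import pandas as pd
from nltk.stem import PorterStemmer, WordNetLemmatizer

port = PorterStemmer()
wnl = WordNetLemmatizer()

pd.DataFrame({"raw": words,
              "stemming": [port.stem(w) for w in words],
              "lemmatizing": [wnl.lemmatize(w) for w in words]})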

Putting it all together

import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

def text_processing(text, method=None):
    # Lower case
    text = text.lower()

    # Remove non-letters &
    # Tokenize
    words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z_ ]', '', text))

    # Remove stop words
    words = [w for w in words if w not in stopwords.words("english")]

    # Stemming vs Lemmatizing vs do nothing
    if method == "stem":
        port = PorterStemmer()
        words = [port.stem(w) for w in words]
    elif method == "lemm":
        wnl = WordNetLemmatizer()
        words = [wnl.lemmatize(w) for w in words]

    return words
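
For example, applying the function to the example blurb:

blurb = ('A sci-fi fantasy 2.5D anime styled series about '
         'some guys trying to save the world, probably...')

text_processing(blurb, method="stem")
# ['scifi', 'fantasi', 'anim', 'style', 'seri', 'guy', 'tri',
#  'save', 'world', 'probabl']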