Day49: Text preprocessing

Posted by csiu on April 14, 2017 | with: 100daysofcode

I eventually want to do text analysis with the Kickstarter data, but first I’ll need to do some data cleaning and text preprocessing.

The Jupyter Notebook for this little project is found here.

Select data for analysis

The data for this project has been loaded into a PostgreSQL database; the columns of the database are described in the Jupyter Notebook of Day44. From the data, I want to use the "name" and "blurb" columns.

   id          name                                   blurb
0  1312331512  Otherkin The Animated Series           We have a fully developed 2D animated series t...
1  80827270    Paradigm Spiral - The Animated Series  A sci-fi fantasy 2.5D anime styled series abou...
2  737219121   I'm Sticking With You.                 A film created entirely out of paper, visual e...

Both columns are text fields, and I want to combine them and treat the result as a single "document".

document - collection of text

In PostgreSQL, you can concatenate columns with a separator using the concat_ws function. Its first argument is the separator (here, a single space).

SELECT id, concat_ws(' ', name, blurb) AS document FROM info
   id          document
0  1312331512  Otherkin The Animated Series We have a fully d...
1  80827270    Paradigm Spiral - The Animated Series A sci-fi...
2  737219121   I'm Sticking With You. A film created entirely...
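
As a rough sketch, the query result can be pulled into pandas like so (the connection details below are hypothetical; adjust them to your own setup):

import pandas as pd
import psycopg2

conn = psycopg2.connect(dbname="kickstarter")  # hypothetical database name
df = pd.read_sql("SELECT id, concat_ws(' ', name, blurb) AS document FROM info",
                 conn)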

Text preprocessing

In text preprocessing, we do the following:

# Given some text
# (https://www.kickstarter.com/projects/44334963/otherkin-the-animated-series/comments)
'A sci-fi fantasy 2.5D anime styled series about some guys trying to save the world, probably...'

# 1. Convert text to lowercase
'a sci-fi fantasy 2.5d anime styled series about some guys trying to save the world, probably...'

# 2. Remove non-letters (such as digits and punctuation)
'a scifi fantasy d anime styled series about some guys trying to save the world probably'

# 3. Tokenize (i.e. break text into meaningful elements such as words)
    ['a',
     'scifi',
     'fantasy',
     'd',
     'anime',
     'styled',
     'series',
     'about',
     'some',
     'guys',
     'trying',
     'to',
     'save',
     'the',
     'world',
     'probably']

# 4. Remove stopwords
    ['scifi',
     'fantasy',
     'anime',
     'styled',
     'series',
     'guys',
     'trying',
     'save',
     'world',
     'probably']
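
A minimal sketch of these four steps, assuming nltk is installed and the stopwords corpus has been downloaded (nltk.download('stopwords')):

import re
import nltk
from nltk.corpus import stopwords

text = ('A sci-fi fantasy 2.5D anime styled series about '
        'some guys trying to save the world, probably...')

text = text.lower()                    # 1. convert to lowercase
text = re.sub('[^a-z ]', '', text)     # 2. keep only letters and spaces
words = nltk.wordpunct_tokenize(text)  # 3. break text into words
words = [w for w in words
         if w not in stopwords.words("english")]  # 4. drop stopwords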

Python’s nltk.corpus provides 153 English stopwords, which we can use to filter out words that carry little significance.

stopwords - common words which do not carry important significance

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
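
To load the list yourself (assuming the stopwords corpus has been downloaded):

from nltk.corpus import stopwords

len(stopwords.words("english"))  # 153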

Stemming vs Lemmatizing

At this point, there are various things we can do to the data, including stemming and lemmatizing, which would allow us to treat "greets", "greet", and "greeting" as the same word.

stemming - process of reducing inflected (or sometimes derived) words to their word stem, base or root form

lemmatizing - process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form

Applying the PorterStemmer and WordNetLemmatizer to the example, we find:

   raw       stemming  lemmatizing
0  scifi     scifi     scifi
1  fantasy   fantasi   fantasy
2  anime     anim      anime
3  styled    style     styled
4  series    seri      series
5  guys      guy       guy
6  trying    tri       trying
7  save      save      save
8  world     world     world
9  probably  probabl   probably
  1. Some words are untouched:
    • scifi
    • save
    • world
  2. Some words are changed only by stemming:
    • fantasy->fantasi
    • anime->anim
    • styled->style
    • series->seri
    • trying->tri
    • probably->probabl
  3. Some words are changed the same way by both stemming and lemmatizing:
    • guys->guy
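
For reference, a minimal sketch that reproduces the comparison table above, assuming words holds the ten tokens from step 4 and the WordNet corpus has been downloaded (nltk.download('wordnet')):

import pandas as pd
from nltk.stem import PorterStemmer, WordNetLemmatizer

port = PorterStemmer()
wnl = WordNetLemmatizer()

pd.DataFrame({"raw": words,
              "stemming": [port.stem(w) for w in words],
              "lemmatizing": [wnl.lemmatize(w) for w in words]})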

Putting it all together

import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

def text_processing(text, method=None):
    # Lower case
    text = text.lower()

    # Remove non-letters &
    # Tokenize
    words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z_ ]', '', text))

    # Remove stop words
    words = [w for w in words if w not in stopwords.words("english")]

    # Stemming vs Lemmatizing vs do nothing
    if method == "stem":
        port = PorterStemmer()
        words = [port.stem(w) for w in words]
    elif method == "lemm":
        wnl = WordNetLemmatizer()
        words = [wnl.lemmatize(w) for w in words]

    return words
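
For example, applying the function to the example blurb:

blurb = ('A sci-fi fantasy 2.5D anime styled series about '
         'some guys trying to save the world, probably...')

text_processing(blurb, method="stem")
# ['scifi', 'fantasi', 'anim', 'style', 'seri', 'guy', 'tri',
#  'save', 'world', 'probabl']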