Day08: Flesch Reading Ease on Reddit

Posted by csiu on March 4, 2017 | with: 100daysofcode, Text mining,

In continuation with yesterday’s post, I want to get a sense of the reading level of Reddit.

The Jupyter Notebook for this little project is found here. After considering the layout of yesterday’s post, I prefer the minimalistic style whereby I mention the highlights of my analysis. If you want to see the code, you can find it in the Jupyter Notebook.

Upgrading the Flesch Reading Ease function

There were a number of problems with the flesch_reading_ease() function defined yesterday. For one, the code would break if it encountered a word that did not exist. I therefore modified the function by adding the following:

  • Made it so that words not in the dictionary are skipped so that our program does not break
  • Ensure there is no division by 0, returning None type instead. (Note: this is integrated with count_syllables() function)
  • Add show_details argument to show the number of sentences, words, syllables, and unknown words making up the Flesch Reading Ease score for debugging
def flesch_reading_ease(text, show_details = False):
    ## Preprocessing
    text = text.lower()

    sentences = nltk.tokenize.sent_tokenize(text)
    words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z_ ]', '',text))
    syllables = [count_syllables(word) for word in words]

    ## Count sentences, words, and syllables
    ## Skip words that do not exist in dictionary
    num_sentences = len(sentences)
    num_unknown_words = syllables.count(-1)
    num_words = len(words) - num_unknown_words
    num_syllables = sum([s for s in syllables if s > 0])

    ## Calculate
    if (num_sentences == 0 or num_words == 0): return None   
    fre = readability_ease(num_sentences, num_words, num_syllables)

    if show_details:
        return {
            "sentence" : num_sentences,
            "word" : num_words,
            "syllables" : num_syllables,
            "unknown": num_unknown_words
        }
    return(fre)

Same thread, different question

In Day03, we access Reddit comments from an API to do sentiment analysis. Let’s now find the Flesch Reading Ease (FRE) of the Men of Reddit, what’s the biggest “I’m a princess” red flag? (thewhackcat, 2017) thread.

We do so by calculating the FRE score for each comment and then taking the average of all scores. As a result, we find the comments are at a 7th grade level (FRE = 71.9) which means the comments are “Fairly easy to read”.

More threads

What about other threads? I selected a thread from a few popular subreddits:

## Parse reddit comments and calculate FRE for 75 random comments
## Note: the number of scores is equal across categories
df = {}
for k in urls.keys():
    data = parse_reddit(reddit, urls[k])
    fres = [flesch_reading_ease(text) for text in data]

    fres = [fre for fre in fres if fre != None]
    df[k] = random.sample(k=75, population=fres)

df = pd.DataFrame(df)

      0-30          30-50     50-60     60-70   70-80   80-90   90+
      College grad  College   10-12th   8-9th   7th     6th     5th

From the boxplot, we find:

  • People’s shower thoughts are relatively comprehensive
  • ELIF, which stands for “explain like I’m five”, would not be understood by 5 year olds
  • Asking Reddit is easily understood by a younger audience