Day07: Flesch reading ease

Posted by csiu on March 3, 2017 | with: 100daysofcode, Text mining

DAY 07 - Mar 3, 2017

The Flesch reading ease score (mentioned yesterday) got me thinking. How would different passages of text rate in terms of ease of understandability?

The Jupyter Notebook for this little project is found here.

Flesch reading ease

Flesch reading ease is a measure of how difficult a passage in English is to understand. The formula for the readability ease measure is calculated as follows:

RE = 206.835 – (1.015 x ASL) – (84.6 x ASW)

where ASL is the average sentence length (total words/total sentences) and ASW is the average number of syllables per word (total syllables/total words).

def readability_ease(num_sentences, num_words, num_syllables):
    asl = num_words / num_sentences
    asw = num_syllables / num_words

    return(206.835 - (1.015 * asl) - (84.6 * asw))

The score ranges from 0 to 100 and a higher scores indicate material that is easier to read.

def readability_ease_interpretation(x):
    if 90 <= x:
        res = "5th grade] "
        res += "Very easy to read. Easily understood by an average 11-year-old student."

    elif 80 <= x < 90:
        res = "6th grade] "
        res += "Easy to read. Conversational English for consumers."

    elif 70 <= x < 80:
        res = "7th grade] "
        res += "Fairly easy to read."

    elif 60 <= x < 70:
        res = "8th & 9th grade] "
        res += "Plain English. Easily understood by 13- to 15-year-old students."

    elif 50 <= x < 60:
        res = "10th to 12th grade] "
        res += "Fairly difficult to read."

    elif 30 <= x < 50:
        res = "College] "
        res += "Difficult to read."

    elif 0 <= x < 30:
        res = "College Graduate] "
        res += "Very difficult to read. Best understood by university graduates."

    print("[{:.1f}|{}".format(x, res))

Test case

In this test case, we have 12 words, 14 syllables, and 3 sentences:

text = "Hello world, how are you? I am great. Thank you for asking!"

Counting words

Counting words is easy.

import nltk
import re

text = text.lower()

words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z_ ]', '',text))
num_words = len(words)

print(words)
print(num_words)
['hello', 'world', 'how', 'are', 'you', 'i', 'am', 'great', 'thank', 'you', 'for', 'asking']
12

Counting syllables

Counting syllables is a bit more tricky. According to Using Python and the NLTK to Find Haikus in the Public Twitter Stream by Brandon Wood (2013), the Carnegie Mellon University (CMU) Pronouncing Dictionary corpora contain the syllable count for over 125,000 (English) words and thus could be used to count syllables.

from nltk.corpus import cmudict
from curses.ascii import isdigit

d = cmudict.dict()

def count_syllables(word):
    return([len(list(y for y in x if isdigit(y[-1]))) for x in d[word.lower()]][0])
print("Number of syllables per word", "="*28, sep="\n")
for word in words:
    num_syllables = count_syllables(word)
    print("{}: {}".format(word, num_syllables))
Number of syllables per word
============================
hello: 2
world: 1
how: 1
are: 1
you: 1
i: 1
am: 1
great: 1
thank: 1
you: 1
for: 1
asking: 2

Counting sentences

Counting sentences was already done in Day03.

sentences = nltk.tokenize.sent_tokenize(text)
num_sentences = len(sentences)

print("Number of sentences: {}".format(num_sentences), "="*25, sep="\n")
for sentence in sentences:
    print(sentence)
Number of sentences: 3
=========================
hello world, how are you?
i am great.
thank you for asking!

Putting it all together

def flesch_reading_ease(text):
    ## Preprocessing
    text = text.lower()

    sentences = nltk.tokenize.sent_tokenize(text)
    words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z_ ]', '',text))

    ## Count
    num_sentences = len(sentences)
    num_words = len(words)
    num_syllables = sum([count_syllables(word) for word in words])

    ## Calculate
    fre = readability_ease(num_sentences, num_words, num_syllables)
    return(fre)

We find the test text was constructed at a 5th grade level.

fre = flesch_reading_ease(text)
readability_ease_interpretation(fre)
[104.1|5th grade] Very easy to read. Easily understood by an average 11-year-old student.

It’s strange the score is above 100.

What about Shakespeare?

# (As You Like it Act 2, Scene 7)
text = """
All the world's a stage,
and all the men and women merely players.
They have their exits and their entrances;
And one man in his time plays many parts
"""

fre = flesch_reading_ease(text)
readability_ease_interpretation(fre)
[87.1|6th grade] Easy to read. Conversational English for consumers.

… Thus, Shakespeare is be doable in high school.