It’s been a long and full day, but I wanted to recreate the first image from the following blog post by David Robinson using my Kickstarter data.
New blog post: Examining the arc of 100,000 stories: a tidy analysis https://t.co/3kUzuxdDo8 #rstats pic.twitter.com/4XhBb5ixzy
— David Robinson (@drob) April 26, 2017
[He performs] a simple analysis, examining what words tend to occur at particular points within a story, including words that characterize the beginning, middle, or end.
Workflow:
- Get data & preprocess the text by running this Jupyter Notebook
- Further wrangle the data and generate the plot by running this R script
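For a rough idea of what the wrangling step computes, here is a minimal Python sketch of a median word position. It is not the code from the notebook or the R script: the function name `median_word_positions`, the whitespace tokenizer, and the definition of position as word index divided by word count are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def median_word_positions(texts, min_count=1):
    """Return {word: median relative position} across all texts.

    Position is assumed to be (1-based word index) / (words in that text),
    so 1.0 means the word was the last one in its text.
    """
    positions = defaultdict(list)
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        n = len(tokens)
        if n == 0:
            continue  # skip empty texts
        for i, token in enumerate(tokens):
            positions[token].append((i + 1) / n)
    # Keep only words seen at least min_count times
    return {
        word: median(pos)
        for word, pos in positions.items()
        if len(pos) >= min_count
    }


# Toy stand-in for project descriptions, not real Kickstarter data
example = median_word_positions(["help us launch", "help fund launch"])
```

Words with a low median position tend to open a text, and words with a median near 1 tend to close it; in the toy example, "help" lands near the start and "launch" at the end.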
Looking back, the median word positions might have come out slightly different had I not removed stop words and digits. The purpose of the text preprocessing was to standardize the words so that variants of the same word, for instance “messaging” and “messages”, are treated as one term. It's a trade-off.
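To see why the positions can shift, here is a small Python sketch that compares a word's relative position before and after filtering. The stop-word set, the sample sentence, and the helper names are made up for illustration and are not taken from the notebook.

```python
# Tiny illustrative stop-word set (assumption), not the list used in the notebook
STOP_WORDS = {"the", "a", "to", "of", "and"}


def relative_position(tokens, word):
    """1-based index of the first occurrence of word, divided by token count."""
    return (tokens.index(word) + 1) / len(tokens)


def strip_stop_words_and_digits(tokens):
    """Drop stop words and purely numeric tokens."""
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]


raw = "thank you to all of the 500 backers".split()
clean = strip_stop_words_and_digits(raw)

before = relative_position(raw, "all")    # 4th of 8 tokens
after = relative_position(clean, "all")   # 3rd of 4 tokens
```

In this toy sentence "all" moves from the midpoint to three quarters of the way through once the filtered tokens are gone, which is the kind of drift the median positions may carry.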