Today I extract the annotated categories from the downloaded Web Robots Kickstarter information.
Kickstarter.csv Kickstarter012.csv Kickstarter024.csv Kickstarter036.csv
Kickstarter001.csv Kickstarter013.csv Kickstarter025.csv Kickstarter037.csv
Kickstarter002.csv Kickstarter014.csv Kickstarter026.csv Kickstarter038.csv
Kickstarter003.csv Kickstarter015.csv Kickstarter027.csv Kickstarter039.csv
Kickstarter004.csv Kickstarter016.csv Kickstarter028.csv Kickstarter040.csv
Kickstarter005.csv Kickstarter017.csv Kickstarter029.csv Kickstarter041.csv
Kickstarter006.csv Kickstarter018.csv Kickstarter030.csv Kickstarter042.csv
Kickstarter007.csv Kickstarter019.csv Kickstarter031.csv Kickstarter043.csv
Kickstarter008.csv Kickstarter020.csv Kickstarter032.csv _version.2017-03-15
Kickstarter009.csv Kickstarter021.csv Kickstarter033.csv
Kickstarter010.csv Kickstarter022.csv Kickstarter034.csv
Kickstarter011.csv Kickstarter023.csv Kickstarter035.csv
Consistency in file structure
Before I begin, I want to ensure the structure of the 44 Kickstarter data files are all the same such that when I apply my methods to one file, these methods also apply to the other files. A quick look at the header head Kick*csv | grep "^id" | uniq
shows that all headers are the same:
id,photo,name,blurb,goal,pledged,state,slug,disable_communication,country,currency,currency_symbol,currency_trailing_code,deadline,state_changed_at,created_at,launched_at,staff_pick,backers_count,static_usd_rate,usd_pledged,creator,location,category,profile,spotlight,urls,source_url,friends,is_starred,is_backing,permissions
When we count the number of lines minus the header, there are 185,367 projects scrapped from Kickstarter.
Category labels from column “category”
One of the things I plan to do on is topic modelling. But before I do this, I want some sense of what topic each project is about. In the Web Robots data, we find the entries of column “category” as a string representation of a dictionary.
{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/film%20&%20video/television"}},"color":16734574,"parent_id":11,"name":"Television","id":303,"position":17,"slug":"film & video/television"}
According to Jacob Gabrielson, 2009, you can convert a string representation of a dictionary to a dictionary with ast.literal_eval
(which is safer than using eval
).
import ast
df['topic'] = df['category'].map(lambda x: ast.literal_eval(x)['name'])
Originally I plan to load the data into a Postgres database for today’s entry, but I kept on getting errors. It’s late and so I will do the creating of a new database, tables, and loading of data tomorrow. -- csiu