Day41: Getting topics

Posted by csiu on April 6, 2017 | with: 100daysofcode

Today I extract the annotated categories from the downloaded Web Robots Kickstarter information.

Kickstarter.csv     Kickstarter012.csv  Kickstarter024.csv  Kickstarter036.csv
Kickstarter001.csv  Kickstarter013.csv  Kickstarter025.csv  Kickstarter037.csv
Kickstarter002.csv  Kickstarter014.csv  Kickstarter026.csv  Kickstarter038.csv
Kickstarter003.csv  Kickstarter015.csv  Kickstarter027.csv  Kickstarter039.csv
Kickstarter004.csv  Kickstarter016.csv  Kickstarter028.csv  Kickstarter040.csv
Kickstarter005.csv  Kickstarter017.csv  Kickstarter029.csv  Kickstarter041.csv
Kickstarter006.csv  Kickstarter018.csv  Kickstarter030.csv  Kickstarter042.csv
Kickstarter007.csv  Kickstarter019.csv  Kickstarter031.csv  Kickstarter043.csv
Kickstarter008.csv  Kickstarter020.csv  Kickstarter032.csv  _version.2017-03-15
Kickstarter009.csv  Kickstarter021.csv  Kickstarter033.csv
Kickstarter010.csv  Kickstarter022.csv  Kickstarter034.csv
Kickstarter011.csv  Kickstarter023.csv  Kickstarter035.csv

Consistency in file structure

Before I begin, I want to ensure that the structure of the 44 Kickstarter data files is the same, so that any method I apply to one file also applies to the others. A quick look at the headers with head Kick*csv | grep "^id" | uniq shows that all headers are the same:

id,photo,name,blurb,goal,pledged,state,slug,disable_communication,country,currency,currency_symbol,currency_trailing_code,deadline,state_changed_at,created_at,launched_at,staff_pick,backers_count,static_usd_rate,usd_pledged,creator,location,category,profile,spotlight,urls,source_url,friends,is_starred,is_backing,permissions
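
The same check can be sketched in Python, for cases where the shell one-liner is not convenient. This is only a minimal sketch: the helper names read_header and distinct_headers are made up for illustration, and the file pattern Kickstarter*.csv is an assumption based on the directory listing above.

```python
import glob


def read_header(path):
    """Return the first line of a file, without the trailing newline."""
    with open(path, encoding="utf-8") as handle:
        return handle.readline().rstrip("\r\n")


def distinct_headers(paths):
    """Collect the set of distinct header lines across the given files."""
    return {read_header(path) for path in paths}


if __name__ == "__main__":
    # Assumed file pattern, matching the directory listing above.
    headers = distinct_headers(sorted(glob.glob("Kickstarter*.csv")))
    print(len(headers))  # 1 would mean every file shares the same header
```

A set is used so that identical headers collapse into a single entry, which plays the role of uniq in the shell version.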

When we count the number of lines minus the headers, there are 185,367 projects scraped from Kickstarter.
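
One way such a count could be reproduced in Python is sketched below. The function names count_data_rows and count_projects are hypothetical. Note that this version counts CSV records rather than raw lines, so it would only agree with a plain line count if no quoted field contains an embedded newline, which is not verified here.

```python
import csv


def count_data_rows(path):
    """Count CSV records in one file, excluding the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header, if the file has one
        return sum(1 for _ in reader)


def count_projects(paths):
    """Total number of data rows across all files (no de-duplication)."""
    return sum(count_data_rows(path) for path in paths)
```

Calling count_projects on the list of Kickstarter CSV paths would give the total across files. Projects that appear in more than one file would be counted more than once.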

Category labels from column “category”

One of the things I plan to do is topic modelling. But before I do this, I want some sense of what topic each project is about. In the Web Robots data, each entry of the “category” column is a string representation of a dictionary:

{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/film%20&%20video/television"}},"color":16734574,"parent_id":11,"name":"Television","id":303,"position":17,"slug":"film & video/television"}

According to Jacob Gabrielson, 2009, you can convert a string representation of a dictionary to a dictionary with ast.literal_eval (which is safer than using eval).

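import ast

# df is assumed to be a pandas DataFrame read from one of the Kickstarter CSV files.
# Parse each "category" string into a dict and keep its "name" field as the topic.
df['topic'] = df['category'].map(lambda x: ast.literal_eval(x)['name'])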
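
To make the conversion concrete, here is a small sketch that applies ast.literal_eval to the sample “category” entry shown earlier. It also tries json.loads as an alternative, on the assumption that the entries are JSON (the sample looks like it). The helper topic_from_category is a hypothetical name, not part of the original code.

```python
import ast
import json

# The sample "category" entry shown above, as a raw string.
raw = (
    '{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/'
    'categories/film%20&%20video/television"}},"color":16734574,'
    '"parent_id":11,"name":"Television","id":303,"position":17,'
    '"slug":"film & video/television"}'
)

category = ast.literal_eval(raw)
print(category["name"])  # Television
print(category["slug"])  # film & video/television

# For this sample, json.loads produces the same dictionary.
assert json.loads(raw) == category


def topic_from_category(text):
    """Hypothetical helper: extract the "name" field from a JSON category string."""
    return json.loads(text)["name"]
```

If the entries really are JSON, json.loads may be the more robust choice: ast.literal_eval raises a ValueError on JSON literals such as true, false, or null, because they are not Python literals. Whether any “category” entry contains such values is not checked here.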


Originally I planned to load the data into a Postgres database for today’s entry, but I kept getting errors. It’s late, so I will create the new database and tables and load the data tomorrow. -- csiu