Day66: Hierarchical clustering

Posted by csiu on May 1, 2017 | with: 100daysofcode, Machine Learning
  by Celia Siu (in Portland)

Tomorrow I’ll be attending the CSV Conference in Portland, where many speakers will be giving talks. For today’s project, I’ll cluster the talks for which an abstract is available (i.e. the non-keynote talks).

The Jupyter Notebook for this little project is found here.

Text: Obtaining, preprocessing, and representing

In the first step, we scrape the speaker names, talk titles, and talk abstracts using BeautifulSoup. We save the data in a pandas data frame and preprocess the title+abstract text (i.e. tokenize, remove stop words, and stem) using the function defined in Day 49: Text preprocessing. The talks are then represented by the processed words using TF-IDF (Day 64).
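To make the representation step concrete, here is a minimal pure-Python sketch of preprocessing plus TF-IDF. It is not the code from the notebook: the stop-word list, the `preprocess` and `tfidf` helpers, and the example texts are all made up for illustration, stemming is left out, and the weighting is the plain `tf * log(N / df)` form, which may differ from the variant used on Day 64.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the post).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is", "on", "with"}


def preprocess(text):
    """Lowercase, tokenize on letters, and drop stop words (no stemming here)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Return (vocab, matrix); matrix[i][j] is the TF-IDF of term j in document i."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        matrix.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, matrix


# Made-up texts standing in for the title+abstract strings.
talks = [
    "Open data for the city",
    "City data and government data",
    "Jupyter notebooks for data analysis",
]
vocab, matrix = tfidf(talks)
```

Each row of `matrix` is one talk and each column one term in `vocab`. A word that appears in every talk (here, "data") gets a weight of zero, which is why TF-IDF helps the distinctive words drive the clustering.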

Cluster the talks

After creating the document-word TF-IDF matrix, I referred to Jörn Hees’ (2015) tutorial to generate the hierarchical clustering and dendrogram using scipy.cluster.hierarchy.dendrogram.
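The clustering step can be sketched roughly as follows. This is an illustration under assumptions rather than the notebook's code: the toy matrix `X` and the talk labels are invented, and Ward linkage is a guess, since the post does not say which linkage method was used. Passing `no_plot=True` makes `dendrogram` return the tree layout as a dictionary without drawing it.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Toy stand-in for the document-word TF-IDF matrix (rows = talks, columns = terms).
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
])
labels = ["talk A", "talk B", "talk C", "talk D"]  # hypothetical talk names

# Build the merge tree. Ward linkage is an assumption made for this sketch.
Z = linkage(X, method="ward")

# Compute the dendrogram layout; drop no_plot=True to draw it with matplotlib.
tree = dendrogram(Z, labels=labels, no_plot=True)

# Cut the tree into at most two flat clusters to get a cluster id per talk.
clusters = fcluster(Z, t=2, criterion="maxclust")
```

`tree["ivl"]` holds the leaf labels in the order they appear along the dendrogram axis, and `clusters` gives one cluster id per row of `X`, which is a convenient starting point for naming the groups by hand.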

I label the clusters as follows:

Colour         Label
Green (1st)    Metadata
Red (1st)      Data collections
Cyan           City data
Purple         Government data
Yellow         Jupyter Notebooks
Black          Data modelling and design
Green (2nd)    Data analysis
Red (2nd)      Looking at data in different ways