Day66: Hierarchical clustering

Tomorrow I’ll be attending the CSV Conference in Portland and there are many speakers giving talks. For today’s project, I’ll cluster the talks for which the abstract is available (ie. the non-keynote talks).

The Jupyter Notebook for this little project is found here.

Text: Obtaining, preprocessing, and representing

In the first step, we scrape the the speaker names, talk titles, and talk abstracts using BeautifulSoup. We save the data in a pandas data frame and preprocess the title+abstract text (ie. tokenize, remove stop words, and stem) using the function defined in Day 49: Text preprocessing. The talks are then represented by the processed words using TF-IDF (Day 64).

Cluster the talks

After creating the document-word TF-IDF matrix, I referred to Jörn Hees’ (2015) tutorial to generate the hierarchical clustering and dendrogram using scipy.cluster.hierarchy.dendrogram.

I label the clusters as follow:

Colour	Label
Green (1st)	metadata
Red (1st)	Data collections
Cyan	City data
Purple	Government data
Yellow	Jupyter Notebooks
Black	Data modelling and design
Green (2nd)	Data analysis
Red (2nd)	Looking at data in different ways