Tomorrow I’ll be attending the CSV Conference in Portland and there are many speakers giving talks. For today’s project, I’ll cluster the talks for which the abstract is available (ie. the non-keynote talks).
The Jupyter Notebook for this little project is found here.
Text: Obtaining, preprocessing, and representing
In the first step, we scrape the the speaker names, talk titles, and talk abstracts using BeautifulSoup
. We save the data in a pandas
data frame and preprocess the title+abstract text (ie. tokenize, remove stop words, and stem) using the function defined in Day 49: Text preprocessing. The talks are then represented by the processed words using TF-IDF (Day 64).
Cluster the talks
After creating the document-word TF-IDF matrix, I referred to Jörn Hees’ (2015) tutorial to generate the hierarchical clustering and dendrogram using scipy.cluster.hierarchy.dendrogram
.
I label the clusters as follow:
Colour | Label |
---|---|
Green (1st) | metadata |
Red (1st) | Data collections |
Cyan | City data |
Purple | Government data |
Yellow | Jupyter Notebooks |
Black | Data modelling and design |
Green (2nd) | Data analysis |
Red (2nd) | Looking at data in different ways |