Scraping HTML & JavaScript-generated webpages

Posted by csiu on May 27, 2016 | with: Setup
(Header image by RUINS@pixabay)

In the previous post, I took rating data from The Goulet Pen website to look for the highest rated item matching a shortlist of items. The concept is simple, but getting the data required a bit more work.

Normally I would use Hadley Wickham’s rvest package to scrape websites with R, but I found that this method did not work: I could get the item names, but not the item ratings.

library(rvest)

# Grab every product listing block and pull out its text
read_html(THE_GOULET_PEN_URL) %>%
  html_nodes("div.mz-productlisting") %>%
  html_text()

I was confused.

When I opened the URL in a browser and inspected the elements, I could clearly see the ratings embedded in the HTML. And, not knowing any better, I then proceeded to try Python's Beautiful Soup.

from bs4 import BeautifulSoup
from urllib import urlopen

# Fetch the raw page and parse it
f = urlopen(THE_GOULET_PEN_URL)
soup = BeautifulSoup(f, "html.parser")

# Text of the first product listing
soup.findAll("div", class_="mz-productlisting")[0].get_text()

Still nothing.

Eventually, after opening the URL in a JavaScript-disabled browser and not seeing any ratings, I realized the ratings were dynamically generated by JavaScript.
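
The same check can be done from code: fetch the page without running any JavaScript and look for the rating markup that the inspector shows. This is a rough sketch using requests; the "ratings-snippet" string is a hypothetical stand-in for whatever the inspector actually displays, not the page's real class name.

import requests

# Fetch the page the way a scraper does: no JavaScript runs here, so this is
# the same HTML that rvest and Beautiful Soup were working with.
raw_html = requests.get(THE_GOULET_PEN_URL).text

# The product listing markup is present in the static HTML...
print("mz-productlisting" in raw_html)   # True

# ...but the rating markup seen in the inspector is not
# ("ratings-snippet" is a hypothetical placeholder class name).
print("ratings-snippet" in raw_html)     # False -> ratings are injected client-side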

Now that I knew what to look for (and having found Ross Kippenbrock’s post on Shot Blocking in the NHL Playoffs), I was able to scrape the JavaScript-generated content using selenium and chromedriver.
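
For reference, the selenium-and-chromedriver version looks roughly like this. It is a minimal sketch against selenium's current Python API, reusing the URL placeholder and the mz-productlisting selector from the snippets above, and it assumes chromedriver is installed and on the PATH; the details of the actual script may differ.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()     # launches Chrome via chromedriver
driver.get(THE_GOULET_PEN_URL)

# Selenium drives a real browser, so the JavaScript has run by the time the DOM
# is queried; if the ratings load asynchronously, an explicit WebDriverWait on a
# rating element may still be needed before reading the text.
for listing in driver.find_elements(By.CSS_SELECTOR, "div.mz-productlisting"):
    print(listing.text)         # item name, and this time the rating as well

driver.quit()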