In the previous post, I took rating data from The Goulet Pen website to look for the highest rated item matching a shortlist of items. The concept is simple, but getting the data required a bit more work.
Normally I would use Hadley Wickham’s rvest package to scrape websites with R, but this method did not work: I could get the item names, but not the item ratings.
library(rvest)
read_html(THE_GOULET_PEN_URL) %>%
html_nodes("div.mz-productlisting") %>%
html_text()
I was confused.
When I opened the URL in a browser and inspected the elements, I could clearly see the ratings embedded in the HTML. Not knowing any better, I then tried Python’s Beautiful Soup.
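# Python 3: urlopen lives in urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
f = urlopen(THE_GOULET_PEN_URL)
# Name the parser explicitly so results do not depend on what is installed
soup = BeautifulSoup(f, "html.parser")
soup.find_all("div", class_="mz-productlisting")[0].get_text()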
Still nothing.
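Eventually, after opening the URL in a browser with JavaScript disabled and not seeing any ratings, I realized the ratings were generated dynamically by JavaScript. Both rvest and Beautiful Soup only parse the HTML the server returns; neither runs the page's scripts, so the ratings never appeared.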
Now that I knew what to look for (and after finding Ross Kippenbrock’s post on Shot Blocking in the NHL Playoffs), I was able to scrape the JavaScript-generated content using selenium and chromedriver.
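To make the idea concrete, here is a minimal sketch in Python of how selenium and chromedriver can be combined for this kind of page. It is not necessarily the code I ended up with. It assumes Selenium 4 with Chrome available, that the listings still live in `div.mz-productlisting`, and that waiting for that element is enough for the ratings to have rendered (a rating-specific selector may be needed in practice). The names `render_page`, `listing_texts` and `ListingTextParser` are made up for illustration, and the parsing step uses the standard library's `html.parser`, though Beautiful Soup would work just as well on the rendered HTML.

```python
from html.parser import HTMLParser


class ListingTextParser(HTMLParser):
    """Collect the text inside each <div> that carries a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0       # <div> nesting depth inside the current listing
        self.chunks = []     # text pieces of the listing being read
        self.listings = []   # text of every finished listing

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1              # nested <div> inside a listing
        elif self.css_class in classes:
            self.depth = 1               # a new listing starts here
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:          # the listing's own </div>
                self.listings.append(" ".join(self.chunks))

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def listing_texts(html, css_class="mz-productlisting"):
    """Return the text of every matching listing found in an HTML string."""
    parser = ListingTextParser(css_class)
    parser.feed(html)
    parser.close()
    return parser.listings


def render_page(url, css="div.mz-productlisting", timeout=10):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so listing_texts() can be tried without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: once a listing element exists, the ratings are present too.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css))
        )
        return driver.page_source
    finally:
        driver.quit()
```

With those two pieces, `listing_texts(render_page(THE_GOULET_PEN_URL))` should return one string per product listing, this time taken from the HTML as the browser sees it after the page's scripts have run.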