Day59: R::rvest for webscraping

I’ll be teaching an R workshop in a couple days based on the R for Reproducible Scientific Analysis Software Carpentry lesson plan (lessons 1-10). To get a copy of the R commands (so that I can “try” them out), I could either (A) copy everything by hand or (B) automatically scrape the commands.

If you know me, then the choice is pretty obvious…

In this post, I scrape the R commands from the lesson plans using R and rvest. The script R Markdown file for this little project is here and the generated markdown file (containing the parsed results) is here.

Importance of having data

According to Yaser Abu-Mostafa from Learning from Data, when applying machine learning algorithms, you need 3 things:

the pattern must exist,
you cannot pin it down mathematically, and
you must have data

And without data, you can’t do any learning.

When working with data, data can come in many forms, such as in tables, from databases, API calls, JSON files, or webpages. In this post, the data is embedded on webpages and I want to obtain (“scrape”) this data for subsequent use.

rvest for web scraping

rvest is a simple web scraping package for R writting by Hadley Wickham.

library(rvest)

read_html() is used to read the html from a given url and html_nodes() is used to select the particular nodes of an HTML document. I approached the problem as follows:

Get the syllabus
Use the syllabus to find the child lesson plan pages
Parse out the R commands from the lesson plan pages

In this post, I focus on #3. For details on #1 and #2, checkout the R Markdown file.

Parsing information

When parsing for data, you need to identifying where in the HTML is the data. For instance, what are the HTML tags around the data? What is the “class” or “id” attribute?

Fortunately, the parsing of the R for Reproducible Scientific Analysis is easy as all the R commands are tagged by <div style="r highlighter-rouge"> elements.

# Define function to parse R commands
get_cmds <- function(href, is_tidy=TRUE) {
  child_url <- paste0(main_url, href)

  # Parse commands
  cmds <-
    read_html(child_url) %>%
    html_nodes("div.r,highlighter-rouge") %>%
    html_nodes("code") %>%
    html_text()

  # Make it so print out is pretty
  if (is_tidy) {
    lapply(cmds, function(cmd){
      paste0("    ", cmd) %>%
      {gsub("\n", "\n    ", .)} %>%
      {gsub("\n    $", "\n", .)}
    }) %>%
      unlist()
  } else {
    cmds
  }
}

With the function to do the parsing defined, I can next purrr::map this function to each of the lesson plan pages and then display the output i.e. parsed out R commands from the lesson plan.

# Get/parse R commands from each lesson link
syllabus %>%
  mutate(
    cmds = map(href, get_cmds)
  )

Success!

For the parsed out R commands, checkout the Markdown file.