2 JavaScript web scraping in R

# import packages 
library(rvest)
library(stringr)
library(tidyverse)
library(purrr)
library(here)
library(beepr)
library(DT)

Getting the audio files entailed a fair bit of preliminary work. I had to:

  1. Scrape TheRapBoard to find paths to all of the audio files.
  2. Download mp3 files from the website.
  3. Convert them from .mp3 to .wav format.
  4. Pare down from ~300 to ~50 sounds to keep the package (relatively) light.
  5. Put them in a folder that R would recognize and bundle into my package.

A note: web scraping can be a tremendously useful way to extract data from the internet, but it can cause real problems for some websites, so it should be done respectfully and ethically. This post from James Densmore lays out some guidelines for doing it responsibly. Before I did anything, I checked whether The Rap Board had a robots.txt file that disallowed scraping or gave specific instructions on how to scrape the site. I recommend doing this before any web scraping project, and keeping it in mind if you're thinking of reproducing this script.
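If you want to run that check from R itself, the robotstxt package offers one way to do it. A minimal sketch, assuming you're willing to add it as a dependency (it isn't one of the packages loaded above):

# check the site's robots.txt from R using the robotstxt package 
# (an extra dependency, not loaded above)
if(!require(robotstxt)) {
  install.packages("robotstxt")
  library(robotstxt)
}

# returns TRUE if robots.txt permits fetching this path
paths_allowed("http://therapboard.com/")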

2.1 Download PhantomJS using Homebrew

The httr and rvest packages work together to scrape HTML websites. Usually, this works by using a browser extension called SelectorGadget to find all items styled with a particular CSS selector - the actors in an IMDb table, for example. For more, check out the SelectorGadget vignette:

if(!require(rvest)) {
  install.packages("rvest")
  library(rvest)
}
vignette("selectorgadget")
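For a sense of what that usual workflow looks like, here is an illustrative sketch. The URL and the CSS selector are assumptions for example's sake - in practice you'd point SelectorGadget at the page you care about and copy whatever selector it hands you:

# an illustrative rvest + SelectorGadget workflow; the URL and 
# ".titleColumn a" selector are hypothetical placeholders 
page <- read_html("https://www.imdb.com/chart/top/")
page %>% 
  html_nodes(".titleColumn a") %>%  # CSS selector found via SelectorGadget
  html_text()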

Unfortunately, this didn't work for the website I wanted to scrape, which was written primarily in JavaScript. Instead, I adapted Florian Teschner's instructions on using PhantomJS to convert the website into HTML. I wrapped this in a system() call inside RStudio, but it could just as easily be done from the command line.

Before we can do anything, we need to download and unzip PhantomJS. This can be done from the PhantomJS website, but if you have a Mac and insist on staying inside RStudio, below is some circuitous R code you can use to do just that. It first downloads Homebrew, if you don't have it yet, and then uses Homebrew to install PhantomJS. Homebrew is an easy way to install packages onto a Mac from the terminal. PhantomJS calls itself "a headless WebKit scriptable with a JavaScript API", which for our purposes means that it will convert a JavaScript website like The Rap Board into HTML. That makes it easy to get the paths to the .mp3 soundboard files.

# download Homebrew if it doesn't already exist 
if(!dir.exists("/usr/local/Homebrew")) {
  system('ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"')
}

# install PhantomJS using Homebrew 
if(!dir.exists("/usr/local/Cellar/phantomjs")) {
    system("brew install phantomjs") 
}
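Once that finishes, it's worth a quick, optional sanity check that PhantomJS actually made it onto your PATH:

# optional sanity check: Sys.which() returns an empty string if 
# phantomjs isn't on the PATH; the system call prints its version 
Sys.which("phantomjs")
system("phantomjs --version")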

2.2 Writing scrape.js

Next, we'll write the JavaScript code to a file called scrape.js. The script opens the URL in PhantomJS, waits 2.5 seconds for the page to render, and then writes the resulting HTML to 1.html. If you want to scrape a different JavaScript-heavy site, the function in the next section lets you swap in a different URL.

# write the javascript code to a new file, scrape.js

writeLines("var url = 'http://therapboard.com';
var page = new WebPage();
var fs = require('fs');

page.open(url, function (status) {
        just_wait();
});

function just_wait() {
    setTimeout(function() {
               fs.write('1.html', page.content, 'w');
            phantom.exit();
    }, 2500);
}
", con = "scrape.js")

2.3 Scraping TheRapBoard.com

This function takes scrape.js and the URL of our choice (in this case, the URL that hosts the audio files we need) and calls PhantomJS from the command line on a Mac. If you didn't download PhantomJS using Homebrew, you'll need to pass the path to your PhantomJS download as the phantompath argument. If you use Windows, this will also look different.

js_scrape <- function(url = "http://therapboard.com", 
                      js_path = "scrape.js", 
                      phantompath = "/usr/local/Cellar/phantomjs/2.1.1/bin/phantomjs"){
  
  # replace the url in scrape.js with whichever url you want to scrape 
  lines <- readLines(js_path)
  lines[1] <- paste0("var url ='", url ,"';")
  writeLines(lines, js_path)
  
  # call PhantomJS on scrape.js from the command line 
  command <- paste(phantompath, js_path)
  system(command)

}

js_scrape()
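For reference, on Windows the call might look something like this. The path below is a hypothetical placeholder - point it at wherever you unzipped phantomjs.exe:

# hypothetical Windows example; adjust the path to your phantomjs.exe 
js_scrape(url = "http://therapboard.com",
          phantompath = "C:/phantomjs-2.1.1-windows/bin/phantomjs.exe")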

2.4 Extracting audio files

After converting The Rap Board's website from JavaScript into HTML, I could use rvest and dplyr functions to get the mp3 paths into the format I wanted. The code below required some fiddling with stringr and regex to convert a jumble of HTML into a list of file paths. It could be more succinct, but it works.

# read the newly created html file 
html <- read_html("1.html")

setup <- html %>% 
  html_nodes("source") %>%   # grab the <source> tags holding the audio paths
  str_c("") %>%              # coerce each node to a character string
  as_tibble() %>% 
  filter(!str_detect(value, 'ogg"')) %>%   # keep the mp3 sources, drop the ogg ones
  lapply(., str_replace, '<source src=\"', "http://therapboard.com/") %>%   # build full urls
  lapply(., str_split, "\" type")   # split off the trailing html attributes

# keep just the url half of each split string
mp3s = map(seq_along(setup$value), ~setup$value[[.x]][1]) 
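To sanity-check the extraction, peek at the first few elements - each one should be a full url ending in .mp3:

# inspect the first few paths to confirm they look like full mp3 urls 
head(unlist(mp3s), 3)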

2.5 Downloading mp3s

Finally, I had the list of mp3 paths. I wrote a function to download each url into an mp3s/ folder, creating that folder inside the function if it didn't already exist. I used Sys.sleep() to introduce a random lag between downloads, which I hear is best practice.

download_mp3s = function(url) { 
  # create a place to put the files if it doesn't exist yet 
  if(!dir.exists("mp3s")) {dir.create("mp3s")}
  
  # turn the url into a destination path inside mp3s/ 
  destpath = stringr::str_replace(url, "http://therapboard.com/audio/", "mp3s/")
  
  # mode = "wb" keeps the binary files intact on Windows 
  download.file(url, destfile = destpath, mode = "wb")
  
  # pause a random 1-3 seconds between downloads 
  Sys.sleep(sample(seq(1, 3, by = 0.001), 1))
} 
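One optional safeguard: a single failed download will stop lapply() mid-run, so you could wrap the function with purrr::possibly() to skip failures instead. A sketch, if you'd rather be defensive:

# optional: a wrapped version that returns NULL on a failed download 
# instead of throwing an error and halting the loop 
safe_download <- purrr::possibly(download_mp3s, otherwise = NULL)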

Then, I waited. Fittingly, I used beepr to alert me when my script was done.

lapply(mp3s, download_mp3s)
beep("mario")