Brooke Watson: Scraping JavaScript websites in R
2018-03-18
1 Background: remixing packages in R
Open source software is made for remixing. When I first switched from STATA to R, I was comfortable using predefined packages and commands, but it quickly became apparent that R’s appeal lies in the power to write custom functions and packages. What’s more, because R is open source, these packages don’t have to be built from scratch. They’re best when they sample from others.
When I saw Rasmus Baath’s amazing beepr package retweeted, I knew I wanted to sample it. Beepr includes one function, beep(), which plays a sound when a script is done running. It’s immediately useful to me, as I constantly run short 2-5 minute jobs, get distracted, and spend 30 minutes away from my code because I don’t realize they’re done. Beepr’s built-in sounds are pretty fun – beep("mario") and beep("treasure") play old-school video game celebrations, and you can pass a link to a .wav file to play any .wav that exists on the internet.
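As a rough sketch of how beep() might sit at the end of a script, assuming beepr is installed: the helper name notify_done() and the stand-in computation are made up for illustration and are not part of beepr.

```r
# Hypothetical helper (not part of beepr): play a sound when a job finishes.
# Falls back to a console message if beepr is missing or playback fails.
notify_done <- function(sound = "mario") {
  if (!requireNamespace("beepr", quietly = TRUE)) {
    message("Job finished (beepr is not installed, so no sound).")
    return(invisible(FALSE))
  }
  ok <- tryCatch({
    beepr::beep(sound)  # `sound` may also be a path or link to a .wav file
    TRUE
  }, error = function(e) FALSE)
  if (!ok) message("Job finished, but the sound could not be played.")
  invisible(ok)
}

# Stand-in for a short job; a real script would do its actual work here.
result <- sum(sqrt(seq_len(1e6)))
notify_done("treasure")
```

Wrapping the call this way is only a convenience so a script still finishes cleanly on a machine without beepr or audio; calling beep() directly on the last line of a script works just as well.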
For my beepr remix, I wanted to use ad libs from rap songs. I often want to shout “GUCCI” or “WE THE BEST” when a long script is done, but I have over the years come to understand that “most people” don’t “appreciate” this kind of action in a “workplace environment.” I can settle for letting DJ Khaled and Gucci Mane shout them for me.
If these had been on the internet in .wav form, I probably wouldn’t have spent any time learning how to scrape audio files from the internet and build them into a custom package. But they weren’t. Thus, BRRR was born.
It can be installed and run with the following commands:
devtools::install_github("brooke-watson/BRRR")
library(BRRR)
# play a simple rap adlib in R
skrrrahh()
For background on what BRRR does and how it got its name, the README is quite comprehensive. Modifying beepr to include different sounds was actually quite straightforward; getting the data was the interesting part. Here, I’ll walk through how I scraped a JavaScript website, extracted and downloaded over 300 mp3 files, and hosted them in a package on GitHub.