Warning: this blog post touches on a rather technical subject and is not aimed at beginners. If you have ever built a web scraping script, in R or any other programming language, you know how long it can take to scrape a sizeable number of web pages. The same goes for interacting with APIs. What if I told you there is a solution: asynchronous requests.
In this blog post, we will scrape the body content of 10 articles on the NBC News website. I have listed them here and loaded rvest, a popular R package for web scraping.
pages <- c(
  'https://www.nbcnews.com/politics/2020-election/democrats-are-leading-polls-means-it-s-time-them-panic-n1239378',
  'https://www.nbcnews.com/politics/2020-election/biden-says-he-spoke-jacob-blake-praises-family-s-resilience-n1239230',
  'https://www.nbcnews.com/news/nbcblk/biden-arrive-kenosha-just-city-achieves-fragile-calm-n1239156',
  'https://www.nbcnews.com/politics/2020-election/biden-trump-put-different-visions-vivid-display-back-campaign-trail-n1239259',
  'https://www.nbcnews.com/politics/2020-election/democrats-requesting-absentee-ballots-outnumber-gop-key-swing-states-n1239361',
  'https://www.nbcnews.com/politics/2020-election/trump-often-sees-american-landscape-losers-suckers-n1239304',
  'https://www.nbcnews.com/politics/2020-election/war-veteran-democrats-slam-trump-over-report-he-called-u-n1239324',
  'https://www.nbcnews.com/politics/congress/congresswoman-blocked-touring-mail-facility-postal-service-police-n1239359',
  'https://www.nbcnews.com/politics/2020-election/trump-campaign-seeks-block-navajo-nation-voters-lawsuit-over-arizona-n1239328',
  'https://www.nbcnews.com/politics/donald-trump/white-house-denies-report-claiming-trump-called-dead-american-soldiers-n1239267'
)

library(rvest)
In the following chunk of code, we scrape these URLs with a simple for loop and some functions from the rvest (and xml2) packages.
responses <- list()
bodies <- list()

for (i in seq_along(pages)) {
  # Request and parse the page, then extract the article body text
  responses[[i]] <- read_html(pages[i])
  bodies[[i]] <- html_text(html_node(responses[[i]], '.article-body__content'))
}
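To see exactly how long this takes on your machine, here is a minimal sketch of my own (not part of the original code) that re-runs the same scraping job inside system.time():

library(rvest)

# A minimal sketch: time the blocking, sequential approach so you can
# compare it with the asynchronous solutions further down
timing <- system.time({
  bodies_seq <- lapply(pages, function(u) {
    html_text(html_node(read_html(u), '.article-body__content'))
  })
})

timing["elapsed"]  # wall-clock seconds for ten sequential requests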
Now, we only scraped ten URLs, and it already took a couple of seconds. Imagine scraping thousands of URLs: it takes hours and hours. There is, however, a faster way to make requests. Enter: asynchrony.
Asynchrony allows pieces of code to run in parallel, independently of each other, without blocking the sequence in which the code was written. The modern web runs on asynchronous applications that allow interaction between client and server without blocking the functioning of the application.
Making asynchronous requests isn’t new. The R community has been discussing this for years. Now that many R packages have matured, there are multiple solutions. I give you three.
Solution 1: the async package
This package is fairly new and brings “asynchronous computation and ‘I/O’” to R. In my opinion it is the most elegant solution of the three. It has an unfamiliar syntax, but it allows you to easily chain the multiple steps needed to process a request. Each step returns a “deferred value” (i.e. the response doesn’t contain the promised value yet; it is simply reserved), which is evaluated lazily.
I couldn’t install the package through install.packages, although that should be supported. So I had to install the latest version directly from GitHub.
Here’s how it works:
- The async function creates an asynchronous function.
- In that function we use http_get, which starts a GET request in the background. It returns a deferred value.
- By using then, we wait for the response to complete and proceed with parsing the HTML: converting the raw to a character, reading it in as HTML, identifying the content element in the DOM, and finally extracting the text inside it.
Evaluating an asynchronous expression is done through synchronise. We don’t want to evaluate just one expression, though; we want to evaluate one for every page in our pages vector. We can do this via the async_map function. The result is a list with the body content of the ten URLs we listed in the pages vector.
install.packages("remotes")
remotes::install_github("r-lib/async")
library(async)

# Chain the processing steps: each then() runs once the previous step's
# deferred value has resolved
async_get <- async(function(url) {
  http_get(url)$
    then(function(x) { rawToChar(x$content) })$
    then(function(x) { read_html(x) })$
    then(function(x) { html_node(x, '.article-body__content') })$
    then(function(x) { html_text(x) })
})

bodies <- synchronise(async_map(pages, async_get))
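If the “deferred value” idea feels abstract, here is a tiny sketch of my own (not from the original example) using the package’s delay helper, showing the laziness in isolation: nothing runs until synchronise is called.

library(async)

# A minimal sketch: delay() returns a deferred value immediately; the then()
# callback only runs once synchronise() drives the event loop
wait_then_answer <- async(function() {
  delay(0.1)$then(function(value) 42)
})

synchronise(wait_then_answer())  # blocks for ~0.1 seconds, then returns 42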
Solution 2: the curl package
curl is a piece of software that allows the transfer of information over networks (of various protocols). Of course, there is an R interface that shares the name: curl. It is maintained by some of the greatest names in the R community (Jeroen Ooms and Hadley Wickham).
- The new_pool function creates a pool of multiple curl handles. Think of a handle as the configuration of a request. Although we use really simple requests here, they can be highly customized.
- We create the bodies list, which will contain all the article content.
- We specify the done_function that will be processing the output of each HTTP request. It’s basically the same steps as in the first solution.
- Next, we use lapply and curl_fetch_multi to create 10 curl handles. We store these in the pool we created earlier.
- Finally, we run all requests of the pool and follow up each request with our done_function.
- As you can see, we assign the output to bodies in the “enclosing environment” (~global assignment). Be careful: it isn’t guaranteed that each URL’s response is returned in the order in which you requested them; the sketch after the code block below shows one way to keep track.
Once again, we get a list with the body content of the ten URLs we listed in the pages vector.
library(curl)

pl <- new_pool()
bodies <- list()

# Parse each finished response and append the article text to bodies
done_function <- function(x) {
  bodies <<- append(
    bodies,
    html_text(
      html_node(
        read_html(rawToChar(x$content)),
        '.article-body__content'
      )
    )
  )
}

# Register one request per URL in the pool, then run them all
lapply(pages, function(x) {
  curl_fetch_multi(x, done = done_function, pool = pl)
})

multi_run(pool = pl)
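Because the responses can come back in any order, here is a small variation of my own on the example above: it keys each result by its URL so the list can be reordered afterwards (assuming none of the URLs redirect).

library(curl)
library(rvest)

# A hedged sketch: store each article body under its URL, then reorder to
# match the original pages vector (assumes the response URLs are unchanged,
# i.e. no redirects)
pl2 <- new_pool()
bodies_by_url <- list()

lapply(pages, function(u) {
  curl_fetch_multi(u, pool = pl2, done = function(x) {
    bodies_by_url[[x$url]] <<- html_text(
      html_node(read_html(rawToChar(x$content)), '.article-body__content')
    )
  })
})

multi_run(pool = pl2)
bodies_by_url <- bodies_by_url[pages]  # back in the order we requested them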
Solution 3: the crul package
The crul package is built on curl but is exclusively focused on the HTTP(S) protocol. It has a somewhat unusual syntax because you have to wrap parts of your code in brackets.
- The Async object is a client that can work with many URLs at once (our 10 URLs). It is stored in the requests variable.
- Next, we use the get method on the 10 URLs within our requests variable and store the results in the responses variable.
- Finally, we process the output of each response using lapply and the rvest functions we’ve been using before.
library(crul)

# Create an asynchronous client for all ten URLs and fire the GET requests
(requests <- Async$new(urls = pages))
(responses <- requests$get())

# Parse each response and extract the article body text
bodies <- lapply(responses, function(x) {
  html_text(
    html_node(
      read_html(x$parse('UTF-8')),
      '.article-body__content'
    )
  )
})
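One small addition of my own, not part of the original example: crul’s response objects carry the HTTP status code, so you can drop failed requests before parsing.

# A hedged sketch: keep only responses that came back with HTTP 200 before
# working with the extracted article text
ok <- vapply(responses, function(x) x$status_code == 200, logical(1))
bodies <- bodies[ok]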
Caveats
It is recommended to use asynchronous requests for API calls, but not necessarily for something in the grey zone such as web scraping. If you fire thousands of asynchronous calls at a web server, there’s a real possibility your IP will be blacklisted. Of course, there’s some room for you to randomize your client parameters. Furthermore, if you have a VPN server or a proxy, you can make requests from multiple IP addresses. Nevertheless, you have been warned.
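For what it’s worth, here is a minimal sketch of what “randomizing your client parameters” could look like with the curl package; the user agent strings and the proxy address are made-up placeholders, not recommendations.

library(curl)

# A hedged sketch: pick a random User-Agent per request and, optionally, route
# the request through a proxy (placeholder values only)
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
  "Mozilla/5.0 (X11; Linux x86_64)"
)

h <- new_handle(
  useragent = sample(user_agents, 1)
  # , proxy = "http://your-proxy:8080"  # hypothetical proxy address
)

res <- curl_fetch_memory(pages[1], handle = h)
res$status_code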
By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, rather than the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!