
Three ways to make asynchronous GET requests in R

Warning: this blog post touches on a rather technical subject and is not aimed at beginners. If you have ever built a web scraping script in R, or any other programming language, you know how long it can take to scrape a sizeable number of web pages. The same goes for interacting with APIs. What if I told you there is a solution: asynchronous requests.

In this blog post, we will scrape the body content of 10 articles on the NBC News website. I have listed them here and loaded rvest, a popular R package for web scraping.

pages <- c('https://www.nbcnews.com/politics/2020-election/democrats-are-leading-polls-means-it-s-time-them-panic-n1239378',
           'https://www.nbcnews.com/politics/2020-election/biden-says-he-spoke-jacob-blake-praises-family-s-resilience-n1239230',
           'https://www.nbcnews.com/news/nbcblk/biden-arrive-kenosha-just-city-achieves-fragile-calm-n1239156',
           'https://www.nbcnews.com/politics/2020-election/biden-trump-put-different-visions-vivid-display-back-campaign-trail-n1239259',
           'https://www.nbcnews.com/politics/2020-election/democrats-requesting-absentee-ballots-outnumber-gop-key-swing-states-n1239361',
           'https://www.nbcnews.com/politics/2020-election/trump-often-sees-american-landscape-losers-suckers-n1239304',
           'https://www.nbcnews.com/politics/2020-election/war-veteran-democrats-slam-trump-over-report-he-called-u-n1239324',
           'https://www.nbcnews.com/politics/congress/congresswoman-blocked-touring-mail-facility-postal-service-police-n1239359',
           'https://www.nbcnews.com/politics/2020-election/trump-campaign-seeks-block-navajo-nation-voters-lawsuit-over-arizona-n1239328',
           'https://www.nbcnews.com/politics/donald-trump/white-house-denies-report-claiming-trump-called-dead-american-soldiers-n1239267')

library(rvest)

In the following chunk of code, we scrape these URLs with a simple for loop and some functions from the rvest (and xml2) packages.

responses <- list()
bodies <- list()

# fetch every page one by one and extract the article body
for (i in seq_along(pages)) {
  responses[[i]] <- read_html(pages[i])
  bodies[[i]] <- html_text(html_node(responses[[i]], '.article-body__content'))
}
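
To put a number on it, you can wrap the loop in base R’s system.time(). The following sketch simply reruns the loop from above and reports the elapsed wall-clock time.

# time the sequential approach: the requests run one after the other
timing_sequential <- system.time({
  bodies_seq <- list()
  for (i in seq_along(pages)) {
    page <- read_html(pages[i])
    bodies_seq[[i]] <- html_text(html_node(page, '.article-body__content'))
  }
})
timing_sequential["elapsed"]   # total number of seconds for the ten pages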

Now, we only scraped ten URLs, and it already took a couple of seconds. Imagine scraping thousands of URLs: it takes hours and hours. There is, however, a faster way to make requests. Enter: asynchrony.

Asynchrony allows pieces of code to run in parallel, independently of each other, without blocking the sequence in which the code was written. The modern web runs on asynchronous applications that allow interaction between client and server without blocking the functioning of the application.

Making asynchronous requests isn’t new. The R community has been discussing this for years. Now that many R packages have matured, there are multiple solutions. I give you three.

Solution 1: the async package

This package is fairly new and brings “asynchronous computation and ‘I/O’” to R. In my opinion, it is the most elegant solution of the three. Its syntax may be unfamiliar, but it lets you easily chain the multiple steps needed to process a request. Each step returns a “deferred value” (i.e. the response doesn’t contain the promised value yet; it is simply reserved), which is evaluated lazily.

I couldn’t install the package through install.packages(), even though that should be supported, so I installed the latest version directly from GitHub.

Here’s how it works:

Evaluating an asynchronous expression is done through synchronise(). However, we don’t want to evaluate just one expression: we want to evaluate one for every page in our pages vector, which is what async_map() does. The result is a list with the body content of the ten URLs in the pages vector.

install.packages("remotes")
remotes::install_github("r-lib/async")
library(async)

async_get <- async(function(url) {
  http_get(url)$
    then(function(x) { rawToChar(x$content)})$
    then(function(x) { read_html(x) } )$
    then(function(x) { html_node(x, '.article-body__content')})$
    then(function(x) { html_text(x) })
})

bodies <- synchronise(async_map(pages,async_get))

Solution 2: the curl package

curl is a piece of software for transferring data over networks using a wide range of protocols. Of course, there is an R interface that shares the name: curl. It is maintained by some of the greatest names in the R community (Jeroen Ooms and Hadley Wickham).

Once again, we get a list with the body content of the ten URLs we listed in the pages vector.

library(curl)
pl <- new_pool()
bodies <- list()

# callback that runs as soon as an individual request completes
done_function <- function(x) {
  bodies <<- append(bodies,
                    html_text(
                      html_node(
                        read_html(
                          rawToChar(x$content)
                        ),
                        '.article-body__content'
                      )
                    )
                   )
}

# register a request for every page in the pool ...
lapply(pages, function(x) { curl_fetch_multi(x, done = done_function, pool = pl) })

# ... and perform them concurrently; note that the bodies are appended in the
# order in which the requests complete, not necessarily the order of pages
multi_run(pool = pl)

Solution 3: the crul package

The crul package is built on top of curl but focuses exclusively on the HTTP(S) protocol. The syntax in the example below may look a bit odd because the assignments are wrapped in parentheses, which simply prints the resulting objects.

library(crul)

# set up one asynchronous client for all URLs
(requests <- Async$new(
  urls = pages
))

# perform the GET requests concurrently
(responses <- requests$get())

# parse each response and extract the article body
bodies <- lapply(responses, 
                 function(x) { 
                   html_text(
                     html_node(
                       read_html(x$parse('UTF-8')),
                       '.article-body__content'
                     )
                   )
                 }
               )
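
If you want to see the speed-up for yourself, you can wrap the crul calls (or either of the other two solutions) in system.time() as well and compare the elapsed time with that of the sequential loop at the start of this post. A minimal sketch:

# time the asynchronous approach: all ten requests run concurrently
timing_async <- system.time({
  responses_async <- Async$new(urls = pages)$get()
  bodies_async <- lapply(responses_async, function(x) {
    html_text(html_node(read_html(x$parse('UTF-8')), '.article-body__content'))
  })
})
timing_async["elapsed"]   # typically a fraction of the sequential timing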

Caveats

Asynchronous requests are recommended for API calls, but not necessarily for something in the grey zone such as web scraping. If you fire thousands of asynchronous calls at a web server, there’s a real possibility your IP will be blacklisted. Of course, there’s some room to randomize your client parameters, as the sketch below shows. Furthermore, if you have a VPN server or a proxy, you can make requests from multiple IP addresses. Nevertheless, you have been warned.
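
As an example of such randomization, the curl package from solution 2 lets you attach a handle to every request. The sketch below picks a random user agent per request; the user-agent strings (and the proxy address in the comment) are made-up placeholders, so substitute your own values.

# reuses the pool and done_function from solution 2
user_agents <- c(   # placeholder values, for illustration only
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
)

lapply(pages, function(x) {
  h <- new_handle(useragent = sample(user_agents, 1))
  # a proxy could be set in the same way: new_handle(proxy = "http://my.proxy:8080")
  curl_fetch_multi(x, done = done_function, pool = pl, handle = h)
})
multi_run(pool = pl)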

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, rather than to the data science part, I can absolutely recommend this book by Richard Cotton. Hope it helps!
