Web scraping with Google Cloud Functions, Pub/Sub, DataFlow & BigQuery

I’ve done a lot of web scraping in my life. Recently, I started using Google Cloud Platform to automate these kinds of jobs. In this blog post I explain how to do it. Throughout this article I scrape the comments, categories and tags of the top clips on the popular erotic website pornhub.com as an example. Why? Because I can.

The architecture

Here’s the architecture of the application:

  1. Cloud Scheduler triggers a cloud function that collects all the URLs of a particular video list (for example, the top 25 pages), and streams them to a Cloud Pub/Sub topic.
  2. Each message within that topic triggers a function that reads the URL from the message, scrapes that page and sends all the comments, categories, tags and video information to a Cloud Pub/Sub topic.
  3. A DataFlow Job is subscribed to a topic and writes all the information within these messages to Google BigQuery.
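Step 2 relies on the way Pub/Sub hands a message to a background function: the payload arrives base64-encoded under the event's 'data' key. Here is a minimal sketch of reading the URL back out. The function names are hypothetical, and the actual scraping is left out.

```python
import base64


def url_from_pubsub_event(event):
    """Returns the URL carried in a Pub/Sub-triggered function's event.

    Pub/Sub delivers the message body base64-encoded under the 'data' key.
    """
    return base64.b64decode(event["data"]).decode("utf-8")


def scrape_video(event, context):
    """Hypothetical entry point; the name and body are illustrative only."""
    url = url_from_pubsub_event(event)
    # The real function would fetch and parse the page behind `url` here.
    return url
```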

Who is this tutorial written for?

Data scientists with an interest in data engineering, cloud computing and basic knowledge of web architecture, version control and shell scripting. Like myself, I guess.

Need to catch up?

Our working environment

First things first, let’s get our environment set up.

Cloud Functions for web scraping

Our first function, which collects all the URLs, uses an HTTP trigger. You’ll be able to trigger your function manually by browsing to its address, or by using Cloud Scheduler, which works much like a cron job. This is the template for this kind of function:

from markupsafe import escape  # escape() is used in the return statement


def hello_http(request):
    """HTTP Cloud Function.
    Args:
        request (flask.Request): The request object.
        <http://flask.pocoo.org/docs/1.0/api/#flask.Request>
    Returns:
        The response text, or any set of values that can be turned into a
        Response object using `make_response`
        <http://flask.pocoo.org/docs/1.0/api/#flask.Flask.make_response>.
    """
    request_json = request.get_json(silent=True)
    request_args = request.args

    if request_json and 'name' in request_json:
        name = request_json['name']
    elif request_args and 'name' in request_args:
        name = request_args['name']
    else:
        name = 'World'
    return 'Hello {}!'.format(escape(name))

If you go through my code, you’ll see that the web scraping is pretty straightforward. The function scrape_urls() processes the request and scrapes the top X pages using the BeautifulSoup package, where X can be passed through the request. For example, ?pages=25 tells the function to scrape the video URLs of the first 25 pages of most popular videos. The publish() function publishes the URLs to the ‘phub-url’ topic.
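To give an idea of what such a function does, here is a rough sketch of the two core steps: reading the pages parameter and pulling video links out of a list page. It is not my actual code. It uses the standard library's html.parser as a stand-in for BeautifulSoup so it runs without extra packages, and the '/video/' marker, the base URL and all function names are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag whose href contains a marker."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            url = urljoin(self.base_url, href)
            if url not in self.links:  # skip duplicate links
                self.links.append(url)


def pages_from_args(args, default=1):
    """Reads ?pages=N from the query arguments, falling back to a default."""
    try:
        pages = int(args.get("pages", default))
    except (TypeError, ValueError):
        pages = default
    return max(1, pages)


def collect_video_urls(html, base_url, marker="/video/"):
    """Returns the absolute video URLs found in one list page."""
    parser = LinkCollector(base_url, marker)
    parser.feed(html)
    return parser.links
```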

Scraping the information from the video page is less straightforward, but should be easy to understand if your Python skills are OK. The video information, categories, tags and comments are stored in separate lists and passed to the publish() function. Each message is serialized to JSON and encoded as UTF-8. Finally, each message type gets published to its own topic.

if messageType == 'comment':
    project_id = "<your project name>"
    topic_name = "phub-comment"
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_name)
    for message in messages:
        data = {'videoId': videoId, 'comment': str(message[0]), 'user': str(message[1])}
        data_json = json.dumps(data)
        data_json = data_json.encode('utf-8')
        future = publisher.publish(
            topic_path, data=data_json
        )
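The serialization step in the snippet above can be pulled out into a small helper that is easy to test without touching Pub/Sub. This is a sketch, not a definitive implementation: build_comment_payloads is a hypothetical name, and it assumes, like the snippet, that each message holds the comment at index 0 and the user at index 1.

```python
import json


def build_comment_payloads(video_id, messages):
    """Turns (comment, user) sequences into UTF-8 encoded JSON payloads."""
    payloads = []
    for message in messages:
        data = {
            "videoId": video_id,
            "comment": str(message[0]),
            "user": str(message[1]),
        }
        # Pub/Sub expects bytes, so encode the JSON string.
        payloads.append(json.dumps(data).encode("utf-8"))
    return payloads
```

Each payload can then be passed as data= to publisher.publish(). That call returns a future, and calling future.result() on it waits until the message has been accepted.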

We can test a function locally by running it as a Flask application. For example, I added this chunk of code to the file that contains the scrape_urls() function.

if __name__ == '__main__':
    from flask import Flask, request
    app = Flask(__name__)

    # option 1
    @app.route('/', methods=['POST', 'GET'])
    def test():
        return scrape_urls(request)

    # option 2
    app.add_url_rule('/scrape_urls', 'scrape_urls', scrape_urls, methods=['POST', 'GET'], defaults={'request': request})

    app.run(host='127.0.0.1', port=8088, debug=True)

Run the file with the command python main.py from a terminal opened in the folder that contains it, and you’ll be able to test the function on your machine. Browsing to 127.0.0.1:8088 triggers the function. If you see the message ‘Scraped all URLs.’, it means your function is working properly.
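If you’d rather not start a server at all, you can also call an HTTP function directly with a stand-in request object. The sketch below is one possible way to do that. Both fake_request and demo_handler are hypothetical: the fake only mimics the .args and .get_json() parts of a Flask request, and demo_handler takes the place of scrape_urls().

```python
from types import SimpleNamespace


def fake_request(args=None, json_body=None):
    """Builds a minimal stand-in exposing `.args` and `.get_json()`."""
    return SimpleNamespace(
        args=args or {},
        get_json=lambda silent=False: json_body,
    )


def demo_handler(request):
    """Hypothetical stand-in for scrape_urls(); only reads ?pages=N."""
    pages = int(request.args.get("pages", 1))
    return "Would scrape {} page(s).".format(pages)
```

Calling demo_handler(fake_request({"pages": "3"})) then exercises the handler the same way a browser request with ?pages=3 would, minus the network.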

Finally, you can deploy your cloud function from the folder where your function is located. If your main.py is in the ‘/scraping/scrape_urls’ folder, your terminal should be in ‘scrape_urls’. Then, run the following command. Deploying the other function is analogous, except that it is triggered by Pub/Sub, so it takes --trigger-topic followed by the topic name instead of --trigger-http.

gcloud functions deploy scrape_urls --runtime python37 --trigger-http

BigQuery for storing the data

Creating datasets in BigQuery is fairly straightforward. Although you can use gcloud or the BigQuery API for Python, you can do it fairly quickly through the BigQuery interface.

First, create a dataset.

Next, create the necessary tables.

For the comments table, I created the following fields.
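The comment messages published earlier carry the keys videoId, comment and user, so the table needs columns with those names. Here is a minimal sketch of such a schema in BigQuery's JSON schema layout. The STRING types and NULLABLE modes are assumptions, and missing_columns is a hypothetical helper for checking a message against the schema before you publish it.

```python
# Field names mirror the keys of the comment message.
# Types and modes are assumptions, not taken from the original table.
COMMENT_SCHEMA = [
    {"name": "videoId", "type": "STRING", "mode": "NULLABLE"},
    {"name": "comment", "type": "STRING", "mode": "NULLABLE"},
    {"name": "user", "type": "STRING", "mode": "NULLABLE"},
]


def missing_columns(row, schema):
    """Returns the keys of `row` that have no matching column in `schema`."""
    columns = {field["name"] for field in schema}
    return sorted(key for key in row if key not in columns)
```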

Once you’ve created all the tables, we can go to the next section.

Using DataFlow for streaming the data into BigQuery

DataFlow is a GCP service that runs Apache Beam programs. Google provides some templates out of the box. We will use one of these templates to pick up the messages in Pub/Sub and stream them in real time into our Google BigQuery dataset.

Name your job, select your closest region, and go for the “Cloud Pub/Sub Topic to BigQuery” template. The required parameters are:

  • The Pub/Sub topic
  • The BigQuery output table you want to stream each message in this topic to.
  • Finally, the bucket (create one first if you haven’t yet) where you want to store the Beam programs and the temporary files.

For every topic (videos, comments, categories and tags) you need to create a separate DataFlow job. Once the DataFlow jobs are set up, they pick up all the messages published on their respective topics and stream the data into the BigQuery tables.

Putting it all together

This is what we have done:

  • We have deployed our Cloud Functions, which scrape the web pages from pornhub.com and stream the results to Pub/Sub.
  • We have created the dataset and tables in Google BigQuery, where the scraped data can be stored.
  • We have set up all the DataFlow jobs.

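The only thing left is for you to trigger the first Cloud Function, which starts up the whole pipeline. There are two things you can do: trigger it manually by browsing to its address, or let Cloud Scheduler call it on a schedule.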
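Triggering the first function over HTTP comes down to requesting its URL, optionally with the pages parameter. The sketch below builds that URL, assuming the usual https://REGION-PROJECT.cloudfunctions.net/FUNCTION address. The helper name and the example values are hypothetical, and gcloud prints the real URL when you deploy, so check against that.

```python
from urllib.parse import urlencode


def build_trigger_url(region, project_id, function_name, pages=None):
    """Builds the HTTP trigger URL, adding ?pages=N when pages is given."""
    base = "https://{}-{}.cloudfunctions.net/{}".format(
        region, project_id, function_name
    )
    if pages is None:
        return base
    return base + "?" + urlencode({"pages": pages})
```

You can open the resulting URL in a browser, fetch it with urllib.request.urlopen(), or use it as the target of a Cloud Scheduler HTTP job.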

Great success!