I’ve done a lot of web scraping in my life. Recently, I started using Google Cloud Platform to automate these kinds of jobs. In this blog post, I explain how to do it. Throughout this article, I scrape the comments, categories and tags of the top clips on the popular erotic website pornhub.com as an example. Why? Because I can.
The architecture
Here’s the architecture of the application:
- Cloud Scheduler triggers a cloud function that collects all the URLs of a particular video list (for example, the top 25 pages), and streams them to a Cloud Pub/Sub topic.
- Each message within that topic triggers a function that reads the URL from the message, scrapes that page and sends the comments, categories, tags and video information to separate Cloud Pub/Sub topics.
- A Dataflow job is subscribed to each of these topics and writes the information in the messages to Google BigQuery.
Who is this tutorial written for?
Data scientists with an interest in data engineering, cloud computing and basic knowledge of web architecture, version control and shell scripting. Like myself, I guess.
Need to catch up?
- Tutorial on batch scripting for Windows, or shell scripting for UNIX systems.
- Tutorial on web scraping using Python.
- Tutorial on version control.
Our working environment
First things first, let’s get our environment set up.
- If you haven’t already, install the Google Cloud SDK
- We will be deploying from our local computer to Google Cloud Platform. Create a Google Cloud project and register a GCP service account.
- Open Visual Studio Code, or your favorite Python editor, and set up a virtual environment.
- If you want to use my code, clone my GitHub repo.
Cloud Functions for web scraping
Our first function, which collects all the URLs, uses an HTTP trigger. You’ll be able to trigger your function manually by browsing to its address, or by using Cloud Scheduler, which is completely analogous to a cronjob. This is the template for this kind of function:
from markupsafe import escape


def hello_http(request):
    """HTTP Cloud Function.
    Args:
        request (flask.Request): The request object.
        <http://flask.pocoo.org/docs/1.0/api/#flask.Request>
    Returns:
        The response text, or any set of values that can be turned into a
        Response object using `make_response`
        <http://flask.pocoo.org/docs/1.0/api/#flask.Flask.make_response>.
    """
    request_json = request.get_json(silent=True)
    request_args = request.args

    if request_json and 'name' in request_json:
        name = request_json['name']
    elif request_args and 'name' in request_args:
        name = request_args['name']
    else:
        name = 'World'
    return 'Hello {}!'.format(escape(name))
If you go through my code, you’ll see that the web scraping is pretty straightforward. The scrape_urls() function processes the request and scrapes the top X pages using the BeautifulSoup package, where X can be passed through the request. For example, ?pages=25 tells the function to scrape the video URLs of the first 25 pages of most popular videos. The publish() function then publishes these URLs to the ‘phub-url’ topic.
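To give you an idea, here’s a stripped-down sketch of what such a function can look like. The list-page URL and the CSS selector are illustrative assumptions, not the exact ones from my repo, and publishing is inlined instead of going through a separate publish() function:

# Sketch of an HTTP-triggered scraper, not the exact code from the repo.
# The base URL pattern and the CSS selector below are illustrative assumptions.
import requests
from bs4 import BeautifulSoup
from google.cloud import pubsub_v1


def scrape_urls(request):
    # Number of list pages to scrape, e.g. ?pages=25 (defaults to 1).
    pages = int(request.args.get('pages', 1))

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path('<your project name>', 'phub-url')

    for page in range(1, pages + 1):
        # Hypothetical URL of the 'most popular videos' list page.
        html = requests.get('https://www.example.com/video?o=mv&page={}'.format(page)).text
        soup = BeautifulSoup(html, 'html.parser')
        # Collect the link of every video on the page (selector is an assumption).
        for anchor in soup.select('a.videoTitle'):
            url = anchor.get('href')
            publisher.publish(topic_path, data=url.encode('utf-8'))

    return 'Scraped all URLs.'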
Scraping the information from the video page is less straightforward, but it should be easy to understand if your Python skills are decent. The video information, categories, tags and comments are stored in separate lists and passed to the publish() function. Each message is stored in a JSON object and encoded to UTF-8. Finally, each message type is published to its own topic.
import json
from google.cloud import pubsub_v1

# Inside publish(): each message type goes to its own topic.
if messageType == 'comment':
    project_id = "<your project name>"
    topic_name = "phub-comment"
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_name)
    for message in messages:
        # message[0] holds the comment text, message[1] the user name.
        data = {'videoId': videoId, 'comment': str(message[0]), 'user': str(message[1])}
        data_json = json.dumps(data).encode('utf-8')
        future = publisher.publish(topic_path, data=data_json)
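On the receiving end, the second Cloud Function uses a Pub/Sub trigger instead of an HTTP one. Here’s a minimal sketch of its entry point, assuming it’s called scrape_video and that scrape_page() stands in for the actual scraping logic; the base64-encoded event['data'] field is how background functions receive Pub/Sub messages:

# Sketch of the Pub/Sub-triggered entry point; scrape_video and scrape_page()
# are placeholder names for the actual function and scraping logic in the repo.
import base64


def scrape_video(event, context):
    # Background functions receive the Pub/Sub message base64-encoded in 'data'.
    url = base64.b64decode(event['data']).decode('utf-8')
    scrape_page(url)  # scrape comments, categories, tags and video info, then publish()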
We can test a function locally by running it as a Flask application. For example, I added this chunk of code to the scrape_urls() function file.
if __name__ == '__main__':
    from flask import Flask, request
    app = Flask(__name__)

    # option 1
    @app.route('/', methods=['POST', 'GET'])
    def test():
        return scrape_urls(request)

    # option 2
    app.add_url_rule('/scrape_urls', 'scrape_urls', scrape_urls, methods=['POST', 'GET'], defaults={'request': request})

    app.run(host='127.0.0.1', port=8088, debug=True)
Run the file with the command python main.py from the folder that contains it, and you’ll be able to test the function on your machine. Browsing to 127.0.0.1:8088 triggers the function. If you see the message ‘Scraped all URLs.’, your function is working properly.
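You can also hit the local endpoint from Python and pass the pages parameter, for example:

import requests

# Assumes the local Flask test app above is running on port 8088.
resp = requests.get('http://127.0.0.1:8088/', params={'pages': 1})
print(resp.status_code, resp.text)  # expect 200 and 'Scraped all URLs.'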
Finally, you can deploy your Cloud Function from the folder where the function is located. If your main.py is in the ‘/scraping/scrape_urls’ folder, run the following command from ‘scrape_urls’. Deploying the other function is analogous, except that it takes a Pub/Sub trigger (--trigger-topic) instead of an HTTP trigger.
gcloud functions deploy scrape_urls --runtime python37 --trigger-http
BigQuery for storing the data
Creating datasets in BigQuery is fairly straightforward. Although you can use the bq command-line tool or the BigQuery API for Python, it’s quickest to do it through the BigQuery web interface.
First, create a dataset. Then, create the necessary tables. For the comments table, I created fields that match the messages published to the phub-comment topic: videoId, comment and user.
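If you prefer scripting this over clicking through the console, here’s a minimal sketch using the BigQuery Python client. The dataset name ‘phub’ is a placeholder; the schema mirrors the comment messages shown earlier:

from google.cloud import bigquery

client = bigquery.Client(project='<your project name>')

# 'phub' is a placeholder dataset name.
dataset = client.create_dataset('phub', exists_ok=True)

# Schema mirrors the JSON published to the phub-comment topic.
schema = [
    bigquery.SchemaField('videoId', 'STRING'),
    bigquery.SchemaField('comment', 'STRING'),
    bigquery.SchemaField('user', 'STRING'),
]
table = bigquery.Table(dataset.table('comments'), schema=schema)
client.create_table(table, exists_ok=True)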
Once you’re done creating all the tables, we can move on to the next section.
Using DataFlow for streaming the data into BigQuery
Dataflow is a GCP service that runs Apache Beam programs. Google provides some templates out of the box. We will use one of these templates to pick up the messages in Pub/Sub and stream them in real time into our Google BigQuery dataset.
Name your job, select the region closest to you, and go for the “Cloud Pub/Sub Topic to BigQuery” template. The required parameters are:
- The Pub/Sub topic to read from.
- The BigQuery output table you want to stream each message in this topic to.
- Finally, the bucket (tutorial, if you haven’t created one yet) where you want to store the Beam programs and the temporary files.
For every topic (videos, comments, categories and tags) you need to create a separate Dataflow job. Once the Dataflow jobs are set up, they will pick up all the messages published on their respective topics and stream the data into the BigQuery tables.
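If you’d rather not click through the console four times, the Google-provided template can also be launched programmatically through the Dataflow REST API. Here’s a sketch using google-api-python-client; the job name, region, dataset, table and bucket are placeholders:

from googleapiclient.discovery import build

project_id = '<your project name>'
dataflow = build('dataflow', 'v1b3')

# Launch the Google-provided 'Cloud Pub/Sub Topic to BigQuery' template.
# Job name, region, dataset/table and bucket below are placeholders.
request = dataflow.projects().locations().templates().launch(
    projectId=project_id,
    location='europe-west1',
    gcsPath='gs://dataflow-templates/latest/PubSub_to_BigQuery',
    body={
        'jobName': 'phub-comment-to-bq',
        'parameters': {
            'inputTopic': 'projects/{}/topics/phub-comment'.format(project_id),
            'outputTableSpec': '{}:phub.comments'.format(project_id),
        },
        'environment': {'tempLocation': 'gs://<your bucket>/temp'},
    },
)
response = request.execute()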
Putting it all together
This is what we have done:
- We have deployed our Cloud Functions that scrape the pages from pornhub.com and stream the data to Pub/Sub.
- We have created the dataset and tables in Google BigQuery, where the scraped data can be stored.
- We have set up all the Dataflow jobs.
The only thing left is to trigger the scrape_urls Cloud Function to start up the whole pipeline. There are two things you can do:
- You can configure Cloud Scheduler to trigger the scrape_urls function. Romin Irani has written a great tutorial on configuring Cloud Scheduler to trigger Cloud Functions.
- You can start the scrape_urls function manually by browsing to its URL.
Great success!
Hi,
Thanks for sharing, but what are the benefits of using Dataflow? Why do you need this middleman?
As far as I know, GCP allows you to stream directly from a Cloud Function to BigQuery (proof: https://cloud.google.com/architecture/streaming-data-from-cloud-storage-into-bigquery-using-cloud-functions).
Thanks,
Gregory
Hi Gregory, when I wrote this article, BigQuery’s streaming API didn’t exist in its current form.