I’ve done a lot of web scraping in my life. Recently, I started using Google Cloud Platform to automate these kinds of jobs. In this blog post, I explain how to do it. Throughout this article, I scrape the comments, categories and tags of the top clips on the popular erotic website pornhub.com as an example. Why? Because I can.
The architecture
Here’s the architecture of the application:
- Cloud Scheduler triggers a cloud function that collects all the URLs of a particular video list (for example, the top 25 pages), and streams them to a Cloud Pub/Sub topic.
- Each message within that topic triggers a function that reads the URL from the message, scrapes that page and sends the comments, categories, tags and video information to separate Cloud Pub/Sub topics.
- A Dataflow job is subscribed to each of these topics and writes the information in the messages to Google BigQuery.
Who is this tutorial written for?
Data scientists with an interest in data engineering, cloud computing and basic knowledge of web architecture, version control and shell scripting. Like myself, I guess.
Need to catch up?
- Tutorial on batch scripting for Windows, or shell scripting for UNIX systems.
- Tutorial on web scraping using Python.
- Tutorial on version control.
Our working environment
First things first, let’s get our environment set up.
- If you haven’t already, install the Google Cloud SDK
- We will be deploying from our local computer to Google Cloud Platform. Create a Google Cloud project and register a GCP service account.
- Open Visual Studio Code, or your favorite Python editor, and set up a virtual environment.
- If you want to use my code, clone my GitHub repo.
Cloud Functions for web scraping
Our first function, which collects all the URLs, uses an HTTP trigger. You’ll be able to trigger your function manually by browsing to its address, or by using Cloud Scheduler, which is completely analogous to a cronjob. This is the template for this kind of function:
from markupsafe import escape


def hello_http(request):
    """HTTP Cloud Function.
    Args:
        request (flask.Request): The request object.
        <http://flask.pocoo.org/docs/1.0/api/#flask.Request>
    Returns:
        The response text, or any set of values that can be turned into a
        Response object using `make_response`
        <http://flask.pocoo.org/docs/1.0/api/#flask.Flask.make_response>.
    """
    request_json = request.get_json(silent=True)
    request_args = request.args

    if request_json and 'name' in request_json:
        name = request_json['name']
    elif request_args and 'name' in request_args:
        name = request_args['name']
    else:
        name = 'World'
    return 'Hello {}!'.format(escape(name))
If you go through my code, you’ll see that the web scraping is pretty straightforward. The scrape_urls() function processes the request and scrapes the top X pages using the BeautifulSoup package, where X can be passed through the request. For example, ?pages=25 tells the function to scrape the video URLs of the first 25 pages of most popular videos. The publish() function then publishes these URLs to the ‘phub-url’ topic.
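To give you an idea, here’s a stripped-down sketch of what such a function can look like. The list-page URL and the CSS selector are illustrative assumptions, not the exact ones from my repo, and publishing is inlined instead of going through a separate publish() function:

# Sketch of an HTTP-triggered scraper, not the exact code from the repo.
# The base URL pattern and the CSS selector below are illustrative assumptions.
import requests
from bs4 import BeautifulSoup
from google.cloud import pubsub_v1


def scrape_urls(request):
    # Number of list pages to scrape, e.g. ?pages=25 (defaults to 1).
    pages = int(request.args.get('pages', 1))

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path('<your project name>', 'phub-url')

    for page in range(1, pages + 1):
        # Hypothetical URL of the 'most popular videos' list page.
        html = requests.get('https://www.example.com/video?o=mv&page={}'.format(page)).text
        soup = BeautifulSoup(html, 'html.parser')
        # Collect the link of every video on the page (selector is an assumption).
        for anchor in soup.select('a.videoTitle'):
            url = anchor.get('href')
            publisher.publish(topic_path, data=url.encode('utf-8'))

    return 'Scraped all URLs.'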
Scraping the information from the video page is less straightforward, but it should be easy to understand if your Python skills are decent. The video information, categories, tags and comments are stored in separate lists and passed to the publish() function. Each message is stored in a JSON object and encoded to UTF-8. Finally, each message type is published to its own topic.
import json
from google.cloud import pubsub_v1

# Inside publish(): each message type goes to its own topic.
if messageType == 'comment':
    project_id = "<your project name>"
    topic_name = "phub-comment"
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_name)
    for message in messages:
        # message[0] holds the comment text, message[1] the user name.
        data = {'videoId': videoId, 'comment': str(message[0]), 'user': str(message[1])}
        data_json = json.dumps(data).encode('utf-8')
        future = publisher.publish(topic_path, data=data_json)
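On the receiving end, the second Cloud Function uses a Pub/Sub trigger instead of an HTTP one. Here’s a minimal sketch of its entry point, assuming it’s called scrape_video and that scrape_page() stands in for the actual scraping logic; the base64-encoded event['data'] field is how background functions receive Pub/Sub messages:

# Sketch of the Pub/Sub-triggered entry point; scrape_video and scrape_page()
# are placeholder names for the actual function and scraping logic in the repo.
import base64


def scrape_video(event, context):
    # Background functions receive the Pub/Sub message base64-encoded in 'data'.
    url = base64.b64decode(event['data']).decode('utf-8')
    scrape_page(url)  # scrape comments, categories, tags and video info, then publish()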
We can test a function locally by running it as a Flask application. For example, I added this chunk of code to the scrape_urls() function file.
if __name__ == '__main__':
    from flask import Flask, request
    app = Flask(__name__)

    # option 1
    @app.route('/', methods=['POST', 'GET'])
    def test():
        return scrape_urls(request)

    # option 2
    app.add_url_rule('/scrape_urls', 'scrape_urls', scrape_urls, methods=['POST', 'GET'], defaults={'request': request})

    app.run(host='127.0.0.1', port=8088, debug=True)
Run the file with the command python main.py from the folder that contains it, and you’ll be able to test the function on your machine. Browsing to 127.0.0.1:8088 triggers the function. If you see the message ‘Scraped all URLs.’, your function is working properly.
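You can also hit the local endpoint from Python and pass the pages parameter, for example:

import requests

# Assumes the local Flask test app above is running on port 8088.
resp = requests.get('http://127.0.0.1:8088/', params={'pages': 1})
print(resp.status_code, resp.text)  # expect 200 and 'Scraped all URLs.'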
Finally, you can deploy your Cloud Function from the folder where the function is located. If your main.py is in the ‘/scraping/scrape_urls’ folder, run the following command from ‘scrape_urls’. Deploying the other function is analogous, except that it takes a Pub/Sub trigger (--trigger-topic) instead of an HTTP trigger.
gcloud functions deploy scrape_urls --runtime python37 --trigger-http
BigQuery for storing the data
Creating datasets in BigQuery is fairly straightforward. Although you can use the bq command-line tool or the BigQuery API for Python, it’s quickest to do it through the BigQuery web interface.
First, create a dataset. Then, create the necessary tables. For the comments table, I created fields that match the messages published to the phub-comment topic: videoId, comment and user.
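If you prefer scripting this over clicking through the console, here’s a minimal sketch using the BigQuery Python client. The dataset name ‘phub’ is a placeholder; the schema mirrors the comment messages shown earlier:

from google.cloud import bigquery

client = bigquery.Client(project='<your project name>')

# 'phub' is a placeholder dataset name.
dataset = client.create_dataset('phub', exists_ok=True)

# Schema mirrors the JSON published to the phub-comment topic.
schema = [
    bigquery.SchemaField('videoId', 'STRING'),
    bigquery.SchemaField('comment', 'STRING'),
    bigquery.SchemaField('user', 'STRING'),
]
table = bigquery.Table(dataset.table('comments'), schema=schema)
client.create_table(table, exists_ok=True)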
Once you’re done creating all the tables, we can move on to the next section.
Using DataFlow for streaming the data into BigQuery
Dataflow is a GCP service that runs Apache Beam programs. Google provides some templates out of the box. We will use one of these templates to pick up the messages in Pub/Sub and stream them in real time into our Google BigQuery dataset.
Name your job, select the region closest to you, and go for the “Cloud Pub/Sub Topic to BigQuery” template. The required parameters are:
- The Pub/Sub topic to read from.
- The BigQuery output table you want to stream each message in this topic to.
- Finally, the bucket (tutorial, if you haven’t created one yet) where you want to store the Beam programs and the temporary files.
For every topic (videos, comments, categories and tags) you need to create a separate Dataflow job. Once the Dataflow jobs are set up, they will pick up all the messages published on their respective topics and stream the data into the BigQuery tables.
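If you’d rather not click through the console four times, the Google-provided template can also be launched programmatically through the Dataflow REST API. Here’s a sketch using google-api-python-client; the job name, region, dataset, table and bucket are placeholders:

from googleapiclient.discovery import build

project_id = '<your project name>'
dataflow = build('dataflow', 'v1b3')

# Launch the Google-provided 'Cloud Pub/Sub Topic to BigQuery' template.
# Job name, region, dataset/table and bucket below are placeholders.
request = dataflow.projects().locations().templates().launch(
    projectId=project_id,
    location='europe-west1',
    gcsPath='gs://dataflow-templates/latest/PubSub_to_BigQuery',
    body={
        'jobName': 'phub-comment-to-bq',
        'parameters': {
            'inputTopic': 'projects/{}/topics/phub-comment'.format(project_id),
            'outputTableSpec': '{}:phub.comments'.format(project_id),
        },
        'environment': {'tempLocation': 'gs://<your bucket>/temp'},
    },
)
response = request.execute()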
Putting it all together
This is what we have done:
- We have deployed our Cloud Functions that scrape the pages from pornhub.com and stream the data to Pub/Sub.
- We have created the dataset and tables in Google BigQuery, where the scraped data can be stored.
- We have set up all the Dataflow jobs.
The only thing left is to trigger the scrape_urls Cloud Function to start up the whole pipeline. There are two things you can do:
- You can configure Cloud Scheduler to trigger the scrape_urls function. Romin Irani has written a great tutorial on configuring Cloud Scheduler to trigger Cloud Functions.
- You can start the scrape_urls function manually by browsing to its URL.
Great success!
Hi,
Thanks for sharing, but what are the benefits of using Dataflow? Why do you need this middleman?
As far as I know, GCP allows you to stream directly from a Cloud Function to BigQuery (proof: https://cloud.google.com/architecture/streaming-data-from-cloud-storage-into-bigquery-using-cloud-functions).
Thanks,
Gregory
Hi Gregory, when I wrote this article, BigQuery’s streaming API didn’t exist in its current form.