I don’t think there is a project I have wanted to do for as long as this one. Although there was a pandemic and an accompanying lockdown this year, I still hadn’t found the time to tackle it (became a dad and all). But I had the weekend to myself, so I decided to go ahead and finally build my own (prototype of a) web analytics solution. Bonus: I didn’t use my credit card, at all.
There is no better time than now to get into data pipelines. Many complex tools that once required a lot of technical skill have matured into SaaS tools that allow any script-kiddie to make a data pipeline. If you want to build a pipeline yourself, consult the Big Data Ecosystem Browser to find the right data tools.
Part I: set up a tag to capture events with page and device variables
In this first blog post, you’ll learn how to create a tag in Google Tag Manager that will allow you to capture page views and events with a wide range of page and device dimensions. The variables and dimensions that are configured in this tag determine what can be reported on. Although they are not as exhaustive as the dimensions and variables that are available in Google Analytics, they should be enough for generating basic digital marketing and digital product insights.
Any tag management solution could be used. I went with GTM because it is free and widely supported within the web analytics community.
Note that this first blog post is open-ended: it stops where the next blog post begins. To be very clear, you can plug any API or webhook that accepts POST or GET requests into this tag.
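To make the "any API/webhook" point concrete, here is a minimal sketch of what such an event could look like on the wire. The real tag runs as JavaScript inside GTM; this Python version only illustrates the payload shape. The field names are taken from the query in Part III, while the endpoint, the sample values and the function names are assumptions made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; replace it with your own API or webhook URL.
WEBHOOK_URL = "https://example.com/webhook"


def build_event(page: dict, device: dict, ga_user_id: str) -> dict:
    """Merge page and device variables into one flat event."""
    return {"ga": ga_user_id, **page, **device}


def as_post_request(url: str, event: dict) -> Request:
    """Wrap the event as a JSON POST request (built here, not sent)."""
    body = json.dumps(event).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="POST")


def as_get_url(url: str, event: dict) -> str:
    """Encode the event as query-string parameters for a GET request."""
    return f"{url}?{urlencode(event)}"


# Sample values are invented; only the keys mirror the Part III query.
event = build_event(
    page={
        "hostname": "www.example.com",
        "pagePath": "/blog/",
        "protocol": "https:",
        "referrer": "https://www.google.com/",
    },
    device={
        "userAgent": "Mozilla/5.0",
        "operatingSystem": "Windows",
        "browser": "Chrome",
        "browserVersion": "86",
        "deviceCategory": "desktop",
    },
    ga_user_id="GA1.2.123.456",
)

post_request = as_post_request(WEBHOOK_URL, event)
get_url = as_get_url(WEBHOOK_URL, event)
```

Nothing is sent over the network here. Passing `post_request` to `urllib.request.urlopen` would deliver it to whatever endpoint you choose, which is exactly the part that the next blog post fills in.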
Part II: set up a webhook to capture and store events in a database
In this blog post, I explain how to set up Fivetran and ElephantSQL to capture and store events. Traditionally, Fivetran is a data integrator that focuses on moving data from a database or SaaS tool to a data warehouse. But it also has a webhook connector that can be used to collect the events I described in the previous blog post.
Once again, you can replace Fivetran with a Function-as-a-Service from GCP, AWS or Azure, or a message bus/queue such as RabbitMQ or Kafka. I chose Fivetran because it’s super simple to use and it offers a free trial.
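If you go the do-it-yourself route, the core of such a replacement is small: accept a JSON body, stamp it with an arrival time and store it. The sketch below is an assumption-laden illustration, not how Fivetran works internally. It uses an in-memory SQLite database as a stand-in for the PostgreSQL database at ElephantSQL, and the `_created` and `data` columns simply mirror the columns used in the Part III query. The table name `events` and the function names are hypothetical.

```python
import json
import sqlite3
from datetime import datetime, timezone


def create_table(conn: sqlite3.Connection) -> None:
    """Create a table with an arrival timestamp and the raw JSON event."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(_created TEXT NOT NULL, data TEXT NOT NULL)"
    )


def store_event(conn: sqlite3.Connection, raw_body: bytes) -> None:
    """Validate the request body as a JSON object and store it."""
    event = json.loads(raw_body)  # raises ValueError on malformed JSON
    if not isinstance(event, dict):
        raise ValueError("expected a JSON object")
    created = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO events (_created, data) VALUES (?, ?)",
        (created, json.dumps(event)),
    )
    conn.commit()


# SQLite in memory stands in for the real database in this sketch.
conn = sqlite3.connect(":memory:")
create_table(conn)
store_event(conn, b'{"hostname": "www.example.com", "pagePath": "/blog/"}')
rows = conn.execute("SELECT _created, data FROM events").fetchall()
```

In a Function-as-a-Service setup, `store_event` would be called from the HTTP handler with the request body. With a message queue, it would sit in the consumer instead.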
Part III: create reports on your collected data
The last part of this story is about visualizing the data. There are over a hundred dashboarding, reporting and visualization tools, yet I chose to go with Mode. It’s been on my radar for a couple of months and raised 33 million in a Series D less than three months ago. It offers a free plan with some very generous limits, so nothing stopped me from checking out what it’s all about.
If you want to reuse the query from the blog post, here it is:
SELECT
  _created AS TS,
  EXTRACT(hour FROM _created) AS HOUR_OF_DAY,
  data ->> 'ga' AS GA_USERID,
  data ->> 'userAgent' AS USER_AGENT,
  data ->> 'operatingSystem' AS OS,
  data ->> 'browser' AS BROWSER,
  data ->> 'browserVersion' AS BROWSER_VERSION,
  data ->> 'referrer' AS REFERRER,
  data ->> 'deviceCategory' AS DEVICE_CATEGORY,
  data ->> 'protocol' AS PROTOCOL,
  data ->> 'hostname' AS HOSTNAME,
  data ->> 'pagePath' AS PAGE_PATH
FROM "webhooks"."fivetran"
WHERE data ->> 'referrer' <> 'https://www.roelpeters.be'
  AND _created > '2020-10-02'
  AND data ->> 'hostname' = 'www.roelpeters.be'
ORDER BY _created DESC
If you’re building your own web analytics data pipeline, there are some great tools that offer an all-in-one solution. For example, Snowplow is a data collection platform with out-of-the-box schemas (e.g. to mimic Google Analytics) that you can host in your own cloud environment.
However, nothing stops the hobbyist in you from building something yourself. The current data landscape is like a huge LEGO box whose pieces you can mix and match to your preferences. Want to find the proper tool for a specific task? Feel free to try our Big Data Ecosystem Browser.