The reason people love tools like Dataform so much, is that it allows them to automate parts of the ELT workflow. In this blog post we will set our destination dataset, depending on the branch we’re running the definition on.
A really interesting use case is to keep the resulting tables from your scheduled runs on your staging branch(es) separately from the tables created on your production branch. Many possibilities come to mind:
- Set a value in a column — e.g. “staging” and “production” (not within the scope of this blog post)
- Add a prefix or a suffix to your tables and views — e.g. “stg-tablename” and “prd-tablename”
- Use different BigQuery datasets (= Dataform schemas)
Without a doubt, the easiest way to get one of these three solutions done is via environments.
Dataform Environments
Environments are a wrapper around your codebase. Just like environment variables within an operating system or container, they allow you to manipulate and set variables that work through your code, everywhere you use them.
Let’s start with dataform.json: As you can see, I set the defaultSchema parameter to “stg”, which will be default BigQuery dataset where tables will be created or updated.
{ "warehouse": "bigquery", "defaultSchema": "stg", // The default dataet in bigquery is set to stg "assertionSchema": "dataform_assertions", "defaultDatabase": "YOUR_DATABASE" }
By creating an environment within environments.json, one can overwrite the settings from dataform.json, using the configOverride parameter. This is the environment named production, that I created for the master branch. When a job runs on this branch, output will not go to the “stg” dataset, but to the “prd” dataset.
{ "environments": [ { "name": "production", "configOverride": { "defaultSchema": "prd" }, "gitRef": "master" } ] }
Although you don’t need to set the schema explicitly, because we set the default in dataform.json, one can still do it. Within a definition, one can refer to the settings via the dataform object.
config { schema: dataform.projectConfig.defaultSchema, // Optional name: "YOUR_TABLE" }
From this example, it is clear how one can use custom variables to create code and query manipulations that depends on the branch or the environment.