Deploying dbt Projects at Scale on Google Cloud

Managing data models at scale is a common challenge for data teams using dbt (data build tool). Teams often start with simple models that are easy to manage and deploy, but as the volume of data grows and business needs evolve, the complexity of these models increases.
This progression often leads to a monolithic repository where all dependencies are intertwined, making it difficult for different teams to collaborate efficiently. To address this, data teams may find it beneficial to distribute their data models across multiple dbt projects. This approach not only promotes better organisation and modularity but also enhances the scalability and maintainability of the entire data infrastructure.
One significant complexity introduced by handling multiple dbt projects is the way they are executed and deployed. Managing library dependencies becomes a critical concern, especially when different projects require different versions of dbt. While dbt Cloud offers a robust solution for scheduling and executing multi-repo dbt projects, it requires a significant investment that not every organisation can afford or justify. A common alternative is to run dbt projects using Cloud Composer, Google Cloud's managed Apache Airflow service.
Cloud Composer provides a managed environment with a substantial set of pre-installed dependencies. However, in my experience, this setup poses a significant challenge: installing additional Python libraries without running into unresolved dependency conflicts is often difficult. When working with dbt-core, I found that installing a specific dbt version within the Cloud Composer environment was nearly impossible due to conflicting version requirements. This experience highlighted how hard it is to run any dbt version directly on Cloud Composer.
Containerisation offers an effective solution. Instead of installing libraries within the Cloud Composer environment, you can containerise your dbt projects using Docker images and run them on Kubernetes via Cloud Composer. This approach keeps your Cloud Composer environment clean while allowing you to include any required libraries within the Docker image. It also provides the flexibility to run different dbt projects on various dbt versions, addressing dependency conflicts and ensuring seamless execution and deployment.
With the complexities of managing multiple dbt projects addressed, we now move on to the technical implementation of deploying these projects at scale on Google Cloud. The diagram below outlines the process of containerising dbt projects, storing the Docker images in Artifact Registry, and automating the deployment with GitHub Actions. Additionally, it illustrates how these projects are executed on Cloud Composer using the open-source Python package dbt-airflow, which renders dbt projects as Airflow DAGs. The following section will guide you through each of these steps, providing a comprehensive approach to effectively scaling your dbt workflows.

Deploying containerised dbt projects on Artifact Registry with GitHub Actions
In this section, we will define a CI/CD pipeline using GitHub Actions to automate the deployment of a dbt project as a Docker image to Google Artifact Registry. This pipeline streamlines the process, ensuring that your dbt projects are containerised and consistently deployed to a Docker repository from which Cloud Composer can then pull them.
First, let's start with a high-level overview of how the dbt project is structured within the repository. This will help you follow along with the definition of the CI/CD pipeline, since we will be working in certain sub-directories to get things done. Note that Python dependencies are managed via Poetry, hence the presence of the pyproject.toml and poetry.lock files. The rest of the structure shared below should be straightforward to understand if you have worked with dbt in the past.
.
├── README.md
├── dbt_project.yml
├── macros
├── models
├── packages.yml
├── poetry.lock
├── profiles
├── pyproject.toml
├── seeds
├── snapshots
└── tests
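For context, the profiles directory contains the profiles.yml file that tells dbt how to connect to the target data warehouse. The snippet below is a minimal, hypothetical sketch assuming a BigQuery target; the profile name, dataset, location and authentication method are placeholders you would adapt to your own setup.
my_dbt_project:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth          # assumes Application Default Credentials are available at runtime
      project: my-gcp-project-name
      dataset: analytics
      threads: 4
      location: europe-west2
Keeping the profile inside the repository (rather than in ~/.dbt/) makes the Docker image self-contained, since dbt can be pointed at it with the --profiles-dir flag.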
With the project structure in place, we can now move on to defining the CI/CD pipeline. To ensure everyone can follow along, we'll go through the GitHub Actions workflow step by step and explain the purpose of each one. This detailed breakdown will help you understand how to implement and customise the pipeline for your own projects. Let's get started!
Step 1: Creating triggers for the GitHub Actions workflow
The upper section of our GitHub Actions workflow defines the triggers that will activate the pipeline.
name: dbt project deployment
on:
  push:
    branches:
      - main
    paths:
      - 'projects/my_dbt_project/**'
      - '.github/workflows/my_dbt_project_deployment.yml'
Essentially, the pipeline is triggered by push events to the main branch whenever there are changes in the projects/my_dbt_project/** directory or modifications to the GitHub Actions workflow file. This setup ensures that the deployment process runs only when relevant changes are made, keeping the workflow efficient and up-to-date.
Step 2: Defining some environment variables
The next section of the GitHub Actions workflow sets up environment variables, which will be used throughout the subsequent steps:
env:
  ARTIFACT_REPOSITORY: europe-west2-docker.pkg.dev/my-gcp-project-name/my-af-repo
  COMPOSER_DAG_BUCKET: composer-env-c1234567-bucket
  DOCKER_IMAGE_NAME: my-dbt-project
  GCP_WORKLOAD_IDENTITY_PROVIDER: projects/11111111111/locations/global/workloadIdentityPools/github-actions/providers/github-actions-provider
  GOOGLE_SERVICE_ACCOUNT: [email protected]
  PYTHON_VERSION: '3.8.12'
These environment variables store critical information needed for the deployment process, such as the Artifact Registry repository, the Cloud Composer DAG bucket, the Docker image name, the Workload Identity Federation provider and service account used for authentication, and the Python version.
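To illustrate how these variables are consumed, here is a hedged sketch of the kind of steps that typically come next in such a workflow: checking out the repository, authenticating to Google Cloud via Workload Identity Federation, and building the Docker image with a tag pointing at Artifact Registry. The job name, action versions and step layout are assumptions for illustration rather than the exact steps of this pipeline.
jobs:
  deploy-dbt-project:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write   # required for Workload Identity Federation
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ env.GCP_WORKLOAD_IDENTITY_PROVIDER }}
          service_account: ${{ env.GOOGLE_SERVICE_ACCOUNT }}

      - name: Build the Docker image
        working-directory: projects/my_dbt_project
        run: |
          docker build --tag "$ARTIFACT_REPOSITORY/$DOCKER_IMAGE_NAME:latest" .
Referencing the values through the env context (or the exported environment variables inside run blocks) keeps the individual steps free of hard-coded, project-specific details.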