Running Jaffle Shop dbt Project in Docker
If you are new to data build tool (dbt), you have probably come across the so-called Jaffle Shop, a project used for testing and demonstration purposes. `jaffle_shop` is a fictional ecommerce store. This dbt project transforms raw data from an app database into customers and orders models ready for analytics.
One fundamental issue I observed with the Jaffle Shop project is that it expects users, who may be newcomers to dbt, to configure and host a local database for the dbt models to materialize.
In this tutorial, I'll demonstrate how to create a containerized version of the project using Docker. This will allow us to deploy a Postgres instance and configure the dbt project to read from and write to that database. I'll also provide a link to a GitHub project I've created that will help you get all the services up and running in no time.
Creating the Dockerfile and docker-compose.yml
Let's begin by defining the services we want to run through Docker. First, we'll create a [docker-compose.yml](https://github.com/gmyrianthous/jaffle_shop/blob/main/docker-compose.yml)
file where we'll define two services. The first service will be the Postgres database, and the second will be a custom service that we'll create in the next step using a Dockerfile.
# docker-compose.yml
version: "3.9"
services:
  postgres:
    container_name: postgres
    image: postgres:15.2-alpine
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5
  dbt:
    container_name: dbt
    build: .
    image: dbt-jaffle-shop
    volumes:
      - ./:/usr/src/dbt
    depends_on:
      postgres:
        condition: service_healthy
The file specifies the version of Docker Compose being used (version 3.9) and defines two services, `postgres` and `dbt`, each with its own settings.
The `postgres` service is based on the official `postgres` Docker image, version 15.2-alpine. It sets the container name to `postgres`, maps port 5432 (the default port for Postgres) to the host machine, and sets environment variables for the Postgres user and password. The `healthcheck` section specifies a command to test whether the container is healthy, along with an interval, a timeout and the number of retries for the check.
The `dbt` service builds the Docker image for the `dbt` container from the current directory (using the Dockerfile we'll create next). It mounts the current directory as a volume inside the container and declares a dependency on the `postgres` service, so it will only start once the `postgres` service is healthy.
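Once the stack is up, you can verify that the health check behaves as expected. As a quick sanity check (this command isn't part of the repository), `docker inspect` can print the health status Docker has recorded for the `postgres` container:
# Prints "starting", "healthy" or "unhealthy" depending on the last health check
$ docker inspect --format '{{.State.Health.Status}}' postgres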
In order to containerize the Jaffle Shop project, we need to create a [Dockerfile](https://github.com/gmyrianthous/jaffle_shop/blob/main/Dockerfile)
that installs the necessary dependencies for both Python and dbt, and ensures that the container remains active once the environment has been set up.
# Dockerfile
FROM --platform=linux/amd64 python:3.10-slim-buster
RUN apt-get update \
    && apt-get install -y --no-install-recommends
WORKDIR /usr/src/dbt
# Install the dbt Postgres adapter. This step will also install dbt-core
RUN pip install --upgrade pip
RUN pip install dbt-postgres==1.2.0
RUN pip install pytz
# Install dbt dependencies (as specified in packages.yml file)
# Build seeds, models and snapshots (and run tests wherever applicable)
CMD dbt deps && dbt build --profiles-dir ./profiles && sleep infinity
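The `dbt deps` step in the CMD installs any packages listed in the project's `packages.yml` file. As a hypothetical illustration (the repository may pin different packages, or none at all), such a file could look like the following:
# packages.yml (hypothetical example)
packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.2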
Configuring Postgres with dbt
To interact with dbt, we'll use the dbt Command Line Interface (CLI). A directory containing a [dbt_project.yml](https://github.com/gmyrianthous/jaffle_shop/blob/main/dbt_project.yml)
file is considered a dbt project by the dbt CLI.
We'll create one and specify some basic configurations, such as the dbt project name and the `profile` to use (which we'll create in the next step). Additionally, we'll specify the paths containing the various dbt entities and provide configuration about how they should be materialized.
# dbt_project.yml
name: 'jaffle_shop'
config-version: 2
version: '0.1'
profile: 'jaffle_shop'
model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
analysis-paths: ["analysis"]
macro-paths: ["macros"]
target-path: "target"
clean-targets:
  - "target"
  - "dbt_modules"
  - "logs"
require-dbt-version: [">=1.0.0", "<2.0.0"]
models:
  jaffle_shop:
    materialized: table
    staging:
      materialized: view
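The `models` block sets folder-level defaults: every model in the project materializes as a table, except those under `staging`, which materialize as views. Note that an individual model can override these defaults with an inline `config()` block, which takes precedence over `dbt_project.yml`. As a small sketch (the file name below is hypothetical and not part of the repository):
-- models/staging/stg_example.sql (hypothetical model, for illustration only)
-- The inline config() overrides the folder-level "view" default from dbt_project.yml
{{ config(materialized='table') }}
select * from {{ ref('raw_orders') }}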
The `profiles.yml` file, in turn, is used to store dbt profiles. A profile consists of targets, each of which specifies the connection details and credentials for the database or data warehouse.
# profiles.yml
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: postgres
      host: postgres
      user: postgres
      password: postgres
      port: 5432
      dbname: postgres
      schema: public
      threads: 1
This file defines a profile named `jaffle_shop` that specifies the connection details for a Postgres database running in a Docker container named `postgres`.
- `jaffle_shop`: This is the name of the profile. It's an arbitrary name chosen by the user to identify the profile.
- `target: dev`: This specifies the default target for the profile, which in this case is named `dev`.
- `outputs`: This section lists the output configurations for the profile, with the default output configuration named `dev`.
- `dev`: This specifies the connection details for the `dev` target, which uses a Postgres database.
- `type: postgres`: This specifies the type of the output, which in this case is a Postgres database.
- `host: postgres`: This specifies the hostname or IP address of the Postgres database server.
- `user: postgres`: This specifies the username used to connect to the Postgres database.
- `password: postgres`: This specifies the password used to authenticate with the Postgres database.
- `port: 5432`: This specifies the port number on which the Postgres database is listening.
- `dbname: postgres`: This specifies the name of the Postgres database to connect to.
- `schema: public`: This specifies the schema name to use when executing queries against the database.
- `threads: 1`: This specifies the number of threads to use when running dbt tasks.
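Before building any models, you can verify that dbt is able to reach the Postgres instance with the `dbt debug` command, which runs connection checks against the active target. For example, from inside the `dbt` container (see the following sections for how to access it):
# Validate the project configuration and the connection to Postgres
$ dbt debug --profiles-dir ./profiles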
Jaffle Shop dbt models and seeds
The source data for the Jaffle Shop project consists of CSV files for customers, payments and orders. In dbt, we can load this data into our database through seeds. We then build dbt models on top of this source data.
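The seeded tables are first cleaned up by lightweight staging models that simply rename and select the relevant columns. As a rough sketch of what one of these looks like (the exact column names may differ slightly from the repository):
-- models/staging/stg_customers.sql (illustrative sketch)
with source as (
    -- reference the raw_customers seed loaded via `dbt seed`
    select * from {{ ref('raw_customers') }}
),
renamed as (
    select
        id as customer_id,
        first_name,
        last_name
    from source
)
select * from renamed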
Here's an example model that generates some metrics for our customers:
with customers as (
select * from {{ ref('stg_customers') }}
),
orders as (
select * from {{ ref('stg_orders') }}
),
payments as (
select * from {{ ref('stg_payments') }}
),
customer_orders as (
select
customer_id,
min(order_date) as first_order,
max(order_date) as most_recent_order,
count(order_id) as number_of_orders
from orders
group by customer_id
),
customer_payments as (
select
orders.customer_id,
sum(amount) as total_amount
from payments
left join orders on
payments.order_id = orders.order_id
group by orders.customer_id
),
final as (
select
customers.customer_id,
customers.first_name,
customers.last_name,
customer_orders.first_order,
customer_orders.most_recent_order,
customer_orders.number_of_orders,
customer_payments.total_amount as customer_lifetime_value
from customers
left join customer_orders
on customers.customer_id = customer_orders.customer_id
left join customer_payments
on customers.customer_id = customer_payments.customer_id
)
select * from final
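Models and seeds are typically accompanied by tests declared in a YAML file alongside the models, which is what the `dbt test` step in `dbt build` picks up. As a minimal sketch (the file in the repository may declare more models, columns and tests), a `schema.yml` like the one below would verify that customer IDs are unique and never null:
# models/schema.yml (minimal sketch)
version: 2
models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null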
Running the services via Docker
Now let's build and spin up our Docker services. To do so, we'll simply need to run the following commands:
$ docker-compose build
$ docker-compose up
The commands above will spin up a Postgres instance and then build the dbt resources of the Jaffle Shop project as specified in the repository. These containers will remain up and running so that you can:
- Query the Postgres database and the tables created out of dbt models
- Run further dbt commands via dbt CLI
Running dbt commands via CLI
The dbt container has already built the specified models. However, we can still access the container and run further dbt commands via the dbt CLI, for instance for new or modified models. To do so, we first need to access the container.
The following command will list all active containers:
$ docker ps
Copy the id of the `dbt` container and substitute it into the next command:
$ docker exec -it <container-id> /bin/bash
The command above will essentially give you access to the container's bash shell, which means you can now run dbt commands.
# Install dbt deps (might not be required if your `packages.yml` file is missing or empty)
dbt deps
# Build seeds
dbt seed --profiles-dir profiles
# Build data models
dbt run --profiles-dir profiles
# Build snapshots
dbt snapshot --profiles-dir profiles
# Run tests
dbt test --profiles-dir profiles
Note that since we have mounted the local directory onto the running container, any changes in the local directory will be reflected in the container immediately. This means you can also create new models or modify existing ones, then go into the running container and build models, run tests, and so on.
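For example, assuming you've just changed a single model, you can rebuild and test only that model from inside the container. The commands below use the `customers` model; swap in whichever model you're working on:
# Build only the selected model
dbt run --select customers --profiles-dir profiles
# Run only the tests defined for that model
dbt test --select customers --profiles-dir profiles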
Querying the dbt models on Postgres database
You can also query the Postgres database and the dbt models or snapshots created in it. In the same way, we first have to enter the running `postgres` container in order to query the database directly.
# Get the container id for `postgres` service
$ docker ps
# Then copy the container id to the following command to enter the
# running container
$ docker exec -it <container-id> /bin/bash
We will then use `psql`, a terminal-based interface for PostgreSQL that allows us to query the database:
$ psql -U postgres
The two commands shared below can be used to list tables and views respectively:
postgres=# \dt
List of relations
Schema | Name | Type | Owner
--------+---------------+-------+----------
public | customers | table | postgres
public | orders | table | postgres
public | raw_customers | table | postgres
public | raw_orders | table | postgres
public | raw_payments | table | postgres
(5 rows)
postgres=# \dv
List of relations
Schema | Name | Type | Owner
--------+---------------+------+----------
public | stg_customers | view | postgres
public | stg_orders | view | postgres
public | stg_payments | view | postgres
(3 rows)
And you can now query dbt models via a `SELECT` query:
SELECT * FROM <table-name>;
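For instance, the following query (purely illustrative) returns the top ten customers by lifetime value from the `customers` model we built earlier:
-- Top customers by lifetime value, as computed by the customers dbt model
SELECT
    customer_id,
    first_name,
    last_name,
    number_of_orders,
    customer_lifetime_value
FROM customers
ORDER BY customer_lifetime_value DESC NULLS LAST
LIMIT 10;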
Getting the full code
I've created a GitHub repository that you can clone to your local machine and use to get the containerised version of the Jaffle Shop dbt project up and running quickly. You can find the project, as well as the code shared in this tutorial, at the following link.
GitHub – gmyrianthous/jaffle_shop: This is a containerised version of Jaffle Shop dbt project
Final Thoughts
Data build tool (dbt) is one of the rapidly growing technologies in modern data stacks. If you're just starting to learn how to use dbt, I highly recommend experimenting with the Jaffle Shop project. It's a self-contained project created by dbt Labs for testing and experimentation purposes.
dbt is a tool commonly used by data analysts and analytics engineers (in addition to data engineers), and it requires a connection to a database or data warehouse. However, many analysts might not be comfortable configuring and initializing a local database.
In this article, we demonstrated how to get started with dbt and run all the services required to materialize dbt models on a local Postgres database. I hope this tutorial helps you get your dbt project and database up and running as quickly as possible. If you experience any issues running the project, please let me know in the comments, and I'll do my best to help you debug your code and configuration.