Learn dbt the Easy Way

dbt, or data build tool, is all the rage in the data modeling world. And why wouldn't it be? It's easy to use, fun to learn, and helps you follow coding best practices.
dbt is truly an analytics engineer's dream! It makes our lives easier by saving us time writing code. It allows us to test all of our data sources and data models directly within the project, with no external tools needed.
You can make your project as simple or as complicated as you want. dbt uses basic SQL to write its data models. It also uses a templating language called Jinja, but you don't have to use it if you don't want to. You can choose to write all your data models in plain SQL or write more complicated functions in Jinja.
This is dbt's superpower. If you want to learn how to use it, you easily can. It builds off the SQL knowledge most data analysts and data engineers already have.
And, for the more advanced folk who have been using dbt for years, you can build it into so much more than basic SQL. There's room for everyone to have some fun!
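To give a feel for the difference, here is a minimal sketch of a model that starts as plain SQL and layers in a little Jinja to avoid repeating the same aggregation. The stg_payments model and its payment_method and amount columns are hypothetical, used purely for illustration.
-- a minimal sketch of a dbt model that mixes plain SQL with a little Jinja
-- (stg_payments and its payment_method/amount columns are made up for illustration)
{% set payment_methods = ['credit_card', 'paypal'] %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount end) as {{ method }}_amount
    {%- if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}
group by order_id
At run time, the Jinja loop compiles down to one sum() column per payment method, so your warehouse only ever sees plain SQL.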
In this article, I'll introduce you to the basics of dbt so that you can utilize this data transformation tool in your data models. We will discuss the files required to create a project and their structure, as well as the key concepts you need to know in order to understand how dbt works. You'll learn to use data build tool in no time!
dbt Files Needed in Your Project
In order to create a dbt project, you need a few key files. Luckily, when you create your project, the first file is already populated for you.
dbt_project.yml
This is the first configuration file that gets automatically created in dbt when you run the dbt init <project_name> command. It signals to dbt that your environment is a dbt environment and includes all of the information the tool needs to get started.
Here, you name the project as well as specify the name of the profile you will be using for your project (which we will discuss next). If customizing the paths in your project, this is also where you would define those. However, this isn't necessary when just starting out with dbt.
name: 'madison_medium_project'
version: '1.0.0'
config-version: 2
# This setting configures which "profile" dbt uses for this project.
profile: 'madison_db'
# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target" # directory which will store compiled Sql files
clean-targets: # directories to be removed by `dbt clean`
- "target"
- "dbt_packages"
profiles.yml
Your profiles.yml file contains your database-specific information. You will want to include this file in your .gitignore if you are using GitHub, since it contains confidential information. The profile contains the access information needed to read and write to and from your data warehouse.
Depending on the data warehouse you are using, you will have to specify different information in this file. You can find the specific setup instructions for BigQuery, Postgres, Redshift, and some others here.
As for Snowflake, you need to specify a username and password, role, default warehouse, and default schema. I highly recommend creating a username specifically for your dbt environment, a role for external transformation tools, and a warehouse for transforming. This will help you keep track of dbt's actions and how much it is costing you.
madison_db:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: [account id]

      # User/password auth
      user: dbt_user
      password: [password]

      role: transformer
      database: dbt_development
      warehouse: transform_wh
      schema: staging
      threads: 100
Your database name should refer to the database where your data models will be written. While the schema should depend on the model itself, I typically assign the default to the staging schema, which we will discuss later.
One last thing to note: the name of this block is the profile name. This should match the string you used in the profile block of your dbt_project.yml file.
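The schema in profiles.yml is only a default; an individual model can route itself to a different schema with a config block. Here is a minimal sketch, where the core schema name, the stg_orders model, and the query body are all hypothetical. By default, dbt appends the custom schema to the target schema (producing something like staging_core) unless you override that behavior.
-- a hypothetical model that overrides the default schema from profiles.yml
{{ config(schema='core') }}

select
    user_id,
    count(*) as total_orders
from {{ ref('stg_orders') }}  -- stg_orders is a made-up staging model
group by user_id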
src.yml
The src.yml file can be named anything, but it always follows the same structure. For simplicity's sake, I recommend always naming it this way, with the data source itself specified. So, if I was creating a source file for all of my Facebook data, I would name it src_facebook.yml.
This file references all of your raw data tables and where they live within your data warehouse. Rather than reference your raw data directly, your dbt models will always reference your sources.
Complex transformations read from staging models in dbt, which are simple SQL SELECT statements from the raw data tables. But we will discuss the intricacies of each type of dbt data model later on.
Source files are mainly for thorough documentation. They contain all of the details about your raw data as well as the models that reference them. While it's not a requirement to document attributes like column names and definitions, it is a best practice. If you want to be a good analytics engineer, you need to start with proper documentation.
version: 2
sources:
  - name: facebook
    database: raw
    schema: facebook
    tables:
      - name: fb_pages
        description: "All of the pages viewed by users"
        columns:
          - name: timestamp
            description: "the date and time the user viewed the page"
          - name: anonymous_id
            description: "the anonymous id of the user"
          - name: user_id
            description: "the unique id of the user"
When documenting a source, you MUST document the database and schema it is located in, as well as its table names. If these are not documented, dbt will not be able to find the raw data to read from.
Then, when this table is needed downstream, the source would be referenced like so:
{{ source('facebook', 'fb_pages') }}
dbt looks for the name of the source first by searching through the various YAML files, and then looks at the tables defined underneath that source. If it cannot find that table name under that source, it will throw an error.
So, always follow this convention when referencing a source:
{{ source('source_name', 'table_name') }}
And, again, as best practice, always document a source's column names and definitions.
dbt Key Concepts
Now that you understand the structure of a dbt project, let's review some concepts that you'll need to know in order to become a good dbt developer.
Modularity
dbt helps analytics engineers write modular data models. That is why it became so popular! Engineers were constantly rewriting the same code, wasting time and energy. Data models became tightly wound together, making it difficult to track down different pieces of the puzzle. dbt stepped in and made it so that the same piece of code can be referenced in multiple different data models. No more repeating the same logic over and over again!
dbt data models are written to be reused and referenced in multiple models, across all different business domains. If the same piece of code is written in two different places, it should be made into its own data model and then referenced in those two places. This way your code only has to be run once, saving you computation costs.
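As a concrete sketch of what that reuse looks like, the model below references an existing model through dbt's ref() function instead of copying its logic. The stg_facebook_pages name is hypothetical and stands in for any model you want to reuse.
-- a hypothetical downstream model
-- ref() points dbt at another model, so the page-view logic is written only once
select
    user_id,
    count(*) as pages_viewed
from {{ ref('stg_facebook_pages') }}
group by user_id
Any number of models can reference stg_facebook_pages this way, and dbt will build it once per run and work out the dependency order for you.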
Staging Models
Staging models help to promote modularity within your dbt project. They exist as a liaison between your raw data and more complex data models. These are the models that read directly from the data sources you specify in your src.yml files. They contain basic functions such as renaming, timezone conversion, and datatype casting. They keep the integrity of your raw data sources without being the actual copy of your raw data. A simple dbt staging model may look something like this:
select
    ad_id AS facebook_ad_id,
    account_id,
    ad_name AS ad_name_1,
    adset_name,
    month(date) AS month_created_at,
    date::timestamp_ntz AS created_at,
    spend
from {{ source('facebook', 'basic_ad') }}
Each of the different types of dbt data models has its own folder that exists in your dbt project. All staging models are typically stored in a "staging" folder with their corresponding src.yml files. This isn't necessary, but I always recommend it for greater organization.
Intermediate Models
Intermediate models sit somewhere between the source data and the final data model that is actually used by business teams. They reference source data and other models but are never actually seen by the data analyst. You can think of them as models that contain key business logic that is repeated throughout multiple data models but not directly used. For example, they could contain logic that maps a user's anonymous id to their actual user id. They could also contain a currency conversion calculation from dollars to pounds. These are both situations that occur frequently within a business but aren't necessarily producing a dataset that the business needs. The business needs these in conjunction with some other metric rather than in isolation. Here is an example of a recent intermediate model I wrote:
with
marketing_spend_by_day AS (
    select spend_date, spend, 'adwords' AS source from {{ ref('adwords_spend_by_day') }}
    UNION ALL
    select spend_date, spend, 'facebook' AS source from {{ ref('fb_spend_by_day') }}
)
select * from marketing_spend_by_day
I wanted to combine the Facebook and Google Ads spend datasets so that I could perform a metric calculation on the entire dataset rather than on just one of them. A simple model like this, where two datasets are merged, is something that may be commonly used across different data models.
Core Models
Lastly, dbt calls the end-result data models core data models. These are the models that are directly used by data analysts and business teams. They contain the data needed to calculate KPIs. These models reference both staging and intermediate data models. They are very similar to intermediate data models, except that they are public facing. Unlike staging models, they do not reference raw data. Like intermediate data models, they contain complex SQL functions. They typically look something like this:
with fb_spend_summed AS (
    select
        month(created_at) AS spend_month,
        year(created_at) AS spend_year,
        created_at AS spend_date,
        sum(spend) AS spend
    from {{ ref('stg_facebook_ads') }}
    where spend != 0
    group by
        created_at,
        month(created_at),
        year(created_at)
)
select * from fb_spend_summed
While this one is pretty simple, it produces a dataset that would be used by business teams to understand marketing spend.
dbt Tests
Lastly, dbt is powerful because of its built-in testing. You can test all of your data sources and models written in dbt with dbt tests. The tool offers generic tests, which should be added to almost all models and sources, as well as custom-built packages for more advanced testing. I highly recommend starting with the generic tests not_null, unique, and accepted_values. These can be used to ensure you are ingesting and producing high-quality data.
version: 2
sources:
  - name: facebook
    database: raw
    schema: facebook
    tables:
      - name: fb_pages
        description: "All of the pages viewed by users"
        columns:
          - name: timestamp
            description: "the date and time the user viewed the page"
            tests:
              - not_null
          - name: anonymous_id
            description: "the anonymous id of the user"
            tests:
              - not_null
          - name: user_id
            description: "the unique id of the user"
            tests:
              - not_null
You simply add these tests under the block where the column is defined in your src.yml file. You'll want to use a "tests" block and then specify the name of the test. Here I added a not_null test to each of the columns in my fb_pages source. If one of these columns does produce a null value, the test will fail. This will then let me know that something is probably wrong with my data ingestion tool or my source data. This is a powerful feature for maintaining high-quality data!
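Beyond these generic tests, dbt also lets you write your own one-off tests as plain SQL files in your project's tests/ directory: dbt runs the query, and the test fails if it returns any rows. Here is a minimal sketch that checks the stg_facebook_ads model from earlier for negative spend; the file name is just an illustration.
-- tests/assert_no_negative_spend.sql (hypothetical file name)
-- the test fails if any rows come back, i.e. if any ad has negative spend
select
    facebook_ad_id,
    spend
from {{ ref('stg_facebook_ads') }}
where spend < 0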
Conclusion
Producing a dbt project with well-written data models and thorough testing is one of the best skills you can master as an analytics engineer. Understanding and practicing the basics as outlined in this article will help you build a solid foundation that you can then expand on as you develop your dbt skills.
First, build your dbt project and connect it to your data warehouse. Then, create a folder for each type of data model. Start by documenting all of your data sources within YAML files and then build the staging models that reference these sources. Lastly, you can write your complex data transformations within intermediate and core data models, utilizing best practices like modularity.
Build a solid dbt foundation and you will only continue to grow into an amazing analytics engineer! For more on analytics engineering, the modern data stack, and dbt, subscribe to my free weekly newsletter.