Building a Data Warehouse
In this story, I would like to talk about data warehouse design and how we organise the process. Data modelling is an essential part of data engineering: it defines the database structure, the schemas we use and the data materialisation strategies for analytics. Designed the right way, it helps ensure our data warehouse runs efficiently and meets both business requirements and cost-optimisation targets. We will touch on some well-known best practices in data warehouse design using dbt as an example, and take a closer look at how to organise the build process, test our datasets and use advanced techniques with macros for better workflow integration and deployment.
Structure
Let's say we have a data warehouse and lots of SQL that deals with the data in it.
In my case it is Snowflake. A great tool and one of the most popular solutions on the market right now, definitely among the leaders in this space.
So how do we structure our data warehouse project? Consider the starter project folder structure below. This is what we get after running the dbt init command.
.
├── README.md
├── analyses
├── dbt_project.yml
├── logs
│   └── dbt.log
├── macros
├── models
│   └── example
│       ├── schema.yml
│       ├── table_a.sql
│       └── table_b.sql
├── profiles.yml
├── seeds
├── snapshots
├── target
│   ├── compiled
│   ├── graph.gpickle
│   ├── graph_summary.json
│   ├── manifest.json
│   ├── partial_parse.msgpack
│   ├── run
│   ├── run_results.json
│   └── semantic_manifest.json
└── tests
At the moment we can see only one example folder with table_a and table_b models. These can be any warehouse objects that relate to each other in a certain way, i.e. views, tables, dynamic tables, etc.
When we start building our data warehouse the number of these objects will inevitably grow, and it is best practice to keep them organised.
A simple way of doing this is to split the models folder into base models (basic row-level transformations) and analytics models. In the analytics subfolder we would typically keep data that is deeply enriched and transformed, always production-grade and ready for Business Intelligence (BI). For instance, if we keep our data in different databases, we would have the SQL touching the raw_prod and base_prod databases in one folder and the SQL templates for analytics_prod in another:
.
└── models
    └── some_data_source
        ├── _data_source_model__docs.md
        ├── _data_source__models.yml
        ├── _sources.yml
        ├── base
        │   ├── base_transactions.sql
        │   └── base_orders.sql
        └── analytics
            ├── _analytics__models.yml
            ├── some_model.sql
            └── some_other_model.sql

The good thing is that these SQL files are templates and we can easily deploy them in dev, prod or test databases in our Snowflake data warehouse [1].
Under this structure we can see only one data source for now. It can be anything ingested into our data warehouse, e.g. an OLTP database, an external API source or some event stream. At this point the model looks fairly simple, but often the number of data sources starts to grow and models become more complex. It is good practice to keep data sources separate and move analytics-ready models into the marts folder, i.e.:
└── models
    ├── intermediate
    │   └── finance
    │       ├── _int_finance__models.yml
    │       └── int_payments_pivoted_to_orders.sql
    ├── marts
    │   ├── finance
    │   │   ├── _finance__models.yml
    │   │   ├── orders.sql
    │   │   └── payments.sql
    │   └── marketing
    │       ├── _marketing__models.yml
    │       └── customers.sql
    └── staging
        ├── some_data_source
        │   ├── _data_source_model__docs.md
        │   ├── _data_source__models.yml
        │   ├── _sources.yml
        │   └── base
        │       ├── base_transactions.sql
        │       └── base_orders.sql
        └── another_data_source
            ├── _data_source_model__docs.md
            ├── _data_source__models.yml
            ├── _sources.yml
            └── base
                ├── base_marketing.sql
                └── base_events.sql
In this case we keep everything simple and atomic under the base folders, while intermediate summarises the stacking pieces of logic and everything we might need to prepare our data for analytics. Intermediate models must be clear and modular so we can incorporate and reuse them further downstream.
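To make this concrete, a base model is usually nothing more than a light, row-level cleanup of a declared source. Below is a minimal sketch, assuming a transactions table declared under the some_data_source source in _sources.yml (the column names are illustrative):
-- models/staging/some_data_source/base/base_transactions.sql
-- a minimal sketch: rename and cast only, so downstream models can rely
-- on consistent column names and types
select
    id,
    amount::number(38, 2)     as amount,
    updated_at::timestamp_ntz as updated_at,
    loaded_at::timestamp_ntz  as loaded_at
from {{ source('some_data_source', 'transactions') }}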
Incremental models
Using incremental models is a good practice to improve data processing times and cost. It helps a lot when we need to prepare our base models efficiently. Consider the example below: this SQL has to scan all table partitions just to keep the latest record for each id:
CREATE OR REPLACE TABLE "raw_dev"."datasource"."transactions" AS (
    SELECT DISTINCT * FROM (
        SELECT
            *,
            RANK() OVER (PARTITION BY id ORDER BY updated_at::TIMESTAMP DESC, loaded_at::TIMESTAMP DESC) AS rank
        FROM "base_dev"."datasource"."transactions"
    )
    WHERE rank = 1
);
We can do so much better using MERGE instead:
merge into base_dev.datasource.transactions as DBT_INTERNAL_DEST
using base_dev.datasource.transactions__dbt_tmp as DBT_INTERNAL_SOURCE
    on (
        DBT_INTERNAL_SOURCE.load_ts > dateadd(day, -30, current_date)
    )
    and (
        DBT_INTERNAL_SOURCE.id = DBT_INTERNAL_DEST.id
    )
when matched then update set
    "ID" = DBT_INTERNAL_SOURCE."ID",
    ...
    "AMOUNT" = DBT_INTERNAL_SOURCE."AMOUNT",
    ...
when not matched then insert
    ("ID", ..., "LOADED_AT")
values
    ("ID", ..., "LOADED_AT")
;
It will scan only the latest partitions, using id as the unique key to calculate the latest state of our base table. In dbt this is called an incremental update, and the compiled SQL above would normally originate from a template like this:
{{
    config(
        materialized='incremental',
        unique_key='id',
        on_schema_change='append_new_columns',
        incremental_strategy='merge',
        cluster_by=['load_ts'],
        incremental_predicates=[
            "DBT_INTERNAL_SOURCE.load_ts > dateadd(day, -30, current_date)"
        ],
        schema='datasource',
        alias='transactions',
        query_tag='analytics_ready',
        tags=["incremental", "transactions"]
    )
}}

select
    *
from {{ ref('transactions_view') }}
{% if is_incremental() %}
    -- a typical incremental filter (illustrative): only read rows newer than the current max in the target
    where load_ts > (select max(load_ts) from {{ this }})
{% endif %}
qualify
    RANK() OVER (PARTITION BY id ORDER BY updated_at::TIMESTAMP DESC, load_ts::TIMESTAMP DESC) = 1
A good and simple example explaining MERGE can be found in [2].
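One practical note: when the transformation logic of an incremental model changes, the already-materialised table can be rebuilt from scratch using dbt's --full-refresh flag, for example by selecting the tag we set in the config above:
dbt run --select tag:incremental --full-refresh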
Macros
Macros in dbt are pieces of code that can be reused across our SQL templates, the analogue of a function in a programming language.
Supercharge your SQL with macros!
-- macros/item_count.sql
{% macro item_count(item) %}
SUM(CASE WHEN ITEM = '{{item}}' THEN 1 ELSE 0 END)
{% endmacro %}
This can be used in one of our models like so:
-- models/orders_count.sql
SELECT ORDER_ID,
    ORDER_DATE,
    {{ item_count('apple') }} AS apples,
    {{ item_count('orange') }} AS oranges
FROM ORDERS
GROUP BY ORDER_ID, ORDER_DATE
So when we compile the model it will look like this:
SELECT ORDER_ID,
    ORDER_DATE,
    SUM(CASE WHEN ITEM = 'apple' THEN 1 ELSE 0 END) AS apples,
    SUM(CASE WHEN ITEM = 'orange' THEN 1 ELSE 0 END) AS oranges
FROM ORDERS
GROUP BY ORDER_ID, ORDER_DATE
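Macros also combine nicely with Jinja control structures. A small sketch below (the item list is purely illustrative) loops over a list instead of repeating the macro call by hand:
-- models/orders_count_looped.sql
SELECT ORDER_ID,
    ORDER_DATE,
    {% for item in ['apple', 'orange', 'banana'] %}
    {{ item_count(item) }} AS {{ item }}s{{ "," if not loop.last }}
    {% endfor %}
FROM ORDERS
GROUP BY ORDER_ID, ORDER_DATE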
I think the idea is clear, but we can do much more. For example, we can use macros to dynamically inject the custom database name depending on whether it is a production or dev environment. This approach helps a lot in development and testing. Consider the dbt profiles configuration below. We have three different Snowflake environments:
- production – where data is pristine: clean, quality-approved and unit-tested
- dev – where data transformation development happens
- test – the environment for automated tests
# You can select which target to use at runtime by passing the -t or --target parameter to the CLI: for example, dbt run -t test would run the project against the test connection. The target: dev line below specifies that the dev target should be used by default if no target is given at runtime.
my-snowflake-warehouse:
  target: dev
  outputs:
    dev:
      account: my-snowflake-warehouse
      database: raw_dev
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: dbt_dev
      schema: default_schema_name
      threads: 1
      type: snowflake
      warehouse: test_wh
    prod:
      account: my-snowflake-warehouse
      database: raw_prod
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: dbt_prod
      schema: default_schema_name
      threads: 1
      type: snowflake
      warehouse: test_wh
    test:
      account: my-snowflake-warehouse
      database: raw_test
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      role: dbt_dev
      schema: default_schema_name
      threads: 1
      type: snowflake
      warehouse: test_wh
To dynamically inject custom database names we just need to create this macro:
-- ./macros/generate_database_name.sql
{% macro generate_database_name(custom_database_name=none, node=none) -%}
    {%- set default_database = target.database -%}
    {%- if custom_database_name is none -%}
        {{ default_database }}
    {%- else -%}
        {{ custom_database_name | trim }}
    {%- endif -%}
{%- endmacro %}
So now whenever we compile our models, dbt will apply the custom database name from the model's config, i.e.:
dbt run -t dev  -> select * from raw_dev.schema.table
dbt run -t prod -> select * from raw_prod.schema.table
dbt run -t test -> select * from raw_test.schema.table
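For the macro above to kick in, a model (or a whole folder via dbt_project.yml) has to request a custom database in its config; otherwise the target database from profiles.yml is used. A minimal sketch, assuming we want a base model to land in the base_* database matching the current target (the model and source names are illustrative):
-- the database config value is what generate_database_name() receives as custom_database_name
{{ config(
    database='base_' ~ target.name,   -- resolves to base_dev / base_prod / base_test
    schema='datasource'
) }}

select * from {{ source('some_data_source', 'transactions') }}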
Custom schemas and custom materializations
Custom schemas are another great application of dbt macros. During the development process it might be useful to apply a custom suffix to the schema name to indicate who it belongs to. Indeed, before we actually merge and promote our code to staging it might be better to run some transformations in our own schema, i.e. dev_raw.events_mikes.some_table instead of dev_raw.events.some_table. This macro will apply custom schema names:
-- ./macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
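The override above simply uses the custom schema name as-is. To actually get the developer-specific suffix mentioned earlier (events_mikes), one option — my own variant, not part of the macro above — is to append a developer name taken from an environment variable, but only on the dev target:
-- ./macros/generate_schema_name.sql (a sketch of a dev-only suffix variant)
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- elif target.name == 'dev' -%}
        {#- in dev, suffix the schema with the developer's name, e.g. events_mikes -#}
        {{ custom_schema_name | trim }}_{{ env_var('DBT_DEVELOPER', 'dev') }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}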
Whatever custom feature we need – it can be done using macros!
Let's consider a materialization as an example. By default dbt has four basic materialization types for our models – view, table, incremental (we spoke about it above) and ephemeral. Views and tables are simple, but what is the ephemeral type? Ephemeral materialization can be useful at the beginning of a data transformation pipeline: it is essentially a CTE, and in dbt any SQL in an ephemeral model compiles into a common table expression that is injected into the models referencing it:
WITH cte as (
...
)
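Declaring an ephemeral model is just a one-line config; a minimal sketch (model and column names are illustrative):
-- models/staging/some_data_source/base/positive_transactions.sql
-- nothing is created in the warehouse; any model that refs this one
-- gets it inlined as a CTE at compile time
{{ config(materialized='ephemeral') }}

select id, amount
from {{ ref('base_transactions') }}
where amount > 0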
However, in many scenarios this might not be enough. Imagine we want to create a custom SQL script that produces some output but can also be referenced using the ref() function.
We can create a custom materialization for this like so:
--./macros/operation.sql
{%- materialization operation, default -%}

    {%- set identifier = model['alias'] -%}
    {%- set target_relation = api.Relation.create(
        identifier=identifier, schema=schema, database=database,
        type='table') -%}

    -- ... setup database ...
    -- ... run pre-hooks...

    -- build model
    {% call statement('main') -%}
        {{ sql }}
    {%- endcall %}

    -- ... run post-hooks ...
    -- ... clean up the database...

    -- `COMMIT` happens here
    {{ adapter.commit() }}

    -- Return the relations created in this materialization
    {{ return({'relations': [target_relation]}) }}

{%- endmaterialization -%}
Now if we set this operation as the materialization in our model config, it will simply run the SQL and we will be able to reference the result using the ref() function:
{{ config(
    materialized='operation',
    alias='table_a',
    schema='schema',
    tags=["tag"]
) }}

create or replace table {{ this.database }}.{{ this.schema }}.{{ this.name }} (
    id number(38,0)
    ,comments varchar(100)
);
# dbt run --select table_a.sql
19:10:08 Concurrency: 1 threads (target='dev')
19:10:08
19:10:08 1 of 1 START sql operation model schema.table_a ................................ [RUN]
19:10:09 1 of 1 OK created sql operation model schema.table_a ........................... [SUCCESS 1 in 1.10s]
19:10:09
19:10:09 Finished running 1 operation model in 0 hours 0 minutes and 4.16 seconds (4.16s).
19:10:09
19:10:09 Completed successfully
19:10:09
19:10:09 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
Now if we want to have another table or a view that can reference this operation we can use the standard ref() function and our table_a would appear as a dependency in data lineage:

--./models/example/table_b.sql
{{ config(
    tags=["example"],
    schema='schema'
) }}

select *
from {{ ref('table_a') }}
where id = 1
Tests
Tests are crucial for our data pipeline integrity, and I previously wrote about this in one of my articles [4]. They ensure our data transformation logic stays correct no matter what changes we make during development. Yes, we are talking about unit testing data transformations.
A similar thing can be achieved in dbt. Unit tests are available in dbt Cloud and are currently limited to testing SQL models, and only models in your current project. For dbt Core, unit tests will be available in v1.8, planned for late April 2024, but there is also a way to run them using dbt packages. Consider the example below: we will run the project first to deploy our models and then run the test command:
dbt run --select "table_a table_b"
dbt test --select test_table_b
# ./models/example/schema.yml
version: 2

models:
  - name: table_a
    description: "A starter dbt model"
    columns:
      - name: id
        description: "The primary key for this table"
        tests:
          - unique
          - not_null

  - name: table_b
    description: "A starter dbt model"
    columns:
      - name: id
        description: "The primary key for this table"
        tests:
          - unique
          - not_null

unit_tests:
  - name: test_table_b
    description: "Expect a column to be created and table_b must return only one row with id = 1."
    model: table_b
    given:
      - input: ref('table_a')
        rows:
          - {id: 1}
          - {id: 2}
          - {id: 3}
          - {id: null}
    expect:
      rows:
        - {id: 1}
This is a more advanced dbt-core example. Here we will use the dbt_utils.equality test from the dbt_utils package to compare an expected dataset with the table we get after running our model:
version: 2

sources:
  - name: datasource
    database: |
      {%- if target.name == "dev" -%} raw_dev
      {%- elif target.name == "prod" -%} raw_prod
      {%- elif target.name == "test" -%} raw_test
      {%- else -%} invalid_database
      {%- endif -%}
    loader: loader_app_for_info_purpose
    tables:
      - name: transactions
        enabled: true
        description: filename LIKE '%payments/%'
        loaded_at_field: "timestamp::timestamp_ntz(9)"
        tests:
          - dbt_utils.equality:
              tags: ['unit_testing']
              compare_model: ref('transactions_test_expected')
              compare_columns:
                - col1
                - col2
Now if we run dbt build --select tag:unit_testing -t test in our command line, dbt will compare our transactions source table with the expected table built in the test environment. The expected table is called transactions_test_expected and is referenced using our favourite ref() function.
-- transactions_test_expected.sql
{{ config(
    materialized='table',
    schema='some_schema',
    query_tag='source',
    tags=["unit_testing"]
) }}

select
    '{
        "data": {
            ...
        }
    }'::VARIANT as raw_data
    , 'payments/2023/06/27/test.gz'::VARCHAR(16777216) as filename
    , 1
    , timestampadd(days, -1, current_timestamp())::timestamp_ntz as timestamp -- i.e. '2020-02-18 20:01:35.123'
dbt build --select tag:unit_testing -t test
[RUN]
19:40:35 2 of 2 PASS dbt_utils_source_equality__db_transactions_col1__ref_transactions_test_expected_ [PASS in 1.53s]
19:40:35
19:40:35 Finished running 1 table model, 1 test in 0 hours 0 minutes and 10.97 seconds (10.97s).
19:40:35
19:40:35 Completed successfully
19:40:35
19:40:35 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
dbt_utils.equality will compile our test into something like this, checking that the actual table generated by the run equals the expected table we defined for the test:
select count(*) as failures from (
    with a as (
        select * from raw_test.schema.transactions
    ),
    b as (
        select * from raw_test.schema.transactions_test_expected
    ),
    a_minus_b as (
        select filename from a
        except
        select filename from b
    ),
    b_minus_a as (
        select filename from b
        except
        select filename from a
    ),
    unioned as (
        select 'a_minus_b' as which_diff, a_minus_b.* from a_minus_b
        union all
        select 'b_minus_a' as which_diff, b_minus_a.* from b_minus_a
    )
    select * from unioned
) dbt_internal_test
This feature is very useful: this way we can test any model with just one dbt-core package.
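The only prerequisite is that the dbt_utils package is installed. A minimal packages.yml followed by dbt deps does the job (the version range below is illustrative; pin whatever matches your dbt-core release):
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
dbt deps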
Documentation and style
Documenting our datasets and providing useful metadata is very important. It makes our data platform transparent for data users, simplifies the development process and saves time. dbt offers some handy auto-generation features for docs. All we need is to run the two commands below, and all our ref() functions and the metadata from .yml files and config blocks will appear in one catalogue [5].
dbt docs generate
dbt docs serve
# 19:52:09 Building catalog
# 19:52:16 Catalog written to ...
127.0.0.1 - - [23/Feb/2024 19:51:46] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [23/Feb/2024 19:51:46] "GET /manifest.json?cb=1708717906539 HTTP/1.1" 200 -
127.0.0.1 - - [23/Feb/2024 19:51:46] "GET /catalog.json?cb=1708717906539 HTTP/1.1" 200 -

The way we name and style our dbt models is important too. As a rule of thumb, we SHOULD NOT:
- use abbreviations or aliases, e.g. cust instead of customers
- use dots in model names, e.g. my.model.with.dots; use "_" instead, as in many DWH solutions dots are used to reference databases and schemas
- use non-string data types for keys
- use reserved words for column names
- use inconsistent field names across the schema, e.g. if we have customer_id somewhere then we should stick to it and not introduce user_id for the same thing
- name date and timestamp fields without the "_at" suffix
- store timestamps in anything other than UTC; it is really annoying to deal with timestamps that are not in UTC
Consistency is key!
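A tiny sketch of what these conventions look like in practice (the column names are illustrative):
-- consistent keys, "_at" suffix for timestamps, everything stored in UTC
select
    customer_id,   -- the same key name everywhere, never user_id
    order_id,
    convert_timezone('UTC', created_at)::timestamp_ntz as created_at
from {{ ref('base_orders') }}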
Conclusion
Building a data warehouse solution is not a trivial task per se, and during the process it is important to keep things organised. In this story I tried to summarise some techniques for structuring our data transformation folders conveniently. Storing the SQL files this way keeps them organised and easy to explore for anyone else involved in the process. Modern data build tools (Dataform, dbt, etc.) offer a set of useful features to improve this process. We can power our SQL templates by injecting pieces of code using macros, variables and constants. From my experience, this feature combined with infrastructure as code helps ensure adequate CI/CD processes, which saves time during development. It helps organise and simplify the split between dev, live and automated-testing data environments, so we can focus on business logic and continuous improvement of our state-of-the-art data platform.
I hope you enjoyed this read! These are just my ideas. Please let me know what you think.
Recommended read
[1] https://medium.com/towards-data-science/data-warehouse-design-patterns-d7c1c140c18b
[2] https://medium.com/towards-data-science/advanced-sql-techniques-for-beginners-211851a28488
[3] https://medium.com/towards-data-science/database-data-transformation-for-data-engineers-6404ed8e6000
[4] https://towardsdatascience.com/unit-tests-for-sql-scripts-with-dependencies-in-dataform-847133b803b7