Data Warehouse Design Patterns


I recently needed a data warehouse for a new data project. This story is about how I built it from scratch and organised everything inside it. Designing a data platform is not a trivial task, and a modern data warehouse solution often sits at the centre of its architecture: it provides robust data governance features, simplified querying with ANSI SQL and enhanced data modelling capabilities. Organising everything inside it, i.e. data environments, tests, naming conventions, databases, schemas and tables, can be challenging due to the high number of data sources and the complexity of the required transformations. This story might be useful for beginner and intermediate-level users who would like to learn advanced data warehousing techniques. With seasoned data practitioners, I would like to discuss how they think about data warehouse design and how they typically organise everything inside.

Designing a data platform

As a data engineer, I design data pipelines every day. This is what a modern data platform consists of, and it must be cost-effective, scalable and easy to maintain in the long run. Designing pipelines for data-intensive applications is always challenging, and a modern data warehouse (DWH) aims to simplify and enhance this process by providing easy access to data, better data governance capabilities and easy-to-maintain data transformations for analytics and business intelligence.

It always makes sense to use a DWH in our data platform when users would like to access and explore data themselves and there is a business requirement for reporting. Modern data warehouses simplify data access and data governance, and I believe they are an integral part of any modern data platform. I previously raised this discussion here [1]:

Data Platform Architecture Types

I chose to use the data lake as a permanent landing area and to stage data before I actually load it into the Data Warehouse. Cloud service providers offer cloud storage solutions where access and retention policies can be easily managed. I decided to use it as a single source of truth for all data pipelines I deploy.

Cloud storage acts as a staging area where I land raw data first and keep an untouched copy in the archive.

Data Lake House architecture. Image by author.

Why choose a data warehouse for your data platform?

Data warehouse solutions often act as foundations for modern data stacks. Data can be ingested in raw format or with the required transformations (ETL) applied, and we can access it using SQL. There are plenty of tools on the market that allow you to do that, e.g. Fivetran, Stitch or Airbyte. If we know how to code, we can deploy bespoke services to extract and ingest data. I previously wrote about it in one of my tutorials [2].

Building a Batch Data Pipeline with Athena and MySQL

Inside a DWH, data is transformed with modern data transformation tools like DBT and Dataform, again using SQL. These tools use SQL-like templates and can apply database and schema suffixes depending on the data environment.

Splitting data environments between live and staging is one of the most important tasks in data warehouse design.

Ultimately, a modern DWH is a powerful tool for managing data effectively, and it is important to set everything up right from the start so that data flows as expected. We want to declare our databases, schemas and tables correctly, and applying sensible naming conventions helps avoid a lot of potential issues down the line.

Modern Data Warehousing

A modern DWH built on tools such as Snowflake, Redshift or BigQuery typically consists of different databases, each with its own distinct function and purpose. Inside databases, we use schemas containing tables and views to categorise our datasets further.

Schema categorisation helps users identify the datasets they need: they know exactly where to look for data.

Depending on the type of data platform and the transformations we apply inside the DWH, it makes sense to create a dedicated database for raw or source data. This is the first component of our data warehouse, where data lands untransformed, "as is". We can then enrich and cleanse it and apply basic field-type transformations to promote it to production. By "production" I mean the state of our data after the required manipulations, when it can be considered clean and production-grade and presented to someone, i.e. the BI team or any other business stakeholder. These "PROD"-grade views and tables can then be connected to any BI solution, where BI developers can model them.
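To illustrate the promotion step, here is a minimal sketch of a RAW-to-production transformation in Snowflake; all table and column names below are hypothetical, and the casts stand in for whatever field-type fixes your source actually needs:

-- Hypothetical promotion step: cleanse and cast raw data before exposing it
-- as a production-grade table. Assumes raw columns land as strings.
CREATE OR REPLACE TABLE BASE.EVENTS.PAYMENTS AS
SELECT
    transaction_id::VARCHAR        AS transaction_id,
    TRY_TO_DECIMAL(amount, 12, 2)  AS amount,      -- enforce a numeric type
    TRY_TO_TIMESTAMP(created_at)   AS created_at,  -- enforce a timestamp type
    CURRENT_TIMESTAMP()            AS promoted_at  -- audit column
FROM RAW.EVENTS.PAYMENTS
WHERE transaction_id IS NOT NULL                   -- drop malformed records
;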

Data Modelling For Data Engineers

It is best practice to keep only a recent portion of the data in the RAW database and use it to update our "BASE" or "PROD" tables. We can define a data expiration policy for the RAW layer and use SQL MERGE to update data in production. Consider the MERGE example below; it shows how the target dataset is updated.

-- Creating Tables
CREATE OR REPLACE TABLE target_table (
 customer_id NUMBER,
 is_active BOOLEAN,
 updated_date DATE
)
;

CREATE OR REPLACE TABLE source_table (
 customer_id NUMBER,
 is_active BOOLEAN
)
;

-- Inserting test values
INSERT INTO target_table VALUES
(1, TRUE, '2024-01-01'),
(2, TRUE, '2024-02-01'),
(3, FALSE, '2024-03-01')
;

INSERT INTO source_table VALUES
(1, FALSE),  -- update record
(2, TRUE),   -- update record
(4, TRUE)    -- new record
;

-- UPSERT STARTS HERE
-- Insert a missing row and update the existing ones
MERGE INTO target_table tgt
USING source_table src 
 ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
 UPDATE SET
   tgt.is_active = src.is_active,
   tgt.updated_date = '2024-04-01'::DATE
WHEN NOT MATCHED THEN
 INSERT
   (customer_id, is_active, updated_date)
 VALUES
 (src.customer_id, src.is_active, '2024-04-01'::DATE)
; 

-- Output
| Number of Rows Inserted | Number of Rows Updated |
|-------------------------|------------------------|
|           1             |           2            |

There is a good example using MERGE in this story:

Advanced SQL techniques for beginners
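As for the data expiration policy mentioned above, one option in Snowflake is a scheduled task that trims rows from the RAW layer once they have been merged downstream. This is only a sketch; the warehouse, table and column names are hypothetical:

-- Retention policy for the RAW layer: delete rows older than 30 days.
-- Assumes a loaded_at audit column on the raw table.
CREATE OR REPLACE TASK raw_events_expiration
  WAREHOUSE = test_wh
  SCHEDULE = 'USING CRON 0 3 * * * UTC'  -- daily at 03:00 UTC
AS
  DELETE FROM RAW.EVENTS.PAYMENTS
  WHERE loaded_at < DATEADD(day, -30, CURRENT_TIMESTAMP())
;

ALTER TASK raw_events_expiration RESUME;  -- tasks are created suspended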

I also recommend creating a dedicated database for analytics. This will be a place for ad-hoc analytics and materialised queries and views.

Indeed, often it makes sense to create a separate database where we can store the results of longer analytics SQL queries and data explorations.

This database can also be connected to a BI solution. Analysts often write queries to explore data and generate one-off reports that don't require regular updates (daily or at any other frequency), so we don't need to keep these queries in the "PROD" or "BASE" database.
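For example, an analyst's one-off aggregate could be materialised in a dedicated ad-hoc schema, keeping it out of "PROD" and "BASE"; the names below are hypothetical:

-- Sketch: save an exploratory aggregate in the analytics area
CREATE SCHEMA IF NOT EXISTS ANALYTICS.AD_HOC;

CREATE OR REPLACE TABLE ANALYTICS.AD_HOC.MONTHLY_ACTIVE_CUSTOMERS AS
SELECT
    DATE_TRUNC('month', updated_date) AS month,
    COUNT_IF(is_active)               AS active_customers
FROM BASE.EVENTS.CUSTOMERS
GROUP BY 1
ORDER BY 1
;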

So the final diagram for the DWH database split will be the following:

DWH database levels. Image by author.

It looks simple, but it is powerful. This conceptual diagram gives a short explanation of each DWH database. A database can be considered a container of DWH objects, i.e. schemas, views, tables, UDFs, etc. The next step is to split it into different data environments.

In the analytics database, data analysts can create any tables they need to save the results of their queries. This database can also be connected to any BI tool or dashboard.

Organising data environments

Ideally, we want an environment where we stage, mock and develop our data transformations, and a production environment where approved and tested transformations run on a schedule, i.e. daily, hourly, etc. So in the end we will have ANALYTICS_DEV and ANALYTICS_LIVE databases. The same applies to the RAW and BASE areas. We can either set up a DEV data pipeline feeding data into our RAW_DEV tables or mock data manually using SQL INSERT, as in the sketch below.
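A minimal sketch of mocking data in the DEV environment might look like this (database, schema, table and column names are illustrative):

-- Mock input data for development so transformations can be built and tested
-- without touching production pipelines
CREATE DATABASE IF NOT EXISTS RAW_DEV;
CREATE SCHEMA IF NOT EXISTS RAW_DEV.EVENTS;

CREATE OR REPLACE TABLE RAW_DEV.EVENTS.PAYMENTS (
  transaction_id VARCHAR,
  amount         NUMBER(12,2),
  created_at     TIMESTAMP
);

INSERT INTO RAW_DEV.EVENTS.PAYMENTS VALUES
  ('tx-001', 10.50, '2024-01-01 10:00:00'),
  ('tx-002', 99.99, '2024-01-02 11:30:00')
;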

Modern data transformation tools like DBT and Dataform offer this feature. We can tell the data transformation engine to use a specific git branch for a particular DWH project and/or apply additional suffixes (_DEV, _LIVE/_PROD) to the databases and schemas that exist in a single project. This is how I do it in my Snowflake DWH.

DBT config to change your source database depending on the environment. Image by author.
Databases in DEV environment. Image by author.

Let's imagine we promote our data transformation scripts to "PROD" by merging our development git branch into the master branch. DBT will then compile our SQL templates, adding the "_PROD" suffix to database names.

Databases in PROD environment. Image by author.
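Under the hood, one way to achieve this suffixing in DBT (not necessarily what the screenshots above show) is to override the generate_database_name macro; a minimal sketch, assuming dbt targets named "dev" and "prod":

-- macros/generate_database_name.sql (sketch)
{% macro generate_database_name(custom_database_name, node) -%}
    {#- append _PROD or _DEV to the database configured on each model -#}
    {%- set suffix = '_PROD' if target.name == 'prod' else '_DEV' -%}
    {%- if custom_database_name is none -%}
        {{ target.database | trim }}
    {%- else -%}
        {{ custom_database_name | trim ~ suffix }}
    {%- endif -%}
{%- endmacro %}

With this macro in place, a model configured with database='ANALYTICS' compiles to ANALYTICS_DEV or ANALYTICS_PROD depending on the target you run against.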

I am building a Snowflake data warehouse and my databases and schemas will look like this:

DATABASE_NAME   SCHEMA_NAME
---------------------------------
RAW_DEV         SERVER_DB_1     -- mocked data
RAW_DEV         SERVER_DB_2     -- mocked data
RAW_DEV         EVENTS          -- mocked data
RAW_PROD        SERVER_DB_1     -- real production data from pipelines
RAW_PROD        SERVER_DB_2     -- real production data from pipelines
RAW_PROD        EVENTS          -- real production data from pipelines
...
BASE_PROD       EVENTS          -- enriched data
BASE_DEV        EVENTS          -- enriched data
...
ANALYTICS_PROD  REPORTING       -- materialised queries and aggregates
ANALYTICS_DEV   REPORTING
ANALYTICS_PROD  AD_HOC          -- ad-hoc queries and views
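Bootstrapping this layout in Snowflake is a handful of statements; a partial sketch (extend per environment and schema as needed):

-- Create the databases and schemas that make up the layout above
CREATE DATABASE IF NOT EXISTS RAW_DEV;
CREATE DATABASE IF NOT EXISTS RAW_PROD;
CREATE DATABASE IF NOT EXISTS BASE_DEV;
CREATE DATABASE IF NOT EXISTS BASE_PROD;
CREATE DATABASE IF NOT EXISTS ANALYTICS_DEV;
CREATE DATABASE IF NOT EXISTS ANALYTICS_PROD;

CREATE SCHEMA IF NOT EXISTS RAW_PROD.SERVER_DB_1;
CREATE SCHEMA IF NOT EXISTS RAW_PROD.EVENTS;
CREATE SCHEMA IF NOT EXISTS BASE_PROD.EVENTS;
CREATE SCHEMA IF NOT EXISTS ANALYTICS_PROD.REPORTING;
CREATE SCHEMA IF NOT EXISTS ANALYTICS_PROD.AD_HOC;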

In Snowflake we can use this query to list all schemas in our account:

show schemas in account
;

select
    "database_name" as DATABASE_NAME
    ,"name" as SCHEMA_NAME
from table(result_scan(last_query_id()))
where SCHEMA_NAME not in ('INFORMATION_SCHEMA') -- optional filter(s)
;

Consider another example below. In this story, I used Dataform (now a part of Google Cloud Platform) [3] to create data environments and split them between two GCP projects – one for DEV and one for PROD.

It might be a good idea to create a dedicated project per data environment, as it makes it easier to manage and, if needed, delete all associated resources.

Staging and Production in Google BigQuery. Image by author.

Easy way to create Live and Staging environments for your Data

In the Dataform config file we can specify a different database to use for staging purposes:

defaultDatabase on staging. Image by author.

Additionally, we might want to add a database for automated tests, i.e. PROD_TEST, where our CI pipelines can test the data transformation logic. I previously wrote about it in this tutorial [4].

Unit Tests for SQL Scripts with Dependencies in Dataform

Although this example might seem a bit complex, it follows a simple idea: use templated SQL files to mock test data, run the transformation SQL against it with a test schema suffix, and compare the result with the expected outcome.
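A sketch of that comparison (the table names are hypothetical) is a symmetric-difference query that returns zero rows when the transformation output matches the expected fixture, so the CI step fails whenever any rows come back:

-- Compare actual vs expected output in the test database
(SELECT * FROM PROD_TEST.EVENTS.PAYMENTS_TRANSFORMED
 MINUS
 SELECT * FROM PROD_TEST.EVENTS.PAYMENTS_EXPECTED)
UNION ALL
(SELECT * FROM PROD_TEST.EVENTS.PAYMENTS_EXPECTED
 MINUS
 SELECT * FROM PROD_TEST.EVENTS.PAYMENTS_TRANSFORMED)
;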

I borrowed this logic from conventional unit testing and it has worked well for me. If you are comfortable with coding, anything like this can be deployed as a simple microservice and added to a CI/CD pipeline [5].

Alerts and load monitoring

There are a couple of things we might want to do with our data warehouse to ensure everything works as expected:

  • load monitoring – to see if anything failed to load, e.g. a file in the wrong format or a streaming record that doesn't comply with the table schema.
  • data quality checks – to catch missing datasets from external providers or malformed data, i.e. NULLs, wrong formats, etc.
  • cost monitoring – to identify long-running queries and their cost (see the sketch right after this list).
  • usage monitoring – to alert on suspicious or unexpected activity.
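For the cost-monitoring item, a minimal sketch against Snowflake's ACCOUNT_USAGE views (which lag real time by up to about 45 minutes) could surface yesterday's longest-running queries:

-- Longest-running queries over the last day
SELECT
    query_id,
    user_name,
    warehouse_name,
    total_elapsed_time / 1000 AS elapsed_seconds,
    query_text
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD(day, -1, CURRENT_TIMESTAMP())
ORDER BY total_elapsed_time DESC
LIMIT 20
;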

For the data quality checks, there are various options too. For instance, we can create ALERTS in Snowflake using SQL like so:

-- create email integration
CREATE NOTIFICATION INTEGRATION snowflake_notification_email_int
  TYPE=EMAIL
  ENABLED=TRUE
  -- ALLOWED_RECIPIENTS=('[email protected]');
;
-- try and send
CALL SYSTEM$SEND_EMAIL(
    'snowflake_notification_email_int',
    '[email protected]',
    'Email Alert: Task A has finished.',
    'Currency conversion alert. Potential data issues.\nStart Time: 10:10:32\nEnd Time: 12:15:45\n'
)
;
-- create and schedule alert
CREATE OR REPLACE ALERT currency_conversion_live_alert
  WAREHOUSE = test_wh
  SCHEDULE = 'USING CRON 0 6 * * * UTC'
  IF ( exists (
    select * from (
      SELECT
        count(*) = 0 as alert
      FROM CURRENCY_CONVERSION
      where 
      date(date) = current_date()
    )
      where alert = true
    )
  )
  THEN
  CALL SYSTEM$SEND_EMAIL(
    'snowflake_notification_email_int',
    '[email protected]',
    'Email Alert: Currency conversion.',
    'Missing currency conversion data for the current date.'
);
ALTER ALERT currency_conversion_live_alert RESUME;
EXECUTE ALERT currency_conversion_live_alert;
show alerts;

It will result in an email signalling potential data outages:

Email notification example. Image by author.

This is a simplified example. Similar things can be done in Google Cloud BigQuery and other data warehouse solutions. I previously wrote about it here [6].

Automated emails and data quality checks for your data

The ideal scenario would be to send this ALERT data to a dedicated monitoring service, e.g. Datadog. This can be done programmatically, and such a service can be easily integrated with tools like PagerDuty to automatically assign tickets to the relevant users.

Consider the example below. It uses the Snowpark library to create such a pipeline, with an AWS Lambda function scheduling the data checks. Alternatively, we can add this feature to any data service we build as an API and run these checks regularly.

import os
import requests

from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import serialization

from snowflake.snowpark.session import Session

def lambda_handler(event, context):
    try:
        # Load the private key used for Snowflake key-pair authentication
        with open("./rsa_key.p8", "rb") as key:
            p_key = serialization.load_pem_private_key(
                key.read(),
                password=os.environ['PRIVATE_KEY_PASSPHRASE'].encode(),
                backend=default_backend()
            )

        pkb = p_key.private_bytes(
            encoding=serialization.Encoding.DER,
            format=serialization.PrivateFormat.PKCS8,
            encryption_algorithm=serialization.NoEncryption())

        connection_parameters = {
            "user": 'USERNAME',
            "private_key": pkb,  # Follow the Snowflake instructions to create a user with key-pair authentication
            "account": 'snowflake-account-name',
            "session_parameters": {
                'QUERY_TAG': 'Alerting',
            },
            "warehouse": "TEST_WH",
            "database": "TEST",
            "schema": "TEST"
        }
        new_session = Session.builder.configs(connection_parameters).create()

        # Get alerts from Snowflake and materialise the rows before closing the session
        alert_rows = new_session.sql("select * from TEST_ALERTS_VIEW").collect()

        new_session.close()

        # POST alert data to a data monitoring service
        r = requests.post(
            'https://some.monitoring.service.org/post',
            json={'alerts': [str(row) for row in alert_rows]}  # serialise rows before posting
        )

        message = 'Successfully submitted {} alerts.'.format(len(alert_rows))
    except Exception as e:
        print(e)
        message = 'Failed to submit alerts.'
    return {
        'statusCode': 200,
        'body': {'lambdaResult': message}
    }

Python for Data Engineers

This scenario requires a Snowflake user with key-pair authentication [7]. Please follow the Snowflake instructions to create one.

Conclusion

I always start a new data warehouse project with planning and design sessions. Be sure to think about databases and schemas first, because it is hard to change them later once they are connected to BI tools and other consumers. Applying relevant names to source tables and schemas helps users find data faster. Organising data environments the right way enables automated testing and CI pipelines that ensure your data transformation scripts run correctly and follow the transformation logic defined in the business requirements. There are many possible ways to deploy the data platform resources that feed data into our data warehouse, and it is a good idea to reflect them in table descriptions; this makes them visible to all stakeholders and ensures transparency across the data platform. The majority of routine DBA monitoring tasks can be easily automated, which saves a lot of time in the long run, and it is definitely something to consider too. I highly recommend documenting everything with a data modelling or data transformation tool: that way we can keep everything in Git and surface it in our schema and table definitions as well.

Thanks for reading. I'm looking forward to learning what you think about it.

Recommended read

[1] https://towardsdatascience.com/data-platform-architecture-types-f255ac6e0b7

[2] https://towardsdatascience.com/building-a-batch-data-pipeline-with-athena-and-mysql-7e60575ff39c

[3] https://towardsdatascience.com/easy-way-to-create-live-and-staging-environments-for-your-data-e4f03eb73365

[4] https://medium.com/towards-data-science/unit-tests-for-sql-scripts-with-dependencies-in-dataform-847133b803b7

[5] https://towardsdatascience.com/test-data-pipelines-the-fun-and-easy-way-d0f974a93a59

[6] https://medium.com/towards-data-science/automated-emails-and-data-quality-checks-for-your-data-1de86ed47cf0

[7] https://docs.snowflake.com/en/user-guide/key-pair-auth

Tags: Analytics Data Engineering Data Science Data Warehouse Database
