Handling Hierarchies in Dimensional Modeling

Author:Murphy | View: 26815 | Time: 2025-03-22 20:39:57

There are various modeling techniques for hierarchies. Which of them performs best in dimensional modeling in Data Warehouses? And how to address various types of hierarchies with them? Let's find out.

Hierarchies play a crucial role in dimensional modeling for Data Warehouses, influencing both the structure and efficiency of data analysis. Drawing from my own experiences in implementing data solutions for various companies, this article explores the Best Practices and techniques for handling various types of hierarchies in dimensional modeling. Through detailed examples and practical guidelines, I'll navigate the complexities of dealing with different types of hierarchies to ensure robust and scalable data warehouse designs.

When working with hierarchies, it's important to recognize their specificity and all related nuances. So before digging into modeling techniques, let's see what quirks we can find in real-life hierarchy scenarios. The examples in this article are made-up, but inspired by actual cases I had in one of the projects implemented for a global pharmaceutical company. Even though they are significantly simplified, they still demonstrate interesting aspects of data modeling.

Overview of Sample Hierarchies

Let's consider the following sample hierarchies: internal organization structure of a global company, geography, and products (pharmaceuticals) hierarchy.

Logical breakdown of sample hierarchies in the Pharma domain | Image by author

Organization Hierarchy

A global pharma company has several Business Units. Each Business Unit is composed of Divisions. Some larger Divisions have Sub-Divisions (but not all of them). At the lowest level, there are Performance Units. If a Division has Sub-Divisions, it doesn't have any Performance Units under it directly.

Geography Hierarchy

The company sells its products worldwide. It tracks performance at Regional and Country levels. In each Country, there are one or more Performance Units. Each Performance Unit operates in a single Country. Some larger countries are split into Sub-Country Regions (depending on a country, these might be "states", "provinces", "prefectures", etc.).

Product Hierarchies

Pharmaceuticals manufactured by the company are identified as SKUs at the lowest level. SKUs are local, i.e. country-specific. Each SKU has its local name and other attributes like package type, size, label languages, etc. SKUs are grouped into Local Products, then into Local Therapeutic Areas. All the local items are Country-specific. At the same time, Local Products are mapped to Global (international) Brands. Some of the Global Brands are considered Key Brands. All Global Brands are grouped into Global Therapeutic Areas. The [SKU > Local Product > Local Therapeutic Area] hierarchy is considered the Local Product Hierarchy. [Global Product > Global Brand > Global Therapeutic Area] is considered the Global Product Hierarchy.

Sample Facts

Another important aspect of hierarchies is understanding the facts to be tracked and how they link to the hierarchies. For the sake of this article let's assume we have the following types of facts:

Sales Actuals – reported at the SKU and Performance Unit level,
Sales Activities – reported at the Performance Unit, Country or Sub-Country Region (depending on a Country), and Local Product level,
Sales Targets – defined at the Division and Global Product level,
Competition Sales – collected at the Global Brand and Country level (collected from data vendors like IQVIA, allowing to track own market share).

The diagram below shows the four sample fact types and their relationships with the hierarchies.

Logical model of facts' relationships with the sample hierarchies | Image by author

Having the stage setup, let's see how to physically model the data structures for the hierarchies. In this article, I focus on dimensional modeling techniques, so we're going to model the hierarchies as dimensions.

One last thing: to focus on modeling data structures for hierarchies representation and not clutter the model with irrelevant details (in the context of this article), I assumed that all dimensions are SCD1 (though, in real life they all should be SCD2).

Naming the Dimension Tables

Before we start modeling the dimension tables, let's establish a naming convention. This not only enhances clarity and organization but also ensures that the tables are easily identifiable and maintainable.

Prefix for Dimension Tables

First, we will use a specific prefix for all dimension tables to distinguish them from fact tables and other database objects. A commonly used practice is to prefix these tables with d_ or dim_. I'll go with d_ prefix for brevity.

Common Prefix for Hierarchical Levels

Next, it's beneficial to use a common prefix for all tables that serve as levels within the same hierarchy. This practice further enhances clarity by grouping related tables. For example, if we are modeling an organizational hierarchy, we could use d_org_ as the prefix. Tables within this hierarchy might then be named d_org_business_unit, d_org_division, etc. Similarly, for a product hierarchy, we might use d_prd_, resulting in tables like d_prd_sku, d_prd_local_product, etc. For geographic data, d_geo_ could be used, leading to tables such as d_geo_region, d_geo_country, etc.

When the tables are displayed in a database client application, they are usually ordered alphabetically, so this convention will keep related tables together. Also, when using a client application with an autocompletion feature, entering the common prefix will display all tables related to a given hierarchy.

Column Naming Convention

When it comes to naming columns within these tables, adopting a consistent and descriptive convention is equally important. One effective approach is to use the Classwords convention. If you are not familiar with it, I recommend reading my other article in which I explain it:

Classwords – My Favorite Convention for Naming Database Columns

Also, I use the name _sk for PK columns. Want to know why? I explain it in this article:

Best Practices for Technical Columns in Database Design

Modeling the Physical Dimensions

Identification of Levels for Foreign Key References

The first step in modeling physical dimensions for hierarchies is to identify the levels within the hierarchy that need to be referenced using foreign keys (FKs). These levels are typically the primary entities that are crucial for business analysis and reporting. For each of these entities, a corresponding dimension table should be created, and a surrogate key (SK) should be assigned. This surrogate key will serve as a unique identifier for each record in the dimension table and will be used in the fact tables to establish relationships between the dimensions and the facts.

For example, in the Organization Hierarchy, only the Performance Unit level is being referenced from facts tables. On the other hand, in the Product Hierarchies, we have four levels that are referenced by FKs: SKU, Local Product, Global Product, and Global Brand.

In the diagram below, all levels of the dimensions, for which we need SKs (to let them be used as targets of FK constraints) are marked with a gold key symbol. These levels constitute a base of corresponding dimension tables. They determine their granularity and higher levels of the hierarchies can be modeled as their attributes.

Identification of hierarchies' levels that are targets of FKs | Image by author

Modeling the Remaining Levels

In some cases, certain levels within a hierarchy may only be used for dynamic aggregation purposes in reporting rather than as primary entities. These levels can be treated as attributes within the existing dimension tables instead of creating separate dimension tables for them.

Of course, all logical levels of the hierarchies can be modeled as separate dimension tables. However, in complex real-life scenarios with hierarchies composed of dozens of levels, this may clutter the database schema with lots of excessive tables.

The approach to keep them as attributes aligns with the Open-Closed Principle, which originates from programming and states that software entities should be open for extension but closed for modification. It fits the Data Modeling domain equally well. In case any of these remaining levels will need to become a target of an FK reference, a separate dimension table can be created on an ad-hoc basis. An FK to this new dimension would need to be added as a new column to the previous dimension, in which the level served as an attribute. It will create a redundancy: the level will be both an attribute in the legacy and a new dimension. But that's a small price to pay for the benefit of the initial simplification.

So, let's see in the diagram below, what the physical model looks like for Product Hierarchies.

Logical to physical model mapping for Product Hierarchies | Image by author

The six levels of Product Hierarchies (local and global) are mapped to four physical dimension tables with FK relationships (one-to-many) between them. Each level that is a target of an FK relationship has its own physical dimension table.

Handling Ragged Hierarchies

A ragged hierarchy is a type of hierarchy where the depth of the levels can vary. This means that not every branch of the hierarchy descends to the same level. The variability in depth can complicate the modeling and querying processes. The Organization Hierarchy (as defined above) provides a clear example of this. The company is divided into several Business Units, each comprising multiple Divisions. Some of these Divisions have an additional layer of Sub-Divisions, but this is not uniform across all Divisions. At the lowest level of the hierarchy, we find Performance Units, which directly report to either Divisions or Sub-Divisions, depending on the branch.

Ragged Hierarchy as Self-Referencing Dimension

This variability in structure requires a flexible modeling approach to ensure that all hierarchical relationships are accurately captured and can be analyzed effectively. Managing such ragged hierarchies typically involves using a parent-child relationship table, where each entity points to its immediate parent, allowing the hierarchy to be traversed regardless of its depth. This modeling technique is also known as the Adjacency List Model.

Sample physical structure of self-referenced dimension table for ragged Organization Hierarchy | Image by author

The image above presents a sample structure of a table for such a self-referenced dimension for Organization Hierarchy. It has the following columns important for hierarchy modeling:

_sk is a PK of an item (element of the hierarchy),
org_hierarchy_parent_sk is a FK to a parent node (except for top-level units, i.e., Business Units for which it's NULL),
level_number is an integer denoting the level of the hierarchy the item represents (e.g., 1 for top-level (Business Unit), 2 for second-level items (Division), etc.); I recommend using continuous numbering, so Performance Units will be at level 3 and 4 (depending on whether a Division they are a part of has Sub-Divisions or not),
level_code is a technical abbreviation of the hierarchy level (e.g., PU for Performance Unit, SD for Sub-Division, etc.),
unit_name is an actual name of a specific unit of the organization.

With such columns, it's easy to get a list of all Performance Units with their parent units (regardless of whether these are Sub-Divisions or Divisions):

SELECT
    pu._sk AS performance_unit_sk,
    pu.unit_name AS performance_unit_name,
    parent._sk AS parent_unit_sk,
    parent.unit_name AS parent_unit_name,
    parent.level_code AS parent_level_code
FROM
    d_org_hierarchy pu
LEFT JOIN
    d_org_hierarchy parent ON pu.org_hierarchy_parent_sk = parent._sk
WHERE
    pu.level_code = 'PU'

It's also possible to iteratively traverse up the hierarchy starting from any unit at any level:

WITH RECURSIVE org_hierarchy_cte AS (
    -- Anchor member: start from the given unit
    SELECT
        _sk,
        hierarchy_parent_sk,
        level_code,
        unit_name
    FROM
        d_org_hierarchy
    WHERE
        sk = @starting_unit_sk  -- Replace @starting_unit_sk with the actual starting unit SK

    UNION ALL

    -- Recursive member: traverse up the hierarchy
    SELECT
        parent.sk,
        parent.org_hierarchy_parent_sk,
        parent.level_code,
        parent.unit_name
    FROM
        d_org_hierarchy parent
    INNER JOIN
        org_hierarchy_cte cte ON cte.org_hierarchy_parent_sk = parent.sk
)
SELECT
    sk,
    level_number,
    level_code,
    unit_name
FROM
    org_hierarchy_cte

This approach for modeling ragged hierarchies comes with significant limitations:

It's not possible to create an FK relationship from facts to a specific level of the hierarchy (e.g., from Sales Targets to Performance Unit).
It's not possible to effectively manage attributes of hierarchy levels (as each level may have different attributes and there's a common table for all of them).

How to address these limitations?

Ragged Hierarchy as Level-Specific Dimensions

An alternative option to model ragged hierarchies is by using level-specific (or logical group of levels-specific) dimensions. In the example of Organization Hierarchy, there will be two dimension tables: the first one for Performance Units and Sub-Divisions, and the second one for Divisions and Business Units.

Sample physical structure of dedicated dimension tables for ragged Organization Hierarchy | Image by author

With this approach, both limitations of self-referenced dimension tables are easily addressed. But what about the Sub-Division level which is optional?

There are three approaches to fill the subdivision_name column for cases, where there's no Sub-Division level. The simplest and the most obvious one is to set them to NULL . Yet, it might be challenging on the reporting side where such optional levels are usually not well supported in aggregation/drill-down/drill-through scenarios. The two other options are to either up-shift or down-shift the values from a lower or a higher level of the hierarchy, accordingly. Both of them ensure continuity in the hierarchy levels. Up-shifting involves re-using values from a lower level for the missing level. Down-shifting assumes re-using values from a higher level. The table below compares the three methods.

Three approaches to fill values for a missing hierarchy level | Image by author

While the up-/down-shifting methods may obscure true hierarchy structure, they transform a ragged hierarchy into a regular hierarchy with fixed levels. For example, if you want to report values of metrics at a Sub-Division level, and for Divisions that don't have Sub-Divisions use the Division level, you should go with the down-shifting approach. Then a query as simple as the one below will do the trick:

SELECT
    pu.subdivision_name,
    SUM(sa.sales_amount_eur) AS total_sales_amount_eur
FROM
    f_sales_actual sa
LEFT JOIN
    d_org_performance_unit pu ON pu._sk = sa.performance_unit_sk
GROUP BY
    pu.subdivision_name

Keep in mind, that for more complex ragged hierarchies with several optional levels, it's OK to mix up- and down-shifting for various levels.

I also recommend adding an indicator for each parent level which has an optional sub-level, to recognize whether the sub-level name is its actual name or whether it has been up- or down-shifted. This is why in the diagram above for the physical structure of the Organization Hierarchy I've added has_subdivisions_indicator column to the d_org_division dimension table.

Unified Hierarchy View

Another tip that I found pretty handful is to create a unified (and denormalized) view for each hierarchy composed of multiple dimension tables. It's especially useful with large multi-table hierarchies. The view is a simple materialization of all joins between the dimensions composing the entire hierarchy.

Unified (denormalized) view for Product Hierarchies | Image by author

What's the benefit of using it? First of all, it's a time saver for any analytical or troubleshooting operations when you don't need to re-write all the joins by hand. You can easily use the view to see the full path (incl. all the attributes of all the levels) from any leaf node to the root node. Example SQL query for Product Hierarchy and some SKU at leaf level:

SELECT
    *
FROM
    d_prd_hierarchy
WHERE
    sku_number = 'ABC-XYZ-123'

Also, it's useful for any Quality Assurance-related tasks for the hierarchy. For that purpose, I recommend making all the joins in the view full outer joins to make sure nothing is lost, even if the parent-child relationships between the hierarchy levels are broken. Apart from easily spotting any broken links, you can use such a view, e.g., for Organization Hierarchy to find incorrect down-shifting logic from Divisions to Sub-Divisions (between two physical dimension tables):

-- Select all invalid application of down-shifting logic for Sub-Divisions
SELECT DISTINCT
    subdivision_name,
    division_name
FROM
    d_org_hierarchy
WHERE
    division_has_subdivisions_indicator = 0
    AND subdivision_name <> division_name

Denormalized Base-Level Dimension + Shrunken Rollup Conformed Dimensions

An alternative approach to the Unified Hierarchy View described above is to create a Denormalized Base-Level Dimension (an equivalent of the Unified Hierarchy View but in the form of a regular table) and then derive from it all Conformed Dimensions in the form of views.

Product Hierarchies modeled as Denormalized Base-Level Dimension with derived Shrunken Rollup Conformed Dimensions | Image by author

This approach is based on Ralph Kimball's Shrunken Rollup Dimensions technique. The idea is to have a Denormalized Base-Level Dimension table with all possible levels of a hierarchy and their attributes. An important aspect, not covered by Kimball's technique, is to include columns for all SKs of higher levels that are to serve as targets of FKs. And there are two important reasons for that:

Shrunken Rollup Dimensions for the higher levels of the hierarchy can be generated as simple views (see code example below) from the Denormalized Base-Level Dimension (thus you don't need to implement any custom logic to update them when the base dimension is updated).
A complete logic to compute the entire hierarchy is kept as a single unit of code that's responsible for the generation of the Denormalized Base-Level Dimension.

This approach is very similar to the one described in Design Tip #137 Creating and Managing Shrunken Dimensions – Kimball Group.

An example code to generate d_prd_global_brand from d_prd_sku is as simple as:

CREATE VIEW d_prd_global_brand AS
SELECT DISTINCT
    global_brand_sk AS _sk,
    global_brand_name AS brand_name,
    global_brand_is_key_brand_indicator AS is_key_brand_indicator,
    global_therapeutic_area_name,
    ... -- a placeholder for additional dimension levels/attributes
FROM d_prd_sku

In practice, for performance reasons, you should create the Shrunken Rollup Dimensions as materialized views or equivalent objects (depending on the capabilities of the database engine used).

Guidelines Summary

In this article, I've demonstrated practical guidelines, based on my real-world experiences, for handling hierarchies in dimensional modeling. Next time when you are working on a data model that incorporates some hierarchies, consider the following tips:

When establishing a naming convention for dimension tables, consider using a common prefix for dimensions serving as levels of a common hierarchy.
Identify critical levels within hierarchies for FK references and determine when to model levels as separate dimension tables versus attributes.
When dealing with ragged hierarchies consider modeling them as regular hierarchies using level-specific dimensions with up- and/or down-shifting of values for optional hierarchy levels.
Create denormalized views to simplify analytical and troubleshooting tasks by providing a consolidated perspective of the entire hierarchy.
Alternatively, model the full hierarchy as a Denormalized Base-Level Dimension and derive from it higher levels of the dimension (only those that are to serve as a target of a FK) as Shrunken Rollup Conformed Dimensions.