Metrics Store in Action

There is a lot of literature on Modern Data Stacks (MDS) – most discussions revolve around storage, ingestion, transformation, and presentation. Today we focus on metrics, one of the many other MDS categories.
Some say that metrics are one component of the semantic layer – see this Airbyte blog post. Others, like us, use the terms "metrics store," "metrics layer," and "semantic layer" interchangeably.
What is a metrics store, aka metric layer, aka semantic layer?
"Last month's revenue, as reported in the analyst's notebook, is different from that in our dashboard. Which figure should we report to audit?"
"Do you know how we define the Churn Rate KPI in our retention app?"
"How do we segment our revenues from customer orders into geographic regions?" By customer billing location? By delivery location? Else?
Such questions frequently pop up in the companies I have worked for. We all know why: different teams work in silos, implementing metrics with their preferred technology stack. The details are hidden in SQL procedures locked inside an access-restricted database, in some local Power BI files, or in places we don't even know about. A metrics store is here to help:
- Single point of truth – a central code repository and knowledge database for metrics and dimensions.
- Proxy to the data warehouse – all presentation tools (BI, reports, notebooks) request KPIs through one of the metrics store's APIs.
A metrics store also helps with governance and caching, but that is out of scope here.
Transformation layer vs. metric layer – where to draw the line?
The transformation layer is the "T" in ELT; logically, it comes second, right after the raw data (the "EL") is ingested into our storage. Like the metrics layer, the "T" step transforms data before downstream applications consume it. You may ask: how are those two layers different, then? Where does the responsibility of "T" end, and where does that of the metrics store begin? Simon Späti sums it up:
A semantic layer transforms/joins data at query time, whereas the transformation layer does during the transform (T) step of ETL/ELT, meaning it's pre-calculated.
Steps that undoubtedly belong in the transformation layer:
- Reorganize and rename tables and columns with dbt, following a consistent style guide like this one.
- Clean up column data types, like timestamps.
- Add relevant columns, like margin = revenue minus cost.
- Fix known data quality issues.
Aggregations, on the other hand, like total monthly revenues, naturally fall under the responsibility of the metrics store.
Moving more transformation steps into the metrics store adds compute costs – far more than we save in storage. So you may want to keep some expensive or frequently requested calculations in the "T" step. This is also where a metrics store's materialization and caching capabilities can help.
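To make the split concrete, here is a minimal, self-contained sketch with made-up table and column names: the transformation layer pre-calculates a row-level margin column once, while the metrics layer computes an aggregation like monthly revenue at query time.
# Illustrative only: a toy in-memory example, not part of the tutorial's repository.
import duckdb

con = duckdb.connect()  # in-memory database

# Toy raw data with made-up values.
con.execute("""
    create table orders as
    select * from (values
        ('o1', date '2023-01-05', 100.0, 60.0),
        ('o2', date '2023-01-20',  80.0, 50.0),
        ('o3', date '2023-02-02', 120.0, 70.0)
    ) as t(order_id, booked_date, revenue, cost)
""")

# Transformation layer ("T"): pre-calculate a row-level column, stored once.
con.execute("""
    create table fct_orders as
    select *, revenue - cost as margin from orders
""")

# Metrics layer: aggregations like total monthly revenue are computed at query time.
print(con.query("""
    select date_trunc('month', booked_date) as month, sum(revenue) as revenue
    from fct_orders
    group by 1
    order by 1
""").df())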
Why MetricFlow by Transform Data
We chose MetricFlow for three reasons:
- It is open-source and comes with a friendly license.
- It is a Python package, fits nicely into the data stack we use in the tutorial below, and is easy to configure.
- dbt Labs will acquire the company behind MetricFlow – a strong indication that MetricFlow will thrive in the coming years.
Do you want to check what other tools compete in the semantic space? See this overview and read this article.
Tutorial part 1 – set up a local environment.
We use DuckDB for storage, dbt for the transformation layer, MetricFlow as our metrics store, and a Streamlit app for presentation. All four tools are Python packages and can be comfortably installed in a Conda virtual environment.
But first, we need to install Postgres and MySQL, both dependencies of the MetricFlow package – even though we use DuckDB. On my Mac, I had to do the following:
brew install postgresql mysql
Next, clone this GitHub repository, cd to the root, and create the environment:
conda env create -f environment.yml
Then activate it:
conda activate local-mds
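Optionally, you can verify that the key packages resolved inside the new environment. The distribution names below are assumptions based on the stack described above; adjust them if your environment.yml differs.
# Optional sanity check: print the installed versions of the stack's core packages.
# Distribution names are assumptions (e.g. dbt ships as "dbt-core" on PyPI).
from importlib.metadata import version

for package in ("duckdb", "dbt-core", "metricflow", "streamlit"):
    print(package, version(package))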
Tutorial part 2 – generate and explore raw data.
The following Python script generates synthetic data and ingests it into a local DuckDB database:
python src/generate-raw-data.py
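The actual generator lives in src/generate-raw-data.py and covers all four tables; the following is only a minimal sketch of the idea for the orders table, with column names matching the output shown below.
# A minimal sketch of generating a synthetic orders table and writing it to DuckDB.
# The real src/generate-raw-data.py differs in scope and details.
import duckdb
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

orders = pd.DataFrame({
    "Id": [f"o{i:05d}" for i in range(n)],
    "SiteIdDelivery": [f"s{i:05d}" for i in rng.integers(0, 50, n)],
    "DateBooked": pd.Timestamp("2020-01-01") + pd.to_timedelta(rng.integers(0, 3 * 365, n), unit="D"),
    "CurrencyCode": rng.choice(["USD", "EUR", "GBP"], n),
    "Price": rng.uniform(5, 100, n).round(2),
    "Cost": rng.uniform(1, 60, n).round(2),
    "Type": rng.choice(["Parts", "Installation"], n),
})

con = duckdb.connect("data/data-mart.duckdb")
con.execute("create schema if not exists raw")
con.register("orders_df", orders)
con.execute("create or replace table raw.orders as select * from orders_df")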
The "raw" data consists of four tables: accounts, sites, orders, and fx rates. A fictitious business with orders
import duckdb

con = duckdb.connect('data/data-mart.duckdb')
con.query('select * from raw.orders limit 5').df()
Id SiteIdDelivery DateBooked CurrencyCode Price Cost Type
0 o00000 s00018 2020-10-28 USD 13.44 9.36 Parts
1 o00001 s00032 2020-05-04 USD 84.74 49.30 Parts
2 o00002 s00009 2021-08-16 USD 76.38 41.19 Installation
3 o00003 s00025 2022-07-26 USD 25.51 9.99 Installation
4 o00004 s00038 2020-02-16 GBP 49.54 20.08 Parts
booked in different currencies with exchange rates
con.query('select * from raw.fx_rates limit 5').df()
Year Currency Rate
0 2020 Euro 1.10
1 2021 Euro 1.20
2 2022 Euro 1.15
3 2020 Pound Sterling 1.30
4 2021 Pound Sterling 1.40
and delivered to sites
con.query('select * from raw.sites limit 5').df()
Id SiteOwnerId Region
0 s00000 a00000 Antarctica
1 s00001 a00001 Europe
2 s00002 a00002 South America
3 s00003 a00003 South America
4 s00004 a00004 South America
owned by customers with accounts
con.query('select * from raw.accounts limit 5').df()
Id CompanyName BillingRegion
0 a00000 Cisco Systems, Inc. Antarctica
1 a00001 Asbury Automotive Group Europe
2 a00002 Helping Hand South America
3 a00003 Buena Vista Realty Service South America
4 a00004 The Pink Pig Tavern South America
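Before transforming anything, it does not hurt to check the table sizes; the row counts depend on the generator's settings.
# Row counts for all four raw tables.
import duckdb

con = duckdb.connect("data/data-mart.duckdb")
for table in ("accounts", "sites", "orders", "fx_rates"):
    count = con.execute(f"select count(*) from raw.{table}").fetchone()[0]
    print(f"raw.{table}: {count} rows")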
Tutorial part 3 – transform data with dbt.
Our git repository already comes with a dbt project called data_mart. Transforming data with dbt is as simple as writing SQL templates and placing them into the "models" directory, data_mart/models/.
We follow dbt's style guide. First, we create a "staging" schema in models/staging/. Every staging model mirrors the raw data with simple renaming and casting, and nothing else. We transform the exchange rates as follows:
-- data_mart/models/staging/stg_fx_rates.sql
select
    Year::integer as year,
    Currency as currency_name,
    Rate as to_usd_rate
from raw.fx_rates
The three other staging tables follow a very similar logic.
The staging layer is not supposed to be consumed directly by downstream applications. For this, we create our second and final transformation layer, "marts." We parse currency names into codes so that they match those in our orders:
-- data_mart/models/marts/fct_fx_rates.sql
select
    year,
    case
        when currency_name = 'Euro' then 'EUR'
        when currency_name = 'Pound Sterling' then 'GBP'
        else null
    end as currency_code,
    to_usd_rate
from {{ ref('stg_fx_rates') }}
And we use those rates to convert all order amounts to USD:
-- data_mart/models/marts/fct_orders.sql
with orders as (
    select
        *,
        extract('year' from booked_date)::integer as booked_year
    from {{ ref('stg_orders') }}
), rates as (
    select * from {{ ref('fct_fx_rates') }}
)
select
    orders.order_id,
    orders.deliver_to_site_id,
    orders.order_type,
    orders.currency_code,
    orders.order_price_lcu,
    orders.order_cost_lcu,
    orders.order_price_lcu * rates.to_usd_rate as order_price_usd,
    orders.order_cost_lcu * rates.to_usd_rate as order_cost_usd,
    orders.booked_date
from orders
left join rates
    on orders.booked_year = rates.year
    and orders.currency_code = rates.currency_code
Note how we reference our final exchange rates table – this is where dbt shines! Also note that we temporarily add a year column so we can join the exchange rates on year and currency code.
The two other final tables are just 1-to-1 copies of staging, which completes our dbt modeling. We finish the dbt exercise with three steps:
- We materialize staging as views and the final layer as tables, as recommended in the dbt style guide. We can do this most comfortably through a change in the dbt project configuration file:
# bottom of data_mart/dbt_project.yml
models:
  data_mart:
    +materialized: table
    staging:
      +materialized: view
      +schema: staging
    marts:
      +materialized: table
      +schema: marts
- We configure the connection to our database:
# $HOME/.dbt/profiles.yml
data_mart:
  outputs:
    dev:
      type: duckdb
      path: # replace with absolute path
  target: dev
- Finally, we build all models in our database from the command line:
dbt run --project-dir data_mart/
We can verify this final step by querying any of the new tables. E.g.,
con = duckdb.connect('data/data-mart.duckdb')
con.query('select * from main_marts.fct_orders limit 5').df()
order_id deliver_to_site_id ... order_cost_usd booked_date
0 o00000 s00048 ... 10.2960 2020-10-28
1 o00010 s00033 ... 86.1250 2020-01-21
2 o00014 s00039 ... 20.8725 2022-04-25
3 o00015 s00035 ... 32.8580 2021-08-18
4 o00019 s00026 ... 0.3410 2020-07-27
Tutorial part 4 – metrics and dimensions.
Recall that the metrics store defines metrics and the dimensions across which we can compute them. The dedicated place for this is the metrics/ directory in the root of our git repository.
The two main objects in MetricFlow are data sources and metrics. Unlike dbt's SQL models, we configure those objects in YAML. It is then the metrics store's responsibility to translate requests into SQL and execute them against our database.
Let's start with the specification of metrics. We have defined two in the same YAML file:
# metrics/metrics.yml
metric:
  name: order_revenue_usd
  type: measure_proxy
  type_params:
    measure: order_revenue_usd
---
metric:
  name: order_margin_percent
  type: expr
  type_params:
    expr: (order_profit_usd / order_revenue_usd) * 100
    measures:
      - order_profit_usd
      - order_revenue_usd
MetricFlow gives us several ways to define a metric, selected through its type. Type measure_proxy takes any defined measure and applies its default aggregation strategy. Type expr lets us use SQL syntax. We define measures and their aggregation strategies in data sources. Here is our specification for the orders source:
# metrics/source_orders.yml
data_source:
  name: orders # must be unique across all sources
  description: Master table customer orders.
  sql_table: main_marts.fct_orders # the corresponding database table
  identifiers: # primary and foreign keys
    - name: order_id
      type: primary
    - name: site_id
      type: foreign
      expr: deliver_to_site_id
  measures:
    - name: order_revenue_usd
      description: The total order revenue in USD.
      expr: order_price_usd
      agg: sum
    - name: order_profit_usd
      description: The profit of orders in USD.
      expr: order_price_usd - order_cost_usd
      agg: sum
  dimensions: # temporal or categorical attributes, or "dimensions"
    - name: order_type
      type: categorical
    - name: booked_date
      type: time
      type_params:
        is_primary: true
        time_granularity: day
  mutability: # only relevant for caching
    type: immutable
We can aggregate measures across every dimension specified in the dimensions section. And by following the foreign key identifiers, we can even do so with dimensions from other data sources. One of the identifiers points to a foreign data source, sites:
# metrics/source_sites.yml
data_source:
  name: sites
  description: Master table customer sites.
  sql_table: main_marts.dim_sites
  identifiers:
    - name: site_id
      type: primary
    - name: account_id
      type: foreign
      expr: owner_account_id
  dimensions:
    - name: site_region
      type: categorical
  mutability:
    type: immutable
With this, we can compute order measures across site regions. And since sites are linked to accounts through another foreign key, we can even compute order measures across account dimensions.
The point of using YAML is its simplicity. You don't need in-depth engineering experience to understand the specifications. The next time the business asks how exactly a given metric is defined, point them to the corresponding specification in the YAML.
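To build some intuition for what happens at query time, here is roughly the hand-written DuckDB equivalent of "order revenue in USD by site region." The SQL that MetricFlow actually generates will look different, and the query assumes dim_sites exposes the site_id and site_region columns named in the specification above, but the join through the site_id identifier is the same idea.
# Hand-written approximation of a cross-source metric query; MetricFlow generates
# its own SQL, this is only meant to illustrate the join through the identifier.
import duckdb

con = duckdb.connect("data/data-mart.duckdb")
print(con.query("""
    select
        sites.site_region,
        sum(orders.order_price_usd) as order_revenue_usd
    from main_marts.fct_orders as orders
    left join main_marts.dim_sites as sites
        on orders.deliver_to_site_id = sites.site_id
    group by 1
    order by 1
""").df())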
Finally, we need to connect MetricFlow to our database:
# $HOME/.metricflow/config.yml
model_path: # /metrics
dwh_schema: main_marts
dwh_dialect: duckdb
dwh_database: # /data/data-mart.duckdb
You can verify the connection by executing on your command line:
mf health-check
Tutorial part 5 – MetricFlow APIs in action.
Traditionally, metrics were defined in multiple places: hidden inside BI apps, report implementations, Jupyter notebooks, and so on. We overcame this problem by moving metrics and dimensions into a single place – the metrics store. But this move only works if our metrics store integrates well with our data stack. Does our BI engine know how to communicate with MetricFlow? This will be true for some tools and false for many others.
As with many other MDS categories, integration with your current stack will drive your choice of one metrics store solution over another. Below we use MetricFlow's CLI and Python interfaces; the latter works for any BI tool that talks Python, like Streamlit.
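As a preview of the Python route, here is a minimal, hypothetical Streamlit sketch that surfaces a metric by shelling out to the mf CLI and showing its raw output; the Streamlit app in this tutorial's repository may integrate with MetricFlow differently, for example through its Python API.
# Hypothetical sketch: a Streamlit page that runs an mf CLI query and shows the output.
# Save as app.py and start it with: streamlit run app.py
import subprocess

import streamlit as st

st.title("Order revenue in USD by year")

result = subprocess.run(
    [
        "mf", "query",
        "--metrics", "order_revenue_usd",
        "--dimensions", "booked_date__year",
        "--order", "booked_date__year",
    ],
    capture_output=True,
    text=True,
)

# Show the CLI's raw text output; parsing it into a dataframe is left out for brevity.
st.code(result.stdout or result.stderr)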
We start with our first example on the command line:
mf query --metrics order_revenue_usd --dimensions booked_date__year --order booked_date__year
✔ Success