A Guide to Building Performant Real-Time Data Models

Author: Murphy  |  2025-03-23
Photo by Lukas Blazek on Unsplash

Data has become a critical tool for decision-making. To be actionable, data needs to be cleaned, transformed, and modeled.

This process is often part of an ELT pipeline that runs at a given frequency, for example daily.

On the other hand, stakeholders sometimes need access to the most recent data so they can adjust and react quickly.

For example, if there is a huge drop in the number of users of a website, they need to be aware of this issue quickly and be given the necessary information to understand the problem.

The first time I was asked to build a dashboard with real-time data, I connected it directly to the real-time raw table and surfaced some simple KPIs, such as the number of users and crashes. For monthly graphs and deeper analysis, I created another dashboard connected to our data model, which was updated daily.

This strategy was not optimal: I was duplicating logic between the data warehouse and the BI tool, so it was harder to maintain. Moreover, the real-time dashboard could only perform well with a few days of data, so stakeholders had to switch to the historical one to check earlier dates.

I knew we had to do something about it. We needed real-time data models without compromising performance.

In this article, we'll explore different solutions to build real-time models, and their pros and cons.

Views

An SQL view is a virtual table that contains the result of a query. Unlike tables, views do not store data. They are defined by a query that is executed every time someone queries the view.

Here is an example of a view definition:

CREATE VIEW orders_aggregated AS (
  SELECT 
    order_date, 
    COUNT(DISTINCT order_id) AS orders,
    COUNT(DISTINCT customer_id) AS customers
  FROM orders
  GROUP BY order_date
 )

Even when new rows are added to the table, views stay up to date. However, if the table is big, views might become very slow as no data is stored.

They should be the first option to try out if you are working on a small project.

If the logic is complex, with joins and window functions, you'll probably face extremely long loading times for your dashboards.

Pros

✅ They are really easy to set up

✅ They are always up-to-date

Cons

❌ They perform poorly with big amounts of data or complex calculations

Often-Refreshed Tables

If your data needs to be very recent but not exactly real-time, a good solution is to refresh the table very often.

Here is how we could define a query to refresh our table every half hour:

DELETE FROM orders_aggregated 
WHERE order_date BETWEEN "2023-07-17 08:00:00" AND "2023-07-17 08:30:00";
INSERT INTO orders_aggregated (
  SELECT 
    order_date, 
    COUNT(DISTINCT order_id) AS orders,
    COUNT(DISTINCT customer_id) AS customers
  FROM orders
  WHERE order_date BETWEEN "2023-07-17 08:00:00" AND "2023-07-17 08:30:00"
  GROUP BY order_date
 )

At Scopely, we have some models that are refreshed every half hour. Those models have great performance and provide information from the current day. The use case is not critical enough to require data from the last 30 minutes.

We did add a little complexity to our pipeline: it runs so often that it sometimes fails. When our daily pipeline fails, we simply rerun it manually, but that would be a nightmare for a model that runs 48 times a day. We therefore added an extra piece of code to ensure that if a run fails, the next one also picks up the data the failed run missed.
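A minimal sketch of that catch-up logic, here using SQLite for illustration (the `orders`, `orders_aggregated`, and `refresh_state` watermark tables are assumptions of this example, not our actual pipeline code): each run reprocesses every day since the last successful run, so the window of a failed run is automatically covered by the next one.

```python
import sqlite3

def refresh(conn, run_date):
    cur = conn.cursor()
    # Day of the last successful refresh; fall back to the beginning of time.
    since = cur.execute(
        "SELECT MAX(refreshed_until) FROM refresh_state"
    ).fetchone()[0] or "1970-01-01"
    # Idempotent delete-then-insert over the whole catch-up window.
    cur.execute("DELETE FROM orders_aggregated WHERE order_date >= ?", (since,))
    cur.execute(
        """INSERT INTO orders_aggregated
           SELECT order_date,
                  COUNT(DISTINCT order_id),
                  COUNT(DISTINCT customer_id)
           FROM orders
           WHERE order_date >= ?
           GROUP BY order_date""",
        (since,),
    )
    # Advance the watermark only after the refresh succeeded, so a failed
    # run leaves it untouched and the next run covers the missed window.
    cur.execute("INSERT INTO refresh_state VALUES (?)", (run_date,))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id, customer_id, order_date);
    CREATE TABLE orders_aggregated (order_date, orders, customers);
    CREATE TABLE refresh_state (refreshed_until);
""")
```

Because the delete-then-insert is idempotent over the catch-up window, rerunning a window twice gives the same result as running it once.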

Pros

✅ They achieve very good performance

Cons

❌ Data is not exactly real-time

❌ Pipeline runs often and needs to be watched closely

Materialized Views

Most modern cloud data warehouses have an object called a materialized view. A materialized view is an object that stores the result of a query in a physical table.

In some data warehouses, materialized views need to be refreshed with a trigger, but in others, like BigQuery, they can be refreshed automatically when new rows are added. This guarantees a good quality of data as the incremental logic is handled by the warehouse itself.

The downside of using materialized views is that they come with many constraints.

If we try to use the same query as before to build a materialized view in BigQuery:

CREATE MATERIALIZED VIEW orders_aggregated AS (
  SELECT 
    order_date, 
    COUNT(DISTINCT order_id) AS orders,
    COUNT(DISTINCT customer_id) AS customers
  FROM orders
  GROUP BY order_date
 )

We get an error:

Incremental materialized views do not support the DISTINCT clause for aggregation function 'count'

Materialized views usually support a restricted syntax. Instead, we could use the function APPROX_COUNT_DISTINCT:

CREATE MATERIALIZED VIEW orders_aggregated AS (
  SELECT 
    order_date, 
    APPROX_COUNT_DISTINCT(order_id) AS orders,
    APPROX_COUNT_DISTINCT(customer_id) AS customers
  FROM orders
  GROUP BY order_date
 )

Pros

✅ They combine the performance of a table with the simplicity of a view

✅ There is no need to design the incremental logic

Cons

❌ The allowed syntax is extremely limited

Lambda View

This idea came to us when we realized our logic was very simple: we used views for real-time data and tables for historical data. So why couldn't we use a view that was just a UNION ALL of the two?

After a little bit of research, we found out that this concept already had a name: lambda architecture.

Lambda architecture is a system that combines batch processing (the incremental table part) and streaming (the view part).

Full of enthusiasm, I tried building a lambda view on top of my orders_aggregated table. I would simply use a WHERE filter on a date and update the view every day to set the filter to the current date.

CREATE VIEW orders_aggregated_lv1 AS (
  -- Batch layer
  SELECT
    *
  FROM orders_aggregated
  WHERE DATETIME(order_date) < "2023-07-17"

  UNION ALL
  -- Stream layer
  SELECT
    order_date,
    COUNT(DISTINCT order_id) AS orders,
    COUNT(DISTINCT customer_id) AS customers
  FROM orders
  WHERE DATETIME(order_date) >= "2023-07-17"
  GROUP BY order_date
);

-- Test query
SELECT *
FROM orders_aggregated_lv1
WHERE DATETIME(order_date) = "2023-07-15"

But the execution time was very disappointing. Looking at the execution graph, I understood why: BigQuery's planner was still recomputing the whole view, even though it should only have looked at the table.

Image by author

Instead of deducing from the WHERE filter that it never needs to touch the stream layer, the planner evaluates the order_date column across the whole view.

So I used a small workaround and hardcoded the date value for the view:

CREATE VIEW orders_aggregated_lv2 AS (
  -- Batch layer
  SELECT
    *
  FROM orders_aggregated
  WHERE DATETIME(order_date) < "2023-07-17"

  UNION ALL
  -- Stream layer
  SELECT
    DATE("2023-07-17") AS order_date,
    COUNT(DISTINCT order_id) AS orders,
    COUNT(DISTINCT customer_id) AS customers
  FROM orders
  WHERE DATETIME(order_date) = "2023-07-17"
  GROUP BY order_date
);

-- Test query
SELECT *
FROM orders_aggregated_lv2
WHERE DATETIME(order_date) = "2023-07-15"

This time, the execution time was very fast and the planner only read data from the table!

Image by author

However, we thought we might miss part of the data if this query did not run at exactly midnight, or if our daily pipeline failed.

So we added one day of margin:

CREATE VIEW orders_aggregated_lv3 AS (
  -- Batch layer
  SELECT
    *
  FROM orders_aggregated
  WHERE DATETIME(order_date) < "2023-07-17"

  UNION ALL
  -- Stream layer, current day
  SELECT
    DATE("2023-07-17") AS order_date,
    COUNT(DISTINCT order_id) AS orders,
    COUNT(DISTINCT customer_id) AS customers
  FROM orders
  WHERE DATETIME(order_date) = "2023-07-17"
  GROUP BY order_date

  UNION ALL
  -- Stream layer, one day of margin
  SELECT
    DATE("2023-07-18") AS order_date,
    COUNT(DISTINCT order_id) AS orders,
    COUNT(DISTINCT customer_id) AS customers
  FROM orders
  WHERE DATETIME(order_date) = "2023-07-18"
  GROUP BY order_date
);

If our pipeline fails and we don't fix it immediately, we still have the data, it's just a little less performant.

Since we are working with dbt (data build tool), we were able to write that iteration logic in Jinja2 syntax and avoid repeating ourselves.
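A rough sketch of what that dbt model could look like (the model names and the margin variable are assumptions of this example; dbt does expose `run_started_at` and `modules.datetime` in its Jinja context):

```sql
-- Hypothetical dbt model: the batch layer plus one hardcoded stream-layer
-- block per day of margin, generated with a Jinja loop at compile time.
{% set margin_days = 2 %}

SELECT *
FROM {{ ref('orders_aggregated') }}
WHERE DATETIME(order_date) < "{{ run_started_at.strftime('%Y-%m-%d') }}"

{% for i in range(margin_days) %}
{% set day = (run_started_at + modules.datetime.timedelta(days=i)).strftime('%Y-%m-%d') %}
UNION ALL
SELECT
  DATE("{{ day }}") AS order_date,
  COUNT(DISTINCT order_id) AS orders,
  COUNT(DISTINCT customer_id) AS customers
FROM {{ ref('orders') }}
WHERE DATETIME(order_date) = "{{ day }}"
GROUP BY order_date
{% endfor %}
```

Because the dates are rendered at compile time, the model still needs a daily `dbt run` to keep them current; the margin days are what keep the view correct when that run is late or fails.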

Pros

✅ Lambda views achieve great performance

✅ Several days of margin can be defined to make sure they keep being up to date when the pipeline fails

Cons

❌ The logic to make the query planner efficient is complex

❌ One lambda view relies on at least two database objects

Which one is the best solution?

There is no best solution per se; it all depends on your data and use case. However, if you are still struggling to decide after reading about the four options, here is a little decision tree to help you reach a decision:

Image by author

I hope you enjoyed this article! If you did, please follow me for more content on Python, SQL, and analytics, like this tutorial on ELT pipelines:

How to Build an ELT with Python
