Data Warehouse, Redefined
Okay, I used "Redefined" again – I already did in my article about the need to redefine data engineering. "Oh my god, what else does he want to redefine?", you might ask.
Well, I think there are quite a few things in today's Data Architecture that deserve a closer look. But the motivation to critically examine the current definition of the data warehouse goes back to a question from one of my readers.
I wrote a three-part series on the challenges and solutions in data mesh. The adapted data mesh, as described in those articles, serves as a replacement architecture for data collection approaches like the data warehouse (traditional, modern, and variations like the data lake/lakehouse). However, I emphasized that we still need the data warehouse as one among many applications contributing information to the data mesh.
The question was essentially: "Do I really need twice as many engineers to set up the new (data mesh) and the old (data warehouse)?"
Why Redefinition?
I think this deserves a clear answer, and with the answer comes the need for a redefinition of the data warehouse.
To make it clear right away, I'm not saying this lightly. I even built my consulting company on the ideas of the data warehouse concept. So it was not easy for me to admit to my clients and myself that it was the wrong approach to achieve truly universal data supply in the company. Let me explain why I think that the data warehouse does not fully deliver on this promise.
Let's take a look back at the beginnings, when Bill Inmon formulated the idea of the data warehouse. The core objectives he wanted to achieve were:
- (Re-)integrate the information islands that originated from the fact that operational applications stored business data in isolated databases.
- Appropriately separate the analytical from the operational workload to avoid unwanted mutual interference and also to enable complete tracking of the data history, which was not possible in operational systems.
Hence, it seemed logical and reasonable to offload the operational data to a separate system to reintegrate the siloed data stores and at the same time relieve the operational databases from analytical workload.
With this approach he tried to work around these two main problems:
- The impossibility of realizing all the company's processing requirements with a single application – or with several applications that share an integrated database.
- The technical inability of database systems at the time (mainly relational systems) to equally cope with operational and analytical workload.
The latter problem, being mainly a technical one, has since been resolved. Today we have tools and databases that can handle a mixed operational and analytical workload – either a single Unified Real-Time (Data) Platform (URP) or a distributed architecture such as the data mesh, providing a rich set of data tooling known today as the modern data stack. Building a data warehouse for this technical reason is therefore no longer justified.
Unfortunately, the former problem has not yet been solved. We still don't have a fully integrated version of all the information created by so many applications in the enterprise. Even Domain-Driven Design (DDD) has not saved us from the emergence of information islands. Information integration is regrettably still an afterthought in the operational world. But if the operational plane fails to take care of the integration problem, it is in fact only postponed to the data plane.
Information Integration is Still Needed
We still need to integrate isolated information, and the data warehouse is one approach capable of doing this downstream of the operational systems. But it is naive to believe that a single central concept or architecture running in the data plane can completely compensate for what has been missed in the operational plane.
Even the godfather of the dimensional data warehouse, Ralph Kimball, said that the enterprise data warehouse can only be implemented as physically independent data marts interlinked by the data warehouse bus architecture.
I totally agree that this problem can only be solved with a highly distributed approach. However, the adapted data mesh seems to be the better architecture concept for this. Even if the integration task is greatly simplified following modern enterprise data modeling, there is still a consolidation need for central business departments such as controlling, risk management, finance, or marketing.
These central business departments often need cross-domain aggregation on abstract business objects. These objects cannot be directly obtained from a single domain-specific data product. It therefore makes sense to implement this integration logic once for several consumers.

As an illustrative example, take a typical commercial bank that runs a different operational system for each product type it offers to its customers. In this simple example, we have four banking product systems, each of which has created a separate data product provided in the data mesh. Each data product belongs to a different business domain – say deposits, current accounts, term loans, and mortgages. Both the controlling and the risk management department, for example, need to analyze KPIs across all product types, aggregated to an abstract product category and customer type that the source systems do not provide. Hence, both departments need to map the source-system-specific product attributes to an enterprise-wide product category, combine them with customer data, and aggregate over all four product types.
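To make the integration logic tangible, here is a minimal sketch of the bank example in Python. All names – the four source systems, the product categories, the record fields – are hypothetical illustrations, not taken from any real banking system:

```python
# Illustrative sketch of the cross-domain integration logic described above.
# All system names, categories, and fields are hypothetical examples.

# One sample record per domain data product, as delivered via the data mesh.
deposits = [{"product_id": "D1", "customer_type": "retail", "volume": 100.0}]
current_accounts = [{"product_id": "C1", "customer_type": "retail", "volume": 50.0}]
term_loans = [{"product_id": "T1", "customer_type": "corporate", "volume": 200.0}]
mortgages = [{"product_id": "M1", "customer_type": "retail", "volume": 300.0}]

# Mapping from source-system-specific product types to an abstract,
# enterprise-wide product category – maintained once, reused by both
# controlling and risk management.
PRODUCT_CATEGORY = {
    "deposits": "savings",
    "current_accounts": "payments",
    "term_loans": "lending",
    "mortgages": "lending",
}

def aggregate_kpis(sources: dict[str, list[dict]]) -> dict[tuple[str, str], float]:
    """Aggregate volumes per (product_category, customer_type) pair."""
    totals: dict[tuple[str, str], float] = {}
    for system, records in sources.items():
        category = PRODUCT_CATEGORY[system]
        for rec in records:
            key = (category, rec["customer_type"])
            totals[key] = totals.get(key, 0.0) + rec["volume"]
    return totals

kpis = aggregate_kpis({
    "deposits": deposits,
    "current_accounts": current_accounts,
    "term_loans": term_loans,
    "mortgages": mortgages,
})
# Term loans and mortgages both roll up into the "lending" category,
# split by customer type – exactly the kind of cross-domain aggregation
# a single domain data product cannot deliver on its own.
```

The point of the sketch is that the mapping table and the aggregation are written once and serve several consuming departments, instead of each department re-implementing them.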
The Data Warehouse Reduced to Integration and Transformation
Data warehouses can implement such cross-domain integration and transformation services for multiple consumers. Their function can essentially be reduced to the creation of specialized data products that enable an integrated view of otherwise isolated data stores. As a matter of fact, large corporations already run several, more or less interconnected data warehouses. This happened because in a large corporation it is practically impossible to create one single data store for all the enterprise's data requirements – just as it is impossible to create one single application for all the enterprise's processing requirements.
We can redefine the data warehouse as a pure integration and transformation application or as a plane of such applications and services. In this setup, it will no longer serve as the single version of truth or the single centralized data store of complete enterprise-wide information for all. It will act as one application among many others to serve the specific needs of central business departments. Each of these applications in the plane should contribute data products to an adapted data mesh in order to realize the universal data supply.
Unbundling the Data Warehouse
In addition to integration services, data warehouses often also offer generic query services for other analytical or operational consumer applications (BI, reporting, AI/ML applications). This database functionality is typically part of the data warehouse itself. But it can equally be distributed to other, more specialized serving applications that receive the data via the data mesh.
By creating data products and feeding them into the data mesh, we can also efficiently supply data to consumer applications that don't need such query functionality but instead want to consume the data directly, as a batch or as a stream. These consumer applications can be operational as well as analytical, which means that this approach closes the great divide between the operational and analytical planes.
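The two consumption styles can be sketched in a few lines. This is a deliberately minimal illustration – the data product and its records are hypothetical, and a real mesh would serve them via files, tables, or topics rather than an in-memory list:

```python
# Hedged sketch: one hypothetical data product, consumed either as a
# complete batch or record by record as a stream. Names are illustrative.
from typing import Iterator

DATA_PRODUCT = [
    {"product_category": "lending", "volume": 300.0},
    {"product_category": "savings", "volume": 100.0},
]

def read_batch() -> list[dict]:
    """Batch consumer: take the whole data product at once."""
    return list(DATA_PRODUCT)

def read_stream() -> Iterator[dict]:
    """Stream consumer: process one record at a time as it arrives."""
    yield from DATA_PRODUCT

# Either style works without any query engine in between; an analytical
# consumer might sum over the stream, an operational one might react
# to each record individually.
total = sum(rec["volume"] for rec in read_stream())
```

The same data product serves both styles, which is what lets operational and analytical consumers share one interface.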
The naive idea that data quality can be reconstructed or injected downstream of the operational systems should finally be laid to rest. It's as if a car manufacturer believed that quality problems in its production processes could be rectified by downstream quality assurance. The often-cited medallion architecture is a symptom of this misconception. If applicable at all, the idea only fits regulatory reporting needs. Such "quality enhancements" downstream of data product creation can only be a workaround for a poor information architecture upstream. Quality has to be addressed at the source, because any attempt to fix it later causes a disproportionate amount of work.

What we're really doing is splitting the monolithic data warehouse architecture into separate, independent services and interconnecting them via the data mesh. Extraction and data quality improvement need to be shifted left to the source applications in the business domains that create the data products. Integration and transformation still occur within the data warehouse plane, but it's important to remember that the application of business logic should be managed by the central business departments, not the data team. The data team is the maintainer of the data mesh. The redefined data warehouse, however, is to be built by the central business departments using services from the data mesh.
Such reduced functionality may no longer warrant the name data warehouse – after all, it's more a conceptual plane of data integration and transformation applications (coordinated pipelines, if you like). But since I have no talent whatsoever for inventing catchy names for architectural components, I will leave it at the redefinition of the data warehouse.
In a world of distributed services that work together to meet overall business needs, the idea of a centrally managed data warehouse is backward-looking. Neither the evolution to the Modern Data Warehouse (MDW) nor the data lakehouse, whether on-premises or in the cloud, is able to fully deliver on the promise of universal data supply.
These centralized data collection methods, which originated from traditional data warehouse thinking, are increasingly turning into new, massive data behemoths. They claim to centrally manage everything related to data, but we should not go down this path.
While using highly integrated applications might work for small and agile companies, we can't address the diverse data needs of large enterprises with such monolithic systems or architectures. We need to break them down into smaller, decoupled components that work together via a network.
In the processing plane, we use the well-known distributed microservices architecture. In the data plane, the adapted data mesh, which interconnects applications and services from the processing plane via data products, is better equipped for this task than the good old data warehouse.
Let's therefore redefine the data warehouse and reduce it to a data integration and transformation conceptual plane that creates data products designed as pure data structures for cross-domain requirements – see my article "Deliver Your Data as a Product, But Not as an Application" to understand what I mean by pure data structures.
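My reading of "pure data structure" can be sketched as follows – data plus schema, with no embedded application logic. The type and field names here are hypothetical illustrations, not a prescribed schema:

```python
# Hedged sketch of a data product record designed as a pure data structure:
# schema plus data, no embedded behavior. Field names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)  # immutable: consumers read, they don't mutate
class ProductKpiRecord:
    product_category: str  # abstract, cross-domain category (e.g. "lending")
    customer_type: str     # e.g. "retail" or "corporate"
    volume: float

# The data product is then just a collection of such records plus metadata –
# no methods, no query engine, no application bundled with it.
record = ProductKpiRecord("lending", "retail", 300.0)
```

Because the record carries no behavior, any consumer in the mesh can read it without depending on the producing application's runtime.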
If you find this information useful, please consider clapping. I would be more than happy to receive your feedback, opinions, and questions.