Avoid Building a Data Platform in 2024
You might think there's no value in reading about what not to build. But the proliferation of data and analytical platform tools, the waning traction of the Modern Data Stack (MDS), and the many articles about 'building a data platform' have led me to issue this warning.
It is a warning to IT professionals who work for larger corporations. As every consultant tends to answer every question: 'it depends' on your specific situation whether this warning is relevant to you. So decide for yourself based on the background information I'm about to provide here.
The Data Platform
Perhaps because Zhamak Dehghani called it "data infrastructure as a platform", or because the cloud providers sell us "Data Platform as a Service (DPaaS – not to be confused with Data Protection as a Service)", or because building a platform is simply in vogue, we are caught up in this idea.
But what is a data platform actually?
Let's take the cloud computing model of a platform, Platform as a Service (PaaS), to better understand the value proposition. It provides developers with a platform to build, deploy, and manage applications without dealing with the underlying infrastructure. PaaS includes tools and services like application frameworks, databases, and development environments, allowing developers to focus on writing code rather than managing servers and networks. While PaaS focuses on providing a platform for application development and deployment, DPaaS is specifically tailored for handling data-centric tasks and workflows.
If you read my articles about the need to redefine the data warehouse concept and the data engineering discipline, you'll see how I define data-centric. It should solely target data concerns and not address business logic concerns. However, the available data platforms try to be a better, all-encompassing development platform for everything.
Designing an architecture
I think the whole problem starts with the idea of building a platform for a specific purpose rather than designing an architecture for the whole organisation. If we try to provide something valuable on the enterprise level, we will fail if we think in products, tools and platforms like IT vendors do. It's natural for a cloud computing provider to sell products like DPaaS. But this platform product thinking will not help us to find the right design for a comprehensive enterprise architecture.
The IT architecture for the company must seamlessly link the needs of the business logic with business data concerns. The fact that we distinguish between logic and data should not steer our thinking towards an architecture with a separate application platform and data platform.
This mindset, among other negative outcomes, has also brought us to the problem termed "The Great Divide of Data".

We ended up with two decoupled platforms, one specialized in applications for operational data and one in applications for analytical data, with ETL pipelines connecting the two planes.
You might argue that, to fix this, we just need to develop a single platform for both planes instead. But let me try to convince you that it will never be possible to develop a single comprehensive platform for all IT concerns in a large enough organisation. And this applies to platforms for applications as well as for data.
Designing a data architecture that works for the enterprise is very different from what so many articles try to convey about platform building. Simply selecting tools or even complete platforms for every issue that needs to be covered in your company and putting them together like Lego bricks sounds nice and simple. That may be a strategy for an IT vendor to sell you a bundled product. But it leads to architectures with significant redundancies and overlaps if applied carelessly in a large corporation.
Attention, further platforms ahead
Developments in the data technology sector not only led to data platforms, but also shaped the discipline of Data Engineering. A discipline that seemed unavoidable given the scope and complexity of the tools provided. I have explained that this discipline has to be redefined and why – mainly because of too much overlap with other platforms and disciplines we already have.
In fact, we are currently experiencing a similar development in the technology for artificial intelligence (AI) and machine learning (ML). And it looks very much like we're again getting a platform, followed by the respective engineering discipline to handle the complexity of the platform – just this time for AI/ML. Again, driven by the vendors and the natural desire to offer comprehensive solutions for current market demand.

However, the great divide of data will grow if we install just another platform that is solely oriented towards market demand. AI/ML applications are even more dependent on the universal data supply, which I repeatedly address in my articles. Universal data supply is a concept that enables the provision of all company-relevant data for each application, regardless of the purpose of this application (operational, analytical, AI/ML, etc.).
We will not fulfil this need with yet another platform that again tries to be the better, all-encompassing environment for everything possible. And above all, we will fail if we continue to think in terms of a data platform that runs parallel to the upcoming AI/ML platform and the application platform. We must realise that this platform thinking is not a solution to our challenges at enterprise level. It's just a convenient way for IT and cloud computing vendors to bundle their products.
So if platform thinking can't save us from inefficient architectures, what can help us?
The Data Infrastructure
We need to steer away from platform thinking towards applications or services interconnected by IT infrastructure. Wait a minute, that's not modern or new at all, because this approach is standard and well known as service-oriented architecture (with microservices or monolithic applications/services) in the operational plane.
Well, I have to admit that this is true. It's the right approach for scaling business logic at enterprise level, and we urgently need something similar for scaling the data concern. However, data is different and should not be treated as an application. We need a data infrastructure and not another platform for data-centric applications.
But let's look at the difference between platform and infrastructure first. Again, the cloud computing models can help to clarify the difference. Infrastructure as a Service (IaaS) offers basic computing resources like virtual machines, storage, and networking components. With IaaS, we have more control over the operating systems and applications we run, but we must manage and maintain the underlying infrastructure.
IaaS provides the raw computing resources, whereas DPaaS abstracts these resources to offer specialized data management and analytics services, relieving users of the complexities involved in setting up and maintaining the necessary infrastructure to run their data-centric services.
If we leave out the platform parts, namely development, deployment, and analytical services, but keep the data abstraction part, we are left with the data infrastructure functionality: an infrastructure that abstracts away the data concerns of the application.
Such an infrastructure allows applications to concentrate on the business logic, relieving the application developers from the technicalities of saving and reading shared data and allowing other applications to access that data with full business context.
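To make this less abstract, here is a minimal sketch in Python of what such a data infrastructure could look like from the application's point of view. The names (DataProduct, DataInfrastructure, publish, consume) are purely illustrative assumptions, not a concrete product or reference implementation.

```python
# Illustrative sketch only: a data infrastructure seen from the application side.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class DataProduct:
    """Data bundled with its business context (schema and metadata)."""
    name: str
    schema: dict                                  # e.g. attribute names and types
    metadata: dict                                # e.g. owner, definitions, quality
    records: list[dict] = field(default_factory=list)


class DataInfrastructure:
    """Abstracts saving and reading shared data for applications."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}
        self._subscribers: dict[str, list[Callable[[DataProduct], Any]]] = {}

    def publish(self, product: DataProduct) -> None:
        # The producing application only declares what it shares; storage,
        # transport and format stay hidden behind the infrastructure.
        self._products[product.name] = product
        for callback in self._subscribers.get(product.name, []):
            callback(product)

    def consume(self, name: str, on_update: Callable[[DataProduct], Any]) -> None:
        # Consumers receive data with full business context, without knowing
        # where or how it is stored.
        self._subscribers.setdefault(name, []).append(on_update)
        if name in self._products:
            on_update(self._products[name])
```

The point is not the code itself, but the separation it expresses: applications declare what they share or need, while the infrastructure owns how data is stored and moved.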
Let's look at this from the business side to understand what business people demand from such an infrastructure.

We see business processes that implement logic to achieve business goals and optionally store results as business data. Business data can be defined as data bundled with business context – aka metadata and schema – that allows business processes to exchange information and finally collaborate.
Each process can exchange necessary business data with every other process in the company via a channel. Overall, the enterprise has a channel for interacting with the outside world. Simple, isn't it?
Of course, the enterprise is more than just three processes that exchange information in a bidirectional way. Rather, it is a complex adaptive system that intensively interacts with the outside world. The internal processes (partly digitalized as applications) are numerous and never static, but constantly evolving based on decisions made by employees acting on behalf of the organisation. This makes the employees part of the adaptive system and overall greatly complicates the model.
But it wouldn't change the key insight that we can derive from it. So let's keep it simple, add another business process just to demonstrate that the model can grow at will, and replace the bidirectional channels with an infrastructure.

We can see that an adapted Data Mesh can act as a data infrastructure that serves as an implementation for all the bidirectional channels between applications/services.
In fact, data is distributed across the entire organisation. It is easy to see that there is data inside and outside of the business processes. Data on the inside is private and is only managed within the process. Data on the outside is what the process makes available to the enterprise. This data on the outside comprises everything needed by other processes to fulfil their business goals.
We clearly see that applications/services are the digital twins of their business processes. This is an obvious analogy that, from my point of view, is greatly underrated in the IT industry.
What can we learn from this analogy? And didn't Gartner label the data mesh as "obsolete before plateau" in their Hype Cycle for Data Management, 2022? And how does the adapted data mesh differ from the data mesh? And what about the data fabric that Gartner thinks will replace the data mesh?
I don't intend to comment on the ability of Gartner analysts to foresee the future, nor will I decipher all the different data fabric and data mesh definitions in the industry. But I can explain my view on digital twins, the data fabric and how it compares to the adapted data mesh. So let's answer all these questions step by step.
Data empowers business
A digital twin of a business process is the digitalized logic that mirrors the business process functionality in real time. By leveraging data, simulations, and AI, it allows for monitoring, analysis, and optimization, enabling a live digital representation of business processes. This technology is typically mentioned in industries like manufacturing, healthcare, and smart cities, where it improves the performance and efficiency of production processes or simulates real physical things.
But this view can actually also be applied to our well-known (micro)services or applications, which can be seen as digital twins of their business processes.
A business process efficiently coordinates and manages a sequence of activities or tasks to fulfil a specific business goal. The functioning of a company can be understood as the interaction of all individual business processes to form a coherent whole. Each business process makes a small contribution to the realisation of the company's value proposition, which is usually the provision of products and services.
The business is process-driven, not data-driven as all companies so eagerly try to become. This is also quite natural, because we know what needs to be done to fulfil the company's value proposition. The whole endeavour is set in motion when the customer places an order. After placing an order, the customer's journey typically involves order processing and fulfilment, where the company confirms the order, prepares and ships the product, followed by delivery and post-purchase support to ensure customer satisfaction. Everything is process-driven.
This is exactly the view taken in business process modelling, where "Business Process Model and Notation" (BPMN) is used together with the "Business Process Execution Language" (BPEL) to formally specify and execute business processes.
In fact, business process modelling and software engineering have strong parallels, so it seems reasonable to directly apply the software development process to the modelling and implementation of business processes. The source code for implementing an IT process is strikingly similar to a document in BPMN/BPEL for implementing the business process. And the orchestration of business processes via a "Workflow Management System" (WfMS) closely corresponds to the automated workload management and scheduling of complex IT processes and application systems.
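To illustrate the parallel, here is a deliberately simplified sketch of the order-fulfilment flow from above, written as plain Python. A BPMN/BPEL document would specify the same sequence of steps declaratively, and a WfMS would orchestrate it much like the orchestration function below; the step names are invented for illustration.

```python
# Simplified sketch: a business process expressed as source code.

def confirm_order(order: dict) -> dict:
    return {**order, "status": "confirmed"}

def prepare_shipment(order: dict) -> dict:
    return {**order, "status": "prepared"}

def ship_product(order: dict) -> dict:
    return {**order, "status": "shipped"}

def post_purchase_support(order: dict) -> dict:
    return {**order, "status": "supported"}

def order_fulfilment_process(order: dict) -> dict:
    """The orchestration of steps, i.e. what a WfMS does for BPMN/BPEL models."""
    for step in (confirm_order, prepare_shipment, ship_product, post_purchase_support):
        order = step(order)
    return order

print(order_fulfilment_process({"order_id": 42, "status": "placed"}))
```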
In reality, however, completely separate IT specialist areas have developed with mutually incompatible languages and tools. People still seem to think that a formal specification (in BPMN / BPEL) of business processes is different from a specification (in source code) of applications. I think there are still a lot of synergies to be exploited if existing incompatibilities, organizational hurdles and walls between business process modelling and software engineering are torn down.
The orchestration of applications/services can therefore be seen as the managed deployment of digital twins for business processes. Input data triggers the chain of processes that are kept in motion by data exchange and flow through the enterprise. Universal data supply therefore enables the digital enterprise and empowers business by making data ubiquitous. It's the fundamental principle needed for a company to get data-empowered rather than just data-driven.
Data fabric or data mesh?
Unfortunately, there exists more than one definition of this thing called data fabric. But let's take the one from Gartner as a starter:
Data fabric is a data management design for attaining flexible, reusable and augmented data integration pipelines that utilizes knowledge graphs, semantics and active metadata-based automation in support of faster, and in some cases, automated data access and sharing regardless of deployment options, use cases (operational or analytical) and/or architectural approaches. It is not one single tool or technology.
Source: Gartner, Inc. 2024 – https://www.gartner.com/en/data-analytics/topics/data-fabric
Yes, my fellow data engineers and architects, that's an analyst's take on the data fabric. Admittedly, it is difficult to describe something so complex in a single sentence. But we can deal with the topic in more detail here. So let's delve into the definition and compare it with the adapted data mesh.
By the way, if you want to get more information about the difference between the original and the adapted data mesh, I recommend reading my three-part series on it – you won't get a one-sentence definition, but you will be much better prepared for what follows.
Data fabric is a data management design for attaining flexible, reusable and augmented data integration pipelines
Okay, so what is a data integration pipeline? Without an official definition, let's take mine:
A data integration pipeline (or data pipeline for short) is a sequence of steps that move, process and integrate data from source to target systems.
We have data sources, data processing or transformation, data integration and target systems. It is not entirely clear whether these pipelines are part of the data fabric itself or whether the data fabric just helps to attain a flexible, reusable and augmented pipeline that is located elsewhere. But because the definition also says that data fabric is not a single tool or technology, it seems to be the latter.
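For clarity, here is a minimal sketch of such a pipeline; the source records, transformation logic and target are invented for illustration.

```python
# Minimal sketch of a data integration pipeline: move, process, integrate.

def extract(source: list[dict]) -> list[dict]:
    # Move: read raw records from the source system.
    return list(source)

def transform(records: list[dict]) -> list[dict]:
    # Process: cleanse and reshape the records.
    return [{"customer": r["name"].strip().title(), "revenue": float(r["amount"])}
            for r in records]

def load(records: list[dict], target: list[dict]) -> None:
    # Integrate: write the processed records into the target system.
    target.extend(records)

source_system = [{"name": " alice ", "amount": "120.50"}]
target_system: list[dict] = []

# The pipeline is simply the ordered composition of these steps.
load(transform(extract(source_system)), target_system)
print(target_system)
```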
The adapted data mesh considers all business processes as implemented applications/services (or digital twins) that exchange data within the enterprise. The provision of data needed by downstream processes should be actively addressed by the producing applications with data products.
Hence, the data infrastructure inside the mesh is a means to exchange data products (data with business context) between applications. We won't find the typical data pipelines anymore because applications/services are not part of the data infrastructure itself. Instead, any component implementing business logic needs to be part of the overall IT application architecture interconnected by the adapted data mesh. Integration logic in particular is to be addressed by the redefined data warehouse discipline.
This attains flexible, reusable and augmented data integration pipelines by decoupling the integration logic, now implemented as applications/services, from the data concerns.
Data fabric utilizes knowledge graphs, semantics and active metadata-based automation
The adapted data mesh defines the ontology as the top-down enterprise data model to align bottom-up process-driven data models with the enterprise view. The governance processes as outlined in the third part of the series offer the guiding framework for the applications/services to participate.
This exactly addresses the utilization of semantics and knowledge graphs. The rich business context in a data product (metadata) can be used to actively foster automation.
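As a simple illustration of what metadata-based automation could mean in this setting, consider the following sketch: a hypothetical enterprise ontology, a data product whose attributes reference ontology terms, and an automated governance check that validates the mapping. All names and terms are assumptions for illustration.

```python
# Illustrative sketch: metadata referencing an enterprise ontology drives automation.

ENTERPRISE_ONTOLOGY = {
    "Customer": {"description": "A party that buys products or services"},
    "Revenue": {"description": "Monetary value of sold products or services"},
}

customer_revenue_product = {
    "name": "customer_revenue",
    "schema": {"customer": "string", "revenue": "decimal"},
    # Active metadata: each attribute references a term in the ontology.
    "semantics": {"customer": "Customer", "revenue": "Revenue"},
}

def validate_semantics(product: dict, ontology: dict) -> list[str]:
    """Automated governance step: flag attributes without a known ontology term."""
    return [attr for attr, term in product["semantics"].items()
            if term not in ontology]

print("unmapped attributes:", validate_semantics(customer_revenue_product, ENTERPRISE_ONTOLOGY))
```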
Data fabric supports faster, and in some cases, automated data access and sharing
The adapted data mesh implements the universal data supply to ultimately foster global data sharing across all participating applications/services. The data infrastructure in the adapted data mesh abstracts over streaming and batch processing, allowing faster and greatly simplified sharing of data products over the adapted data mesh.
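A minimal sketch, assuming a hypothetical consumer-side API, shows what abstracting over streaming and batch processing could mean: the consumer chooses the access pattern, while the transport and storage details stay hidden behind the data infrastructure.

```python
# Illustrative sketch: one data product, two access patterns.
from typing import Iterator


class DataProductReader:
    def __init__(self, records: list[dict]) -> None:
        self._records = records

    def as_batch(self) -> list[dict]:
        # Analytical consumers may prefer the complete snapshot at once.
        return list(self._records)

    def as_stream(self) -> Iterator[dict]:
        # Operational consumers may prefer record-by-record processing.
        yield from self._records


reader = DataProductReader([{"customer": "Alice", "revenue": 120.5}])
print(reader.as_batch())
for record in reader.as_stream():
    print(record)
```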
Data fabric provides all this, regardless of deployment options, use cases (operational or analytical) and/or architectural approaches
The adapted data mesh does provide a data abstraction that allows for easy and transparent exchange of data products for all kinds of applications, whether operational or analytical. It doesn't matter if the participating data producers and consumers are organized as decoupled microservices or monolithic applications.
Any component in the enterprise can use the abstractions offered by the adapted data mesh to participate in data sharing, thereby enabling the universal data supply.
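Building on the hypothetical DataInfrastructure sketch from earlier, the following example shows how an operational and an analytical consumer could participate in data sharing through the very same abstraction; the services and the data product are again invented for illustration.

```python
# Usage example reusing the DataProduct and DataInfrastructure classes sketched above.

infrastructure = DataInfrastructure()

# Analytical consumer, e.g. a reporting service.
infrastructure.consume("customer_revenue",
                       lambda p: print("reporting on", len(p.records), "records"))

# Operational consumer, e.g. a billing service.
infrastructure.consume("customer_revenue",
                       lambda p: print("billing customer", p.records[0]["customer"]))

# The producing application publishes its data product once; both consumers
# receive it with full business context.
infrastructure.publish(DataProduct(
    name="customer_revenue",
    schema={"customer": "string", "revenue": "decimal"},
    metadata={"owner": "sales"},
    records=[{"customer": "Alice", "revenue": 120.5}],
))
```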
Conclusion
As data engineers and architects for larger corporations, we need to properly address the business requirements for information exchange between all types of applications and services as the digital twins of their business processes.
I explained that the concept of universal data supply addresses this key business requirement. It can be implemented by following the principles of the adapted data mesh, which provides a data infrastructure that fully complies with the definition of Gartner's data fabric.
This approach is fundamentally different from the current platform thinking driven by the vendors. While pre-packaged data platforms can be of great benefit to corporations, we have to be careful not to confuse them with a clean data architecture for the whole enterprise.
If you find this information useful, please consider clapping. I would be more than happy to receive your feedback, opinions and questions.