From Data Lakes to Data Mesh: A Guide to the Latest Enterprise Data Architecture

Image by takahiro takuchi (Unsplash)

There are major ‘data earthquakes' underway at large organisations worldwide.

They are the result of companies decentralising their data lakes towards data mesh.

At one of Australia's ‘Big Four' banks where I've worked in the analytics space for half a decade, we are smack in the middle of a huge transformation journey, building out a number of big-ticket infrastructure items all at once:

  • Migration towards a cloud-native data platform on Azure PaaS;
  • Construction of an array of strategic data products;
  • Federation of our data lake into a decentralised data mesh.

But I gotta admit… it's kind of disruptive for our poor old data analysts and data scientists.

Imagine trying to do calligraphy while builders are renovating your room.

Update: I now post analytics content on YouTube.

Data lakes – the watering holes data scientists love drinking from – are currently being pulled apart like dough and federated out to individual business domains as part of the mesh dream.

Some data scientists are looking on with curiosity; others sigh with frustration; and then there are those who are excited beyond belief.

Why?

Because data mesh promises to be a truly scalable data platform where data is treated as a first-class citizen. Data scientists will have access to discoverable, reliable and reusable data assets that can be seamlessly shared between different business domains across the company.

Short-term pain for long-term gain, as they say.

In this article, I'll dive into:

  • How data lakes became a bottleneck;
  • Why organisations are now decentralising their lakes into data mesh;
  • How to build a mesh infrastructure at your company.

1. A Brief History of Data Lakes

The enterprise data landscape has evolved really fast in the past ten years.

In the mid-2010s, data lakes began gaining popularity in companies around the world. The concept had existed for some years already, but it was during this period that the technology required to build these centralised data storage beasts became feasible.

Bloody good timing too.

The explosion of smartphones, Internet of Things (IoT), digital & social media, and e-commerce converged into the rise of big data, and with this, a pressing need for organisations to store huge volumes of unstructured data and milk them for insights using data analytics and machine learning.

Data lakes offered a scalable and flexible solution without the need for pre-defined schemas, unlike data warehouses.
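
To make the contrast concrete, here is a minimal schema-on-read sketch using PySpark. It is an illustration only: the lake path and fields are hypothetical. Raw events land in the lake as-is, and a schema is applied only when someone reads the data, whereas a warehouse would demand that schema up front at write time.

```python
# A minimal schema-on-read sketch (illustrative only; path and fields are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# 1. Raw JSON events were dumped into the lake without declaring any schema up front.
raw_path = "/lake/raw/clickstream/"  # hypothetical lake location

# 2. The schema is imposed only at read time, by whoever needs the data.
clicks_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("duration_s", DoubleType()),
    StructField("event_time", TimestampType()),
])

clicks = spark.read.schema(clicks_schema).json(raw_path)
clicks.groupBy("page").count().show()
```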

Intersection of data science, machine learning and big data. Image by author

What about the software to run on top of all this?

Enter Apache Hadoop, an open-source framework that offered distributed storage (HDFS) and distributed compute (MapReduce) capabilities for big data sitting in the lake.
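
To give a flavour of the programming model, here is a minimal word-count sketch in the Hadoop Streaming style (a hedged illustration, not production code). On a real cluster the mapper and reducer run as separate scripts across many machines, with HDFS holding the input splits and results; here both phases are simulated in a single Python process reading standard input.

```python
# Word count in the classic MapReduce shape (illustrative sketch only).
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: Hadoop sorts by key between the phases; here we sort locally
    # and then sum the counts for each word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```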

Hadoop's origins rest all the way back in a pair of seminal papers published by Google researchers in the early 2000s (on the Google File System and MapReduce); Yahoo! then drove much of Hadoop's early development. By 2010, Facebook boasted the biggest Hadoop cluster in the world, with 21 petabytes of storage.

A few years later, half of Fortune 50 companies had adopted the framework.

Hadoop's cost-effective storage on cheap commodity hardware and its big data processing capabilities made it an attractive choice for organisations worldwide looking to build data lakes.

Predictive models trained on big data have exploded since the mid-2010s. Image by author

Next came cloud computing, which took off big time between 2015 and 2020.

Juggernauts like Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform (GCP) offered scalable storage solutions like Amazon S3, Azure Data Lake Storage and Google Cloud Storage.

As a result, many organisations migrated their data lakes sitting on-premise onto the cloud, enabling them to elastically adapt to changing workloads, scale up and down at a whim and pay for only what they use.

Microsoft now even offers SaaS cloud analytics platforms that bring together data warehousing, big data, data engineering and data management under one roof.

Happy days right?

Well…

2. The Data Lake Monster

Architect and data mesh inventor Zhamak Dehghani condensed the history of enterprise data platforms into three generations:

First generation: proprietary enterprise data warehouse and business intelligence platforms; solutions with large price tags that have left companies with equally large amounts of technical debt in thousands of unmaintainable ETL jobs, and tables and reports that only a small group of specialised people understand, resulting in an under-realized positive impact on the business.

Second generation: big data ecosystem with a data lake as a silver bullet; complex big data ecosystem and long running batch jobs operated by a central team of hyper-specialised data engineers have created data lake monsters that at best has enabled pockets of R&D analytics; over-promised and under-realised.

Third generation: more or less similar to the previous generation, with a modern twist towards streaming for real-time data availability with architectures, unifying the batch and stream processing for data transformation, as well as fully embracing cloud-based managed services for storage, data pipeline execution engines and machine learning platforms.

She really didn't mince her words.

Decades of data warehouses left organisations drowning in a sea of data systems connected by a mess of data pipelines. The magic solution was meant to be consolidating data into a single central repository. Unfortunately, across many organisations the data lake dream devolved into a data swamp.

Overflowing with vast amounts of untapped data and unresolved data quality issues, the water became stale.

Excessively centralising anything can open a can of worms, often resulting in a dramatic swing back towards decentralisation after a period of turmoil. We see this across various aspects of human society:

  • Business → companies flattening towards a more horizontal hierarchy;
  • Finance → rise of decentralised assets like bitcoin, ethereum and solana;
  • Economics → planned economies adopting capitalism;
  • Government → decisive coup d'etat or a revolution by the people.

Enterprise data encountered its own taste of too much centralisation in the form of its data lake experiment.

Here were the consequences, which Dehghani described as ‘modes of failure'. Ouch.


Problem 1 – Centralised conundrums!

We ended up with a centralised, domain-agnostic, jack-of-all-trades data platform managed by an overworked central data team who weren't experts on the data they were handling.

This hotshot data engineering team was expected to ingest operational and transactional data from all corners of the enterprise across all business domains.

They then had to cleanse, enrich and transform the data so that it could serve the needs of a diverse set of consumers, like dashboarding and reporting by data analysts and modelling by data scientists.

Yikes! That's a lot to ask for.

ETL pipelines into and out of data lakes. Built by a centralised data team. Source: Z. Dehghani at MartinFowler.com (with permission)

As organisations worldwide raced to become data-driven in order to stay competitive, the requirement to handle all analytical questions from decision-makers across many business areas rested on the agility of this centralised team.

And that became a major problem.

These hapless data engineers didn't have the time.

With tweaks to operational databases came a trail of broken ETL pipelines into the centralised data lake.

The team had a constant backlog of pipelines to patch up, leaving little time to focus on making sense of the domain-specific data they were pulling in from across the organisation or pushing out to hungry data scientists.

Central data team becoming a bottleneck. Source: Data Mesh Architecture (with permission)

As a result, after some initial quick wins, central data teams worldwide ran into tremendous scalability issues and became a bottleneck to organisational agility.

In summary – like a wild party, data warehouses spread technical debt like confetti by letting everyone and their pet hamster create data pipelines.

Then data lakes yanked the spotlight and choreographed a grand centralisation spectacle by squeezing the creation of data pipelines to a single point in the company.

Companies now found themselves quenching the insatiable data thirst of an entire armada of analysts, data scientists and managers by squeezing the flow of data through a tiny straw manned by the central data team.

Takeaway:

Perhaps decentralising the data lake into domain-specific teams might be the Goldilocks sweet spot to aspire towards?


Problem 2 – Lethargic operating model

Dehghani described the second problem as coupled pipeline decomposition, where data lakes…

"…have a high coupling between the stages of the pipeline to deliver an independent feature or value. It's decomposed orthogonally to the axis of change."

This isn't easy to understand unless you're an architectural guru, so let me use a personal example.

A few years ago, the bank I work at uprooted our entire organisational structure from a functional operating model into a lines-of-business (LoB) model.

What does that mean? I'll illustrate by focusing on accountability for poor mortgage sales performance – mortgages being a core business for any bricks-and-mortar bank.

Under the functional model, someone was in charge of mortgage product development, someone else was responsible for distribution and sales, and someone else was responsible for things like legal, risk and compliance.

Image by author

This led to a lack of accountability for poor performance, because these functions are highly coupled and require inter-functional teams to work with each other in order to deliver end-to-end mortgage products.

Put another way, the axis of change runs in the direction of products: mortgages, credit cards, business lending, etcetera. Yet the work pipelines were decomposed orthogonally to that axis by chopping up each product into a bunch of highly coupled functions.

Bad bad bad.

Under the new LoB model, a single executive is responsible for the mortgages business. A single executive takes care of credit cards. A single executive steers business lending.

Image by author

Pipelines are now parallel to the axis of change. The rudder and the ship are now aligned.

This drives:

  • Accountability – no bonus for Mr. C-Suite if the mortgages business underperforms.
  • Agility – the mortgage business can quickly reorganise around new projects and innovate. Change is easier and faster.
  • Performance – agility and innovation result in better products and services, delivered faster.

Back to data lakes.

It turns out that they also inherently use a functional approach to organising work, where data pipelines are decomposed into processing stages, such as sourcing, ingesting, processing and serving data.

Processing stages in a data pipeline. Source: Z. Dehghani at MartinFowler.com (with permission)

And like my banking example, these processing stages are highly-coupled.

Does that data science team building a model on credit scoring suddenly need to be served different data? That means different data needs to be processed, which might mean new data needs to be sourced and ingested.

This means that our centralised data team needs to manage a slew of constantly evolving dependencies, resulting in a slower delivery of data and a lack of agility.

In fact, these dependencies mean that the entire pipeline through the data lake is the smallest unit of change that must be modified to cater for a new functionality.

Hence we say that data lakes are monolithic data platforms.

It's really one big piece that's hard to change and upgrade, which, as Dehghani contends:

"…limits our ability to achieve higher velocity and scale in response to new consumers or sources of the data."

Takeaway:

Could we address this by decentralising the data lake into a more modular architecture with business domains taking end-to-end responsibility for their data?


Problem 3 – Fence-throwing

Dehghani calls the third and final mode of failure siloed and hyper-specialised ownership, which I like to think of as resulting in unproductive fence-throwing.

Our hyper-specialised data engineers working in the data lake are organisationally siloed away from where the data originates and where it will be consumed.

Siloed hyper-specialised data platform team. Source: Z. Dehghani at MartinFowler.com (with permission)

This creates a poor incentive structure that does not promote good delivery outcomes. Dehghani articulates this as…

"I personally don't envy the life of a data platform engineer. They need to consume data from teams who have no incentive in providing meaningful, truthful and correct data. They have very little understanding of the source domains that generate the data and lack the domain expertise in their teams. They need to provide data for a diverse set of needs, operational or analytical, without a clear understanding of the application of the data and access to the consuming domain's experts.

What we find are disconnected source teams, frustrated consumers fighting for a spot on top of the data platform team backlog and an over stretched data platform team."

Data producers will ‘pack together' some of their data and throw it over the fence to the data engineers.

Your problem now! Good luck guys!

Overworked data engineers, who may or may not have done justice to the ingested data given that they're not data domain experts, will themselves throw some processed data out of the lake to serve downstream consumers.

Good luck, analysts and data scientists! Time for a quick nap and then I'm off to fix the fifty broken ETL pipelines on my backlog.

As you can see from Problems 2 and 3, the challenges that have arisen from the data lake experiment are as much organisational as technological.

Takeaways:

By federating data management to individual business domains, perhaps we could foster a culture of data ownership and collaboration and empower data producers, engineers and consumers to work together?

And hey, can we give these domains a real stake in the game?

Empower them to take pride in building strategic data assets by incentivising them to treat data like a hot-selling product?


3. Introducing…Data Mesh!

In 2019, Dehghani proposed data mesh as the next-generation data architecture, one that embraces a decentralised approach to data management.

Her initial articles – [here](https://martinfowler.com/articles/data-mesh-principles.html) and here – generated significant interest in the enterprise data community, which has since prompted many organisations worldwide to begin their own data mesh journey, including mine.

Rather than pump data into a centralised lake, data mesh federates data ownership and processing to domain-specific teams that control and deliver data as a product. This promotes easy accessibility and interconnectivity of data across the entire organisation, enabling faster decision-making and innovation.

Overview of data mesh. Source: Data Mesh Architecture (with permission)

The data mesh dream is to create a foundation for extracting value from analytical data at scale, with scale being applied to:

  • An ever-changing business, data and technology landscape.
  • Growth of data producers and consumers.
  • Varied data processing requirements. A diversity of use cases demands a diversity of tools for transformation and processing. For instance, real-time anomaly detection might leverage Apache Kafka; an NLP system for customer support often starts with data science prototyping using Python packages like NLTK; image recognition leverages deep learning frameworks like TensorFlow and PyTorch; and the fraud detection team at my bank would love to process our big data with Apache Spark.

All these requirements have created technical debt for warehouses (in the form of a mountain of unmaintainable ETL jobs) and a bottleneck for data lakes (due to the mountain of diverse work that's squeezed through a small centralised data team).

Organisations eventually hit a threshold of complexity where the technical debt outweighs the value provided.

It's a terrible situation.

To address these problems, Dehghani proposed four principles that any data mesh implementation must embody in order to realise the promise of scale, quality and usability.

The 4 Principles of Data Mesh. Source: Data Mesh Architecture (with permission)
  1. Domain Ownership of Data: By placing data ownership in the hands of domain-specific teams, you empower those closest to the data to take charge. This approach enhances agility to changing business requirements and effectiveness in leveraging data-driven insights, which ultimately leads to better and more innovative products and services, faster.
  2. Data as a Product: Each business unit or domain is empowered to infuse product thinking to craft, own and continually improve quality, reusable data products – self-contained and accessible data sets treated as products by the data's producers. The goal is to publish and share data products across the data mesh to consumers sitting in other domains – considered as nodes on the mesh – so that these strategic data assets can be leveraged by all.
  3. Self-Serve Data Platform: Empowering users with self-serve capabilities paves the way for accelerated data access and exploration. By providing a user-friendly platform equipped with the necessary tools, resources, and services, you empower teams to become self-sufficient in their data needs. This democratisation of data promotes faster decision-making and a culture of data-driven excellence.
  4. Federated Governance: Centralised control stifles innovation and hampers agility. A federated approach ensures that decision-making authority is distributed across teams, enabling them to make autonomous choices when it counts. By striking the right balance between control and autonomy, you foster accountability, collaboration and innovation.

Of the four principles, data products are the most crucial. As a result, we often see companies execute their data product strategy in tandem with decentralising their data lake across their individual business domains.

Read my Explainer 101 on data products for all the juicy details.

On the topic of executing a data mesh strategy…

4. How to Build a Data Mesh

For most companies, the journey won't be clean and tidy.

Building a data mesh won't be a task relegated to a siloed engineering team toiling away in the basement until it's ready to deploy.

You'll likely need to cleverly federate your existing data lake piece-by-piece until you reach a data platform that is ‘sufficiently mesh'.

Think swapping out two aircraft engines for four smaller ones mid-flight, rather than building a new plane in a nice shady hangar somewhere.

Or trying to upgrade a road while keeping some lanes open to traffic, instead of paving a new one in parallel nearby and cutting the red ribbon once everything is nice and dandy.

Building a data mesh is a substantial undertaking and you'll need to bring the business along on the ride at all stages. Because it's the business domains that will ultimately be in charge of their own end-to-end data affairs!

Full data mesh maturity may take a long time, because mesh is principally an organisational construct.

It is just as much about operating models – in other words, people and processes – as the technology itself, meaning cultural uplift and bringing people along for the journey is essential.

You need to teach the organisation the value of mesh and how to use it.

Play your cards right, and over time your centralised domain-agnostic monolithic data lake will morph into a decentralised domain-oriented modular data mesh.

Some considerations for the design phase. Check out datamesh-architecture.com for a deeper dive.

  • Domains. A data mesh architecture comprises a set of business domains, each with a domain data team that can perform cross-domain data analysis on its own. An enabling team – often part of the organisation's transformation office – spreads the idea of mesh across the organisation and serves as its advocate. They help individual domains on a consultancy basis on their journey to become a ‘full member' of the data mesh. The enabling team will comprise experts in data architecture, data analytics, data engineering and data governance.
  • Data products. Domains will ingest their own operational data – which they sit very close to and understand – and build analytical data models as data products that can be published on the mesh. Each data product is owned by its domain, which is responsible for its operations, quality and uplift across its entire lifecycle. Effective accountability to ensure effective data (see the sketch just after this list).
The sharing of data products across the mesh. Source: Data Mesh Architecture (with permission)
  • Self-serve. Remember those ‘multicultural food days' at school, where everyone brought their delicious dishes and shared them at a self-serve table? The teacher's minimalist role was to oversee operations and ensure everything went smoothly. In a similar vein, mesh's newly streamlined central data team endeavour to provide and maintain a domain-agnostic ‘buffet table' of diverse data products from which to self-serve. Business teams can perform their own analysis with little overhead and offer up their own data products to their peers. A delicious data feast where everyone can also be the chef.
  • Federated governance. Each domain will self-govern its own data and be empowered to march to the beat of its own drum – like European Union member states. On certain matters where it makes sense to unite and standardise, domains will strike agreements on global policies, such as documentation standards, interoperability and security, in a federated governance group – like the European Parliament – so that individual domains can easily discover, understand, use and integrate the data products available on the mesh.
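
To make ‘data as a product' a little more tangible, here is a minimal sketch of the kind of metadata a domain might publish alongside its data product. Every field and value below is a hypothetical illustration; real implementations usually capture this in a YAML or JSON data-product (or data-contract) specification rather than Python.

```python
# A hypothetical data product descriptor (illustrative only; not a standard spec).
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str                 # discoverable name on the mesh
    domain: str               # owning business domain, accountable end-to-end
    owner: str                # contact for quality issues and uplift
    description: str          # what the product contains and who it is for
    output_port: str          # where consumers read it (e.g. a lake path or table)
    update_frequency: str     # freshness promise to consumers
    quality_checks: list = field(default_factory=list)  # published quality guarantees

mortgage_applications = DataProduct(
    name="mortgage_applications_daily",
    domain="mortgages",
    owner="mortgages-data-team@bank.example",
    description="Cleansed daily snapshot of mortgage applications and decisions.",
    output_port="abfs://products@examplelake.dfs.core.windows.net/mortgages/applications/",
    update_frequency="daily by 06:00 AEST",
    quality_checks=["no null application_id", "decision in {approved, declined, pending}"],
)

print(mortgage_applications.name, "is owned by the", mortgage_applications.domain, "domain")
```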

Here's the exciting bit – when will our mesh hit maturity?

The mesh emerges when teams start using other domains' data products.

This serves as a useful benchmark to aim for, attesting that your data mesh journey has reached a threshold level of maturity.
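
For instance, cross-domain consumption might look something like the sketch below, which reuses the hypothetical mortgage data product from earlier. It assumes pandas with the adlfs/fsspec drivers installed; the path, credentials and column names are placeholders, not a real endpoint.

```python
# A risk-domain analyst reads the mortgages domain's published data product
# straight from its output port, with no ticket raised on a central team's backlog.
# Assumes pandas + adlfs/fsspec; all names and credentials are hypothetical.
import pandas as pd

output_port = "abfs://products@examplelake.dfs.core.windows.net/mortgages/applications/"

applications = pd.read_parquet(
    output_port,
    storage_options={"account_key": "<storage-account-key>"},  # or SAS token / AAD credential
)

approval_rate = (applications["decision"] == "approved").mean()
print(f"Approval rate: {approval_rate:.1%}")
```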

A good time to pop the champagne!

5. Final Words

Data mesh is a relatively new idea, having only been invented by architect Zhamak Dehghani around 2019.

It has gained significant momentum in the data architecture and analytics communities as an increasing number of organisations grapple with the scalability problems of a centralised data lake.

By moving away from an organisational structure where data is controlled by a single team and towards a decentralised model where data is owned and managed by the teams that use it the most, different parts of the organisation can work independently – with greater autonomy and agility – while still ensuring that the data is consistent, reliable and well-governed.

Data mesh promotes a culture of accountability, ownership and collaboration, where data is productised and treated as a first-class citizen that's proudly shared across the company in a seamless and controlled manner.

The aim is attaining a truly scalable and flexible data architecture that aligns with the needs of modern organisations where data is central to driving business value and innovation.

Summarising the Four Principles of Data Mesh. Credit: Z. Dehghani at MartinFowler.com (with permission)

My company's own journey towards data mesh is expected to take a couple of years for the main migration, and longer for full maturity.

We're working on three major parts simultaneously:

  • Cloud. An uplift from our Cloudera stack on Microsoft Azure IaaS to native cloud services on Azure PaaS. More info here.
  • Data products. An initial array of foundational data products are being rolled out, which can be used and re-assembled in different combinations like Lego bricks to form larger more valuable data products.
  • Mesh. We're decentralising our data lake to a target state of at least five nodes.

What a ride it has been. When I joined half a decade ago, we were just beginning to build out our data lake using Apache Hadoop on top of on-prem infrastructure.

Countless challenges and invaluable lessons have shaped our journey.

Like any determined team, we fail fast and fail forward. Five short years later, we have completely transformed our enterprise data landscape.

Who knows what things will look like in another five years? I look forward to it.

Find me on Twitter & YouTube [here](https://youtube.com/@col_shoots), [here](https://youtube.com/@col_invests) & here.

My Popular AI, ML & Data Science articles

  • AI & Machine Learning: A Fast-Paced Introduction – here
  • Machine Learning versus Mechanistic Modelling – here
  • Data Science: New Age Skills for the Modern Data Scientist – here
  • Generative AI: How Big Companies are Scrambling for Adoption – here
  • ChatGPT & GPT-4: How OpenAI Won the NLU War – here
  • GenAI Art: DALL-E, Midjourney & Stable Diffusion Explained – here
  • Beyond ChatGPT: Search for a Truly Intelligence Machine – here
  • Modern Enterprise Data Strategy Explained – here
  • From Data Warehouses & Data Lakes to Data Mesh – here
  • From Data Lakes to Data Mesh: A Guide to Latest Architecture – here
  • Azure Synapse Analytics in Action: 7 Use Cases Explained – here
  • Cloud Computing 101: Harness Cloud for Your Business – here
  • Data Warehouses & Data Modelling – a Quick Crash Course – here
  • Data Products: Building a Strong Foundation for Analytics – here
  • Data Democratisation: 5 ‘Data For All' Strategies – here
  • Data Governance: 5 Common Pain Points for Analysts – here
  • Power of Data Storytelling – Sell Stories, Not Data – here
  • Intro to Data Analysis: The Google Method – here
  • Power BI – From Data Modelling to Stunning Reports – here
  • Regression: Predict House Prices using Python – here
  • Classification: Predict Employee Churn using Python – here
  • Python Jupyter Notebooks versus Dataiku DSS – here
  • Popular Machine Learning Performance Metrics Explained – here
  • Building GenAI on AWS – My First Experience – here
  • Math Modelling & Machine Learning for COVID-19 – here
  • Future of Work: Is Your Career Safe in Age of AI – here

