Data Engineering Patterns

Author: Murphy  |  2025-03-22
Sustainable Technology – Image by Author generated by DALLE-3

How should developers and data practitioners start to incorporate environmental factors when developing end-to-end data solutions?

In this article, we approach sustainability from a data practitioner's point of view, raising awareness of the magnitude of the environmental implications of data handling activities. We identify inefficient patterns that lead to increased energy consumption and delve into strategies to tackle them. The focus will be on code, design strategies, compute optimization, and additional actionable sustainable patterns that contribute to the end goal of sustainability.


Table of Contents

1. Energy is at the core of every application (The Why…)

2. Significant energy consumption is the consequence of unsustainable engineering patterns (The What…)

3. Sustainable Data Engineering Patterns (The How…)

Summary

References


Energy is at the core of every application (The Why…)

Energy consumption is at the core of every data transaction between a user and an application. The larger the data, the more energy applications consume. Continuous global growth in development is driving the increase in energy consumption and, consequently, in CO2 emissions. The tech-driven ICT (information and communication technology) sector alone is responsible for roughly 4% of CO2 emissions across the globe.

Fast delivery and short time to market, especially in IT start-up environments where the priority is business continuity and getting things done fast, often result in inefficient development practices and complex IT infrastructure consuming massive amounts of energy. The misconception that such emissions are negligible, combined with the high cost and time needed to develop energy-efficient infrastructure and code, has led companies to de-prioritize efforts toward sustainable improvements.

The boom of AI in the past years adds further costs to the environmental equation. The energy it takes to train large language models is reaching new heights, with 170B-parameter models estimated to emit up to 50.5 tons of CO2. Time to market for applications is now faster than ever with AI agents equipped with tools for automated development and deployment of solutions. AI will reduce the time developers take to bring energy-consuming applications live, potentially breaking new records for the number of applications running worldwide. One can nonetheless argue that AI can also play a key role in optimizing code and infrastructure toward more sustainable solutions.

Companies can mitigate such consumption by embedding Sustainable Development into their applications through optimization of software and data processing at infrastructure, code, and user experience levels. Employing such strategies not only leads to a significant reduction in the carbon footprint but also contributes to the global goal of sustainability.


Significant energy consumption is the consequence of unsustainable engineering patterns (The What…)

Factors contributing to energy consumption are distributed across the data processing workflow. Each stage can incorporate patterns that result in increased CO2 emissions. It is important to list and discuss such patterns to develop awareness and understanding of their impact when evaluating such designs.

Unsustainable Data Engineering patterns- Image by Author

Data Ingestion

Data ingestion is regarded as one of the first steps on the data handling ladder. As a crucial stage that often marks the beginning of engineers' and analysts' careers, it is particularly susceptible to inefficient and energy-consuming patterns.

Pattern 1 – Fields Forever United: Data sources often contain more data than analytical use cases require. This pattern involves skipping ahead to a `SELECT *` statement and ingesting all data in the hope that it will be needed at some point in the future. An easy decision with a big environmental impact: it increases the compute required for data movement (on both source systems and consumer applications) and the storage capacity needed to host unutilized data.
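As a minimal sketch of the difference, the snippet below contrasts a full `SELECT *` ingestion with a narrow projection, using an in-memory SQLite table. The `orders` schema is hypothetical; the `raw_payload` column stands in for wide fields that no downstream use case reads:

```python
import sqlite3

# In-memory demo database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, raw_payload TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, i * 1.5, "x" * 1000) for i in range(100)],
)

# Unsustainable: ingest everything, including the large unused payload column.
all_rows = conn.execute("SELECT * FROM orders").fetchall()

# Sustainable: ingest only the fields the use case actually needs.
needed = conn.execute("SELECT id, amount FROM orders").fetchall()

print(len(all_rows[0]), len(needed[0]))  # 3 2
```

Projecting only `id` and `amount` moves a fraction of the bytes per row, a saving that compounds across every scheduled run.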

Pattern 2 – All or Nothing: A production system with a few connected IoT devices and a fair customer base will continuously generate and update data. Skipping the crucial effort of designing the right architecture for efficient data consumption results in redundant solutions that load data batches as 1-to-1 copies of source systems. For small data sizes that might prove acceptable, yet for large data it quickly becomes inefficient and unsustainable.
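A common alternative to 1-to-1 copies is watermark-based incremental loading. The sketch below is illustrative only: the records and the `incremental_load` helper are hypothetical, and a real implementation would persist the watermark between runs:

```python
# Hypothetical source records carrying an updated_at change timestamp.
source = [
    {"id": 1, "value": "a", "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "value": "b", "updated_at": "2024-01-03T00:00:00"},
    {"id": 3, "value": "c", "updated_at": "2024-01-05T00:00:00"},
]

def incremental_load(records, watermark):
    """Return only the records changed since the last successful run."""
    # ISO-8601 timestamps compare correctly as plain strings.
    return [r for r in records if r["updated_at"] > watermark]

# Unsustainable: a full 1-to-1 copy moves every row, every run.
full_copy = list(source)

# Sustainable: only rows newer than the stored high-watermark move.
delta = incremental_load(source, watermark="2024-01-02T00:00:00")
print(len(full_copy), len(delta))  # 3 2
```

The moved volume shrinks from the full table to only what changed, which is where the energy saving comes from as tables grow.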

Pattern 3 – Round the Clock: Data can be created and updated at irregular intervals. Some solutions tackle such irregularities with brute-force approaches that ensure every time slot is covered. Scheduling pipelines to trigger at very short intervals leads to compute being provisioned even when no new data has arrived, which is a recipe for increased energy consumption.
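Back-of-the-envelope arithmetic makes the cost of brute-force scheduling visible. The numbers below are hypothetical: a source that actually produces data six times a day, versus a pipeline polled every 15 minutes:

```python
# Hypothetical: the source produces new data six times a day.
arrival_hours = [0, 3, 3.5, 11, 18, 23]

# Unsustainable: poll every 15 minutes regardless of arrivals.
polls_per_day = 24 * 60 // 15  # 96 pipeline runs, most finding no new data

# Sustainable: trigger once per arrival event
# (e.g. a storage or message-queue notification).
event_runs = len(arrival_hours)  # 6 runs

print(polls_per_day, event_runs)  # 96 6
```

Event-driven triggers cut the runs (and the provisioned compute behind them) to the number of actual data arrivals.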

Pattern 4 – Let's Try Again: What happens when a data import fails? Let's try again, it might work this time. Many implementations incorporate automatic retry mechanisms to ensure production systems are robust enough to recover from occasional failures. Data pipelines, on the other hand, can incur large costs and heavy consumption relying on such mechanisms if not engineered accurately, specifically when retries fail to account for already completed progress.
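One way to make retries cheap is checkpointing completed work so a retry resumes instead of restarting. The sketch below simulates a transient failure on the third batch; `run_with_checkpoint` and the in-memory `state` dict are hypothetical stand-ins for a persisted checkpoint store:

```python
processed = []
state = {"done": set()}          # stand-in for a persisted checkpoint store
fail_once = {"flag": True}

def process(batch):
    # Simulate a transient failure the first time batch "c" is seen.
    if batch == "c" and fail_once["flag"]:
        fail_once["flag"] = False
        raise RuntimeError("transient failure")
    processed.append(batch)

def run_with_checkpoint(batches, process, state):
    """Retry-friendly loop: skip batches already recorded as done."""
    for i, batch in enumerate(batches):
        if i in state["done"]:
            continue  # completed progress survives a retry
        process(batch)
        state["done"].add(i)

batches = ["a", "b", "c", "d"]
try:
    run_with_checkpoint(batches, process, state)
except RuntimeError:
    run_with_checkpoint(batches, process, state)  # retry resumes, not restarts

print(processed)  # ['a', 'b', 'c', 'd'] — 'a' and 'b' are not reprocessed
```

Without the checkpoint, the retry would reprocess every batch, paying the full compute bill twice for one failure.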

Each of the listed patterns has a sustainable, robust, and efficient counterpart that we discuss in later sections. For now, we continue delving into unsustainable patterns, moving to data transformation, the next stage in the data lifecycle.

Data Transformation

Raw data must pass through transformation stages to reach a data-product-ready state.

Pattern 5 – Old and Stale: The existence of data does not guarantee its freshness. Data transformation stages are often scheduled or triggered without validating whether new data has actually arrived, leading to redundant, energy-consuming runs over stale data. While some transformations may require little compute, others involving large datasets can incur substantial costs and energy if not optimized accordingly.
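A freshness guard can short-circuit such runs. The sketch below assumes the platform exposes a last-modified timestamp for the source and that the last transform time is recorded somewhere; both values and the `should_transform` helper are hypothetical:

```python
from datetime import datetime, timedelta

def should_transform(last_source_update, last_transform_run):
    """Run the transformation only when new data has landed since the last run."""
    return last_source_update > last_transform_run

now = datetime(2024, 6, 1, 12, 0)
last_run = now - timedelta(hours=1)

# Source last changed five hours ago: the scheduled run can be skipped.
stale = now - timedelta(hours=5)
print(should_transform(stale, last_run))   # False

# Source changed ten minutes ago: the transform is actually needed.
fresh = now - timedelta(minutes=10)
print(should_transform(fresh, last_run))   # True
```

The check itself costs almost nothing, while the run it skips may involve a full cluster spin-up.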

Pattern 6 – Inequality: Low compute utilization is a key indicator of inequality. Successful pipeline runs are an engineer's beginning, not the finish line. Resource tuning is often disregarded in favor of simply ensuring runs succeed. Such scenarios exhibit inefficient utilization of computing resources, where multi-node clusters are up and running with only a few nodes performing all the heavy lifting.
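Data skew is a frequent cause of this pattern: one hot key pins most of the work to a single node while the rest idle. The sketch below simulates it with a deterministic CRC32 partitioner and a hypothetical "hot" customer key, then spreads the hot key round-robin, a simple form of key salting:

```python
import zlib
from collections import Counter

n_nodes = 8
# Hypothetical workload: a single hot key dominates the dataset.
keys = ["hot"] * 9_000 + [f"cust_{i}" for i in range(1_000)]

def plain_partition(key, n):
    # Every "hot" record hashes to the same node.
    return zlib.crc32(key.encode()) % n

def salted_partition(key, i, n):
    # Spread the hot key across nodes round-robin; other keys hash as usual.
    return i % n if key == "hot" else zlib.crc32(key.encode()) % n

plain = Counter(plain_partition(k, n_nodes) for k in keys)
salted = Counter(salted_partition(k, i, n_nodes) for i, k in enumerate(keys))

# Without salting, one node handles ~90% of all records.
print(max(plain.values()), max(salted.values()))
```

Distributed engines offer the same idea natively, for example salting skewed join keys or enabling adaptive skew handling in Spark, so clusters can be sized smaller for the same workload.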

Pattern 7 – Algorithmic Evolution: Compute power does not justify bad code. The amount of computing power one has access to nowadays provides just enough performance for many to avoid further investment in code optimization. Down the road, the consequences manifest as a pile of technical debt that hinders growth and further scalability.
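A small, common example of the gap between "works" and "efficient": de-duplicating incoming IDs against a list versus a set. The sizes below are hypothetical and kept small; the asymptotic gap (O(n·m) versus O(n+m)) grows with data volume:

```python
import time

existing_ids = list(range(20_000))
incoming = list(range(19_500, 20_500))

# Unsustainable: O(n*m) membership checks against a list.
t0 = time.perf_counter()
dupes_slow = [x for x in incoming if x in existing_ids]
slow = time.perf_counter() - t0

# Sustainable: build a set once, then O(1) average-case lookups.
t0 = time.perf_counter()
id_set = set(existing_ids)
dupes_fast = [x for x in incoming if x in id_set]
fast = time.perf_counter() - t0

assert dupes_slow == dupes_fast  # same result, far less work
print(f"list: {slow:.4f}s  set: {fast:.4f}s")
```

The same correct answer is produced either way; only the energy spent getting there differs, and at production scale that difference is what this pattern is about.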

