A Guide to Data Engineering Infrastructure


Modern data stacks consist of various tools and frameworks for processing data. Typically this means a large collection of different cloud resources that transform data and bring it to a state where we can generate insights from it. Managing the multitude of these data processing resources is not a trivial task and might seem overwhelming. The good news is that data engineers invented a solution called infrastructure as code: essentially, code that helps us deploy, provision and manage all the resources we might ever need in our data pipelines. In this story, I would like to discuss popular techniques and existing frameworks that aim to simplify resource provisioning and data pipeline deployments. I remember how, at the very beginning of my data career, I deployed data resources through the web user interface, e.g. storage buckets, security roles and so on. Those days are long gone, but I still remember the joy when I learned that it could all be done programmatically using templates and code.


Modern Data Stacks

What exactly is a Modern Data Stack (MDS)? Broadly, it is the set of technologies specifically used to organise, store, and manipulate data [1]. This is what helps shape a modern and successful data platform. I raised this discussion in one of my previous stories:

Data Platform Architecture Types

A simplified data platform blueprint often looks like this:

Simplified Data platform blueprint. Image by author.

It usually contains dozens of different data sources and cloud platform resources to process them.

There might be different data platform architecture types depending on business and functional requirements, the skill set of our users, etc., but in general the infrastructure design falls into several data processing layers, each defining the resources required at that stage:

  1. External and internal data sources – APIs (CRM systems, accounting software, etc.), relational databases (MySQL, Postgres), cloud storage and many others we might want to create ourselves.
  2. Extract – services that do the data extraction work, e.g. Cloud Functions, compute instances, database migration services and Change Data Capture (CDC) systems.
  3. Transform and Load – ELT / ETL services that actually transform our data, handle different file formats and orchestrate data ingestion into a data warehouse solution (DWH) or data lake. It can be anything that performs efficient data manipulation, e.g. a PySpark application built in Databricks or an AWS Glue job. It can also be a tiny Cloud Function invoked by message queue events (RabbitMQ, SQS, Pub/Sub, Kafka, etc.).
  4. Storage – it all depends on our data pipeline design pattern [2] and can be a data lake, DWH or OLAP database. It typically performs a storage function, i.e. a landing area for data files, and acts as a proxy stage for many other pipelines.

Data pipeline design patterns

Personally, I would never consider storage as a starting point of my data pipeline as it doesn't generate data per se but many engineers do.

Sometimes it makes sense when we don't have access to the original data source that outputs data into this storage.

  5. Data governance – sometimes it is useful to place the DWH solution into a separate data processing stage. Indeed, data management, role-based access controls and robust data governance features are what make these tools so practical in modern data stacks [3]. I discussed it here:

Modern Data Warehousing

Querying data using SQL unifies Analytics across teams. It's true that many people in data are not proficient with coding and therefore won't be able to perform effective data manipulation in data lakes with various data types (semi-structured, unstructured).

It is not a secret that some data analysts might experience difficulties with the JSON format.

Now imagine we have all sorts of data formats in the data lake – AVRO, Parquet, ORC [4] and unstructured images and texts.

Big Data File Formats, Explained

  6. Orchestration – we need something to orchestrate all that madness. Sometimes it might be a good idea to move the data pipeline orchestrator into a separate stage or stack. It usually sits in my "transform and load" stage, but I know many companies keep it separate, and that makes perfect sense when there is a large number of data pipelines. One of my previous tutorials on such an orchestration service can be found here [5]:

Data Pipeline Orchestration

  7. Business intelligence (BI) and reporting – this is what delivers the value from data. We all like the nice-looking reports and dashboards that BI developers kindly create for us. There is a variety of tools on the market – Looker, Mode, Sisense, etc. – and they comprise the modern BI stack. A more or less comprehensive list with major features can be found below:

Looker Studio (formerly Google Data Studio)

Key features:

  • Free tool for BI with community-based support
  • Great collection of widgets and charts
  • Great collection of community-based data connectors
  • Free email scheduling and delivery. Perfectly renders reports into an email.
  • Free data governance features
  • As it's a free community tool, its API is somewhat underdeveloped

Looker (paid version)

Key features:

  • Robust data modelling features and self-serving capabilities. Great for medium and large size companies.
  • API features

Tableau

Key features:

  • Outstanding visuals
  • Reasonable pricing
  • Patented VizQL engine driving its intuitive analytics experience
  • Connections with many data sources, such as Hadoop, SAP, and various database technologies, improving data analytics quality
  • Integrations with Slack, Salesforce, and many others.

AWS Quicksight

Key features:

  • Custom-branded email reports
  • Serverless and easy to manage
  • Robust API
  • Serverless auto-scaling
  • Pay-per-use pricing

Power BI

Key features:

  • Excel integration
  • Powerful data ingestion and connection capabilities
  • Shared dashboards from Excel data made with ease
  • A range of visuals and graphics is readily available

Sisense (formerly Periscope Data)

Sisense is an end-to-end data analytics platform that makes data discovery and analytics accessible to customers and employees alike via an embeddable, scalable architecture.

Key features:

  • Offers data connectors for almost every major service and data source
  • Delivers a code-free experience for non-technical users, though the platform also supports Python, R, and SQL
  • Git integration and custom datasets
  • Might be a bit expensive, as it's based on a per-user licensing model
  • Some features are still under development, e.g. report email delivery and report rendering

ThoughtSpot

Key features:

  • Natural language for queries

Mode

Key features:

  • CSS design for dashboards
  • Collaboration features to allow rapid prototyping before committing to a premium plan
  • Notebook support
  • Git support

Metabase

Key features:

  • Great for beginners and very flexible
  • Has a docker image so we can run it straight away
  • Self-service Analytics

Redash

Key features:

  • API
  • Write queries in their natural syntax and explore schemas
  • Use query results as data sources to join different databases

Deploying data pipelines

This would be a conventional data flow or data platform design for many companies. The challenge is how to deploy data pipelines and manage all the resources associated with each of them. For instance, we might be tasked with deploying a machine learning pipeline in production and staging environments. So how do we manage all the resources it will use? It is not a trivial task, but once managed with infrastructure as code, the setup becomes scalable and easy to maintain. I previously published a tutorial here [6]:

Orchestrate Machine Learning Pipelines with AWS Step Functions

I think this is a good example that demonstrates how difficult it might be to maintain our data pipeline codebase. So the complexity of code is one of the main factors.

Our data pipelines need to be designed in a way that makes them easy to test, deploy and maintain across different environments.
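To make this concrete, here is a minimal Terraform sketch (the module path and names are hypothetical) of parameterising a pipeline resource by environment, so the same code can be deployed to staging and production:

# terraform/module/ml_pipeline/main.tf (hypothetical module)
variable "environment" {
  type        = string
  description = "Deployment environment, e.g. staging or production"
}

# One artifact bucket per environment, named after the environment it belongs to
resource "aws_s3_bucket" "ml_artifacts" {
  bucket = "ml-pipeline-artifacts-${var.environment}"

  tags = {
    Environment = var.environment
    Pipeline    = "ml-training"
  }
}

Deploying the same module with environment = "staging" and environment = "production" produces two isolated copies of the pipeline resources, which keeps testing well away from live data.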

Now imagine the scenario when our data volumes grow rapidly. This might be a result of our application success or any other reason but our data pipelines must be ready for this. In other words, we would want to provision required resources for them with ease when needed.

As a result, the scalability of our infrastructure is another challenging factor.

Another reason that drove me to start using infrastructure-as-code deployments was improved monitoring and visibility for my data pipelines. I remember the days (very old days) at university when I was only embarking on my career journey, deploying AWS Lambda functions and a scheduler application on EC2 instances, mostly with shell scripts.

It was incredibly hard to identify data pipeline outages.

Infrastructure as code, whether it is CloudFormation or Terraform, helps to solve this problem as well. Consider the CloudFormation YAML template snippet below. It creates the resources for data pipeline alerts and notifications via email:

# stack.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: AWS S3 data lake stack with Lambda.
Parameters:
 ...
 Email:
   Type: String
   Description: Email address to notify when Lambda has triggered an alarm
   Default: [email protected]

Resources:
   ...
   ...
   ...
 AlarmNotificationTopic:
   Type: AWS::SNS::Topic
   Properties:
     Subscription:
     - Endpoint:
         Ref: Email
       Protocol: email
 PipelineManagerLambdaLogGroup:
   DeletionPolicy: Retain
   Type: AWS::Logs::LogGroup
   Properties:
     RetentionInDays: 7
     LogGroupName: /aws/lambda/pipeline-manager
 PipelineManagerLambdaERRORMetricFilter:
   Type: 'AWS::Logs::MetricFilter'
   DependsOn: PipelineManagerLambdaLogGroup
   Properties:
     LogGroupName: /aws/lambda/pipeline-manager
     FilterPattern: '?ERROR ?Error ?error'
     MetricTransformations:
       - MetricValue: '1'
         MetricNamespace: pipeline-manager # !Ref ServiceName
         MetricName: ERRORCount
 PipelineManagerLambdaERRORMetricAlarm:
   Type: 'AWS::CloudWatch::Alarm'
   DependsOn: PipelineManagerLambdaERRORMetricFilter
   Properties:
     AlarmDescription: Trigger the alarm when ERRORCount is greater than 0 for 5 consecutive minutes.
     Namespace: pipeline-manager # !Ref ServiceName
     MetricName: ERRORCount
     Statistic: Sum
     Period: '60'
     EvaluationPeriods: '5'
     ComparisonOperator: GreaterThanThreshold
     Threshold: '0'
     AlarmActions:
       - !Ref AlarmNotificationTopic

Other obvious reasons to shift towards infrastructure as code are the lack of version control and the long development cycles that come with manual deployments, as well as the need to reproduce our data pipelines easily in other environments and accounts.

Indeed, the majority of data services I deploy daily are data-intensive and need to be tested properly. The code is complex and development takes time.

We just can't afford any delays in either deployment or in detecting potential issues in our data pipelines.

Integration testing is another thing that becomes extremely complicated when we don't use infrastructure as code. It's great when we have a module or a library to spin up a service locally, but with many cloud vendors this is impossible, so we need to create an extra environment – a sandbox – to run tests in. A good example is pulling a Docker image with a MySQL instance and running it locally to perform an integration test. Now consider the same scenario with Snowflake as the data source: we don't have that luxury there.

I remember I stopped deploying my data pipelines using UI tools a while ago. Basically, because it turned my life into a nightmare when I needed to reproduce the data pipeline. Simple copy exercises became extremely difficult to cope with. It doesn't matter which cloud service provider it was. I work with AWS and GCP mainly.

Git, git and more git

Version control is a must in data pipeline development. It reduces the potential for human error, as the code can be easily unit-tested in continuous integration (CI) pipelines [7], and it makes changes easy to review and revert.

Continuous Integration and Deployment for Data Platforms

All these reasons made me a big fan of modern infrastructure as code frameworks such as CloudFormation (AWS) and Terraform from HashiCorp. The latter is platform-agnostic and my personal favourite due to its robust debugging and validation capabilities.

Main infrastructure as code tools

At the time of writing, there are two main infrastructure as code tools available on the market. Both AWS CloudFormation and Terraform help to improve the data engineering experience and make the data pipeline resources we create easy to maintain and reproduce across environments.

All data pipelines should be written in code

Both tools aim to deliver sustainable change management for any modern data stack resources we create. We can deploy anything – new users, credentials, service account keys, security roles and policies, cloud storage buckets, data delivery streams and RDS instances. Basically, everything in a cloud vendor's catalogue of infrastructure resources can be created with infrastructure as code.
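For example, here is a short Terraform sketch (the names are hypothetical) showing that users and access policies are provisioned just like any other resource:

# terraform/module/iam/users.tf (hypothetical)
# Service user for an ETL job, with read-only access to S3
resource "aws_iam_user" "etl_service" {
  name = "etl-service-user"
}

resource "aws_iam_user_policy_attachment" "etl_s3_read" {
  user       = aws_iam_user.etl_service.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}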

Designing data-intensive applications for real-world scenarios is a complex task. CloudFormation and Terraform both make it easier to reproduce pipeline resources using a modular approach. In the case of CloudFormation that would be a stack-based approach, but the idea is the same – simplify the deployment of the resources required by the data pipeline and make it declarative for each environment.

Imagine we have a typical data pipeline that consists of an AWS S3 storage bucket and a Lambda function that needs to be invoked whenever a new data file lands in this simplified version of a data lake. Using Terraform, we can describe these resources and deploy them to any environment:

# terraform/module/lambda/lambda.tf
resource "aws_lambda_function" "api_connector" {
  function_name = "some-api-connector-dev"

  s3_bucket = aws_s3_bucket.lambda_bucket.id
  s3_key    = aws_s3_object.lambda_api_connector.key

  runtime = "python3.9"
  handler = "app.lambda_handler"

  source_code_hash = data.archive_file.lambda_api_connector.output_base64sha256

  role = aws_iam_role.lambda_exec.arn
}

and the supporting resources the Lambda function needs, including the S3 bucket and access permissions, can be described like so:

# terraform/module/lambda/resources.tf
resource "random_pet" "lambda_bucket_name" {
  prefix = "api-connector-functions"
  length = 4
}

resource "aws_s3_bucket" "lambda_bucket" {
  bucket = random_pet.lambda_bucket_name.id
}

resource "aws_s3_bucket_ownership_controls" "lambda_bucket" {
  bucket = aws_s3_bucket.lambda_bucket.id
  rule {
    object_ownership = "BucketOwnerPreferred"
  }
}

resource "aws_s3_bucket_acl" "lambda_bucket" {
  depends_on = [aws_s3_bucket_ownership_controls.lambda_bucket]

  bucket = aws_s3_bucket.lambda_bucket.id
  acl    = "private"
}

data "archive_file" "lambda_api_connector" {
  type = "zip"

  source_dir  = "./../../api-connector"
  output_path = "./../../api-connector.zip"
}

resource "aws_s3_object" "lambda_api_connector" {
  bucket = aws_s3_bucket.lambda_bucket.id

  key    = "api-connector.zip"
  source = data.archive_file.lambda_api_connector.output_path

  etag = filemd5(data.archive_file.lambda_api_connector.output_path)
}

resource "aws_cloudwatch_log_group" "api_connector" {
  name = "/aws/lambda/${aws_lambda_function.api_connector.function_name}"

  retention_in_days = 30
}

resource "aws_iam_role" "lambda_exec" {
  name = "api_connector_lambda_role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Sid    = ""
      Principal = {
        Service = "lambda.amazonaws.com"
      }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "lambda_policy" {
  role       = aws_iam_role.lambda_exec.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

Our stack folder will look like this:

.
├── api-connector
│   └── app.py
├── api-connector.zip
├── terraform
│   ├── environment
│   │   └── dev
│   │       ├── backend.tf
│   │       ├── main.tf
│   │       ├── provider.tf
│   │       ├── variables.tf
│   │       └── versions.tf
│   └── module
│       └── lambda
│           ├── lambda.tf
│           ├── resources.tf
│           └── variables.tf
└── variables

So when we simply run terraform apply from the dev folder, Terraform will deploy all the required resources for our dev environment, i.e. the AWS Lambda function called api-connector-dev and all associated resources.
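For reference, the dev environment's main.tf can be as simple as a call to the Lambda module; this is a sketch, and the module name deployer_v1 matches the plan output shown further below:

# terraform/environment/dev/main.tf (simplified sketch)
module "deployer_v1" {
  source = "../../module/lambda"

  # Any inputs declared in module/lambda/variables.tf would be passed here,
  # e.g. environment = "dev" (hypothetical).
}

After terraform init in the same folder, terraform plan and terraform apply operate on exactly this set of modules.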

The terraform plan command displays the changes that are about to be applied:

Terraform will perform the following actions:

  # module.deployer_v1.aws_lambda_function.api_connector will be updated in-place
  ~ resource "aws_lambda_function" "api_connector" {
        id                             = "api-connector-dev"
      ~ last_modified                  = "2024-01-16T18:38:15.000+0000" -> (known after apply)
      ~ source_code_hash               = "Cys2bJUS5pa5poXdSkhIYEbrTxOAQJ3T/GOz/w4r2nM=" -> "LwQBYsLFStFSpm7JdDpoUhHV+7keulZklauMe/csOPk="
        tags                           = {}
        # (21 unchanged attributes hidden)

        # (2 unchanged blocks hidden)
    }

  # module.deployer_v1.aws_s3_object.lambda_api_connector will be updated in-place
  ~ resource "aws_s3_object" "lambda_api_connector" {
      ~ etag                   = "229e2cd7806bcbe3dbea54410b1461e2" -> "2469717ac466f9d8dc83e1e0e0846d45"
        id                     = "api-connector.zip"
        tags                   = {}
      + version_id             = (known after apply)
        # (10 unchanged attributes hidden)
    }

Plan: 0 to add, 2 to change, 0 to destroy.

In a similar way, we can create an S3 bucket for our data lake and configure event notifications so that AWS Lambda is invoked on any s3:ObjectCreated event (either directly or via an SQS queue).
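A hedged Terraform sketch of that wiring (resource names are illustrative) could look like this, with S3 invoking the Lambda function directly whenever an object is created:

# terraform/module/lambda/notifications.tf (illustrative sketch)
resource "aws_s3_bucket" "data_lake_landing" {
  bucket = "my-data-lake-landing-dev" # hypothetical bucket name
}

# Allow the S3 bucket to invoke our existing api_connector Lambda
resource "aws_lambda_permission" "allow_s3_invoke" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.api_connector.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.data_lake_landing.arn
}

# Fire the Lambda on every object created in the landing bucket
resource "aws_s3_bucket_notification" "data_lake_events" {
  bucket = aws_s3_bucket.data_lake_landing.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.api_connector.arn
    events              = ["s3:ObjectCreated:*"]
    filter_suffix       = ".json" # for example, only react to JSON files
  }

  depends_on = [aws_lambda_permission.allow_s3_invoke]
}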

The good thing is that Terraform is platform agnostic and can be used on any cloud platform.

This is the main difference compared to CloudFormation which was designed for AWS only.

What is the difference between Terraform and CloudFormation?

Long story short, Terraform is an open-source tool that simplifies resource provisioning in different cloud vendor environments while AWS CloudFormation was designed for AWS resources only.

I really like the terraform plan command, which helps me debug my data platform resource stacks.

In my experience, Terraform has more advanced dynamic features and built-in functions that make it easier to use for data engineering. For example, in Terraform we can use count and for_each to create multiple resources dynamically, which I find very useful.
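As a small illustration, here is a sketch of for_each (the source system names are made up) that creates one landing bucket per source system:

variable "source_systems" {
  type    = set(string)
  default = ["crm", "billing", "events"]
}

# One landing bucket per source system, all from a single resource block
resource "aws_s3_bucket" "landing" {
  for_each = var.source_systems
  bucket   = "data-lake-landing-${each.key}"
}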

AWS CloudFormation has some intrinsic functions, e.g. Fn::Split, which is great. However, the set looks a bit limited compared to Terraform's [7] list of String, Numeric, Collection, Date and Time, Crypto and Hash, Filesystem, IP Network, Encoding and Type Conversion functions.
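For illustration, a few of those Terraform functions in action (the values are hypothetical):

locals {
  raw_regions  = "eu-west-1,us-east-1"
  region_list  = split(",", local.raw_regions)  # string -> list
  report_names = [for r in local.region_list : format("daily-report-%s", r)]
  region_count = length(local.region_list)
}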

Ultimately it depends on your development stack but both tools are great for data engineering.

Conclusion

The development of data-intensive applications is not a trivial task. It requires accurate and reproducible resource provisioning, and modern data stacks contain multitudes of resources to provision. Data engineers deploy data pipelines in different environments to test ETL processes before they go live.

I design and deploy data pipelines daily, and infrastructure as code makes my deployments declarative and easy to maintain. I always know which resources belong to my application or to a particular data pipeline. Tools like Terraform and CloudFormation make change management far easier compared to resources deployed with no-code tools; they also greatly improve visibility and monitoring and help to identify potential data issues early.

If you are at the beginning of your career path and considering data engineering, then infrastructure as code is definitely something worth learning. It is a very useful skill that looks great on your CV. If you are a seasoned data engineer, it would be great to know what you think about modern data platform design, resource provisioning and the tools you use. Feedback is greatly appreciated.

I hope you find this article useful. Thanks for reading!

Recommended read:

[1] https://towardsdatascience.com/data-platform-architecture-types-f255ac6e0b7

[2] https://towardsdatascience.com/data-pipeline-design-patterns-100afa4b93e3

[3] https://towardsdatascience.com/modern-data-warehousing-2b1b0486ce4a

[4] https://towardsdatascience.com/big-data-file-formats-explained-275876dc1fc9

[5] https://towardsdatascience.com/data-pipeline-orchestration-9887e1b5eb7a

[6] https://pub.towardsai.net/orchestrate-machine-learning-pipelines-with-aws-step-functions-d8216a899bd5

[7] https://developer.hashicorp.com/terraform/language/functions

Tags: Analytics Big Data Data Engineering Data Science Infrastructure As Code
