Streaming in Data Engineering

Streaming is one of the most popular data pipeline design patterns. Treating each event as a single data point creates a constant flow of data from one point to another, enabling real-time data ingestion and analytics. If you want to familiarise yourself with data streaming and learn how to build real-time data pipelines, this story is for you. You will learn how to test the solution and mock test data to simulate event streams. This article is a great opportunity to acquire some sought-after Data Engineering skills working with popular streaming tools and frameworks such as Kinesis, Kafka and Spark. I will cover the benefits, examples, and use cases of data streaming.


What exactly is data streaming?

Streaming data, also known as event stream processing, is a data pipeline design pattern in which data points flow continuously from the source to the destination. The data can be processed in real time, enabling analytics to act on data streams and events with minimal delay. Thanks to stream processing, applications can trigger immediate responses to new data events, which is why it is one of the most popular ways to process data at the enterprise level.

There is a data pipeline whenever there is data processing between points A and B [1].

Streaming data pipeline example. Image by author

In this example, we can create an ELT streaming data pipeline to AWS Redshift. An AWS Firehose delivery stream offers this type of seamless integration by feeding data directly into a data warehouse table. The data is then transformed to create reports with AWS QuickSight as a BI tool.

Let's imagine we need to create a reporting dashboard to display revenue streams in our company. In many scenarios, a business requirement is to generate insights in real-time. This is exactly the case when we would want to use streaming.

Data streams can be generated by various data sources, e.g. IoT devices, server data streams, marketing in-app events, user activity, payment transactions, etc. This data can arrive in different formats and often varies in volume. The idea of the streaming pattern is to apply ETL on the go and process event streams seamlessly.

Whenever we need to act on data with up-to-the-millisecond latency, streaming is the right way to go.

Consider the example below to better understand it. Applications typically use OLTP databases [4], MySQL for example. Your app is one of those, but you need this data in a data warehouse solution (DWH) such as Snowflake or BigQuery.

Data Modelling For Data Engineers

With a batch data pipeline, we would load data from MySQL into the DWH once a day, once an hour, every five minutes, etc. A streaming pipeline, by contrast, would use a dedicated system such as Kafka Connect, which processes data as soon as it lands in our database. A minimal sketch of how that could look is shown below.
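
As a purely illustrative, hedged sketch (the Connect URL, hostnames, credentials and table name below are placeholders, and the Debezium MySQL connector is just one common way to stream MySQL changes into Kafka; exact property names vary between Debezium versions), registering such a source connector via the Kafka Connect REST API could look roughly like this:

import requests

# Hypothetical Kafka Connect endpoint -- replace with your own.
CONNECT_URL = 'http://localhost:8083/connectors'

connector = {
    'name': 'mysql-orders-source',  # placeholder connector name
    'config': {
        # Debezium's MySQL connector turns row-level changes into Kafka events
        'connector.class': 'io.debezium.connector.mysql.MySqlConnector',
        'database.hostname': 'mysql.internal',  # placeholder host
        'database.port': '3306',
        'database.user': 'replicator',          # placeholder credentials
        'database.password': '********',
        'database.server.id': '184054',
        'topic.prefix': 'app',                  # Kafka topics are prefixed with this
        'table.include.list': 'shop.orders',    # stream only this table
        # A real deployment also needs schema-history settings and more.
    },
}

# Register the connector; Connect starts streaming changes as soon as it is created.
response = requests.post(CONNECT_URL, json=connector)
response.raise_for_status()
print(response.json())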

Popular data streaming tools

Let's take a look at popular data streaming platforms and frameworks that have proved most useful over the last couple of years.

  • Apache Spark – a framework for distributed data computing, used for large-scale analytics and complex data transformations
  • Apache Kafka – a real-time data pipeline tool with a distributed messaging system for apps
  • AWS Kinesis – a real-time streaming platform for analytics and applications
  • Google Cloud Dataflow – Google's streaming platform for real-time event processing and analytics pipelines
  • Apache Flink – a distributed streaming data platform designed for low-latency data processing.

Almost all of them are available as managed cloud-based services (AWS Kinesis, Google Cloud Dataflow) or have managed offerings, and can be seamlessly integrated with other services such as storage (S3), queuing (SQS, Pub/Sub), data warehouses (Redshift) or AI platforms. The open-source frameworks among them can be deployed on Kubernetes, Docker or Hadoop, and all aim to solve one problem – dealing with high-volume and high-velocity data streams.

Benefits of data streaming

The streaming data pipeline design pattern helps organisations proactively mitigate the impact of adverse business events caused by delays in data processing, e.g. losses from outages, customer churn and financial downturns. Given the complexity of today's business needs, conventional batch data processing is often a 'no go', as it can only process data as groups of transactions accumulated over time.

So here are some business advantages of using data streaming:

  • Increased customer satisfaction and, as a result, higher retention rates
  • Reduced operational losses, thanks to real-time insights into system outages and breaches
  • Increased return on investment as companies can now act faster on business data with increased responsiveness to customer needs and market trends.

The main technical advantage is that events are processed strictly one by one. Compared with batch processing, this gives better fault tolerance: if one event in the pipeline can't be processed for some reason, only that event is affected. In a batch pipeline, a whole chunk of data can fail to process because of a single data point with a wrong schema or incorrect data format. A minimal sketch of this per-event error handling is shown below.
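
To make this concrete, here is a minimal, hypothetical sketch of per-event processing with a dead-letter list: a malformed event is set aside instead of failing the whole run (the events and the validation rule are made up for illustration).

import json

# Hypothetical raw events -- the second one has broken JSON.
raw_events = [
    '{"event_name": "JOIN", "user": 42}',
    '{"event_name": "SUBSCRIBE"',           # malformed record
    '{"event_name": "LEAVE", "user": 7}',
]

processed, dead_letter = [], []

for raw in raw_events:
    try:
        event = json.loads(raw)
        if 'user' not in event:
            raise ValueError("missing 'user' field")
        processed.append(event)
    except (json.JSONDecodeError, ValueError) as err:
        # Only this event is affected; the rest of the stream keeps flowing.
        dead_letter.append({'raw': raw, 'error': str(err)})

print(f"processed={len(processed)}, dead_letter={len(dead_letter)}")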

The main disadvantage of streaming data pipelines is the cost

Each time our stream processor hits the endpoint, it requires compute power. As a result, streaming data processing typically leads to higher costs for this particular data pipeline than batch processing would.

Challenges in building streaming data pipelines

  • Fault tolerance – Can we design and build a data platform that handles the failure of a single data event gracefully? Data often comes from different sources, and it may even arrive in different formats. Data availability and durability become important considerations when designing a data platform [3] with a streaming component.

Data Platform Architecture Types

  • Queuing and ordering – Events in the data stream must be ordered correctly, otherwise data processing might fail. In-app messaging, for example, will not make sense if messages arrive out of order. See the consumer sketch after this list for how ordering is typically preserved within a shard or partition.
  • Scalability – Applications scale. It is as simple as that. Designing a data pipeline that responds well to an increased number of events coming from the source is not a trivial task. Being able to add more resources and data processing capacity to our data pipeline is a crucial component of a robust data platform.
  • Data consistency – Often in distributed data platforms data is being processed in parallel. This might become a challenge as data in one data processor could already be modified and become stale in another one.
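
To illustrate the ordering point from the list above, here is a hedged boto3 sketch (the stream name is a placeholder, and this uses a Kinesis Data Stream rather than the Firehose delivery stream built later in this article). Kinesis preserves record order within a shard, and records that share a partition key always land on the same shard, so a consumer can read a single shard in order:

import boto3

# Hypothetical Kinesis Data Stream name -- illustrative only.
STREAM_NAME = 'my-ordered-events'

kinesis = boto3.client('kinesis', region_name='eu-west-1')

# Records sharing a partition key go to the same shard, where order is preserved.
shards = kinesis.describe_stream(StreamName=STREAM_NAME)['StreamDescription']['Shards']

iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME,
    ShardId=shards[0]['ShardId'],
    ShardIteratorType='TRIM_HORIZON',   # read from the oldest available record
)['ShardIterator']

response = kinesis.get_records(ShardIterator=iterator, Limit=10)
for record in response['Records']:
    print(record['SequenceNumber'], record['Data'])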

A real-world example

Let's take a look at this example of a streaming data pipeline built with AWS Kinesis and Redshift.

Example pipeline. Image by author

Amazon Kinesis Data Firehose is an ETL service that collects, transforms, and distributes streaming data to data lakes, data storage, and analytics services with high reliability.

We can use it to stream data into Amazon S3 and convert it into the formats needed for analysis without having to develop processing pipelines. It is also useful for machine learning (ML) pipelines, where models can analyse the data and generate predictions via inference endpoints as streams flow to their destination.

Kinesis Data Streams vs Kinesis Data Firehose

Kinesis Data Streams is primarily focused on consuming and storing data streams. Kinesis Data Firehose is designed to deliver data streams to specific destinations. Both can consume data streams, but which one to use depends on where we want our streaming data to go.

AWS Kinesis Data Firehose allows us to redirect data streams into AWS data storage. Kinesis Data Firehose is the most straightforward method for gathering, processing, and loading data streams into AWS data storage.

Amazon Kinesis Data Firehose supports batching, encryption, and compression of streaming data, as well as automatic scaling to match the throughput of your data. Firehose integrates seamlessly with S3 data lakes, the Redshift data warehouse or the Amazon OpenSearch (formerly Elasticsearch) service.

AWS Kinesis Data Streams is the Amazon Kinesis real-time data streaming solution with exceptional scalability and durability, where data streams remain available 24/7 for any consumer. This makes it more expensive than Kinesis Data Firehose. The sketch below shows how the producer APIs of the two services differ.
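
To make the difference concrete, here is a hedged sketch of the two producer APIs side by side (the Data Stream name is a placeholder; the delivery stream name matches the CloudFormation template later in this article). Writing to a Data Stream requires a partition key, while Firehose only needs the record itself:

import json
import boto3

payload = json.dumps({'event_name': 'JOIN', 'user': 42})

# Kinesis Data Streams: we choose a partition key, and consumers read the shards themselves.
streams_client = boto3.client('kinesis', region_name='eu-west-1')
streams_client.put_record(
    StreamName='my-data-stream',             # placeholder Data Stream name
    Data=payload.encode('utf-8'),
    PartitionKey='user-42',                  # records with the same key keep their order
)

# Kinesis Data Firehose: no partition key; delivery to S3/Redshift/OpenSearch is managed for us.
firehose_client = boto3.client('firehose', region_name='eu-west-1')
firehose_client.put_record(
    DeliveryStreamName='my-event-staging',   # the delivery stream created later in this article
    Record={'Data': payload},
)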

How to create a Firehose resource using AWS Cloudformation

Consider this CloudFormation template below. It deploys the required resources including the Firehose we need.

AWSTemplateFormatVersion: 2010-09-09
Description: >
  Firehose resources relating to data generation.

Parameters:
  Environment:
    AllowedValues:
      - staging
      - production
    Description: Target environment
    Type: String
    Default: 'staging'
  DataLocation:
    Description: S3 data lake bucket name.
    Type: String
    Default: data.staging.aws

Resources:
  MyDataStream:
    Type: AWS::KinesisFirehose::DeliveryStream
    Properties: 
      DeliveryStreamName: !Sub 'my-event-${Environment}'
      DeliveryStreamType: DirectPut
      ExtendedS3DestinationConfiguration: 
        BucketARN: !Sub 'arn:aws:s3:::${DataLocation}' # For example: 'arn:aws:s3:::data.staging.aws'
        BufferingHints:
          IntervalInSeconds: 300
          SizeInMBs: 30
        CompressionFormat: UNCOMPRESSED
        Prefix: !Sub 'events/my-event-${Environment}/'
        RoleARN: !GetAtt AccessRole.Arn

  AccessRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - firehose.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      Policies:
        - PolicyName: !Sub '${AWS::StackName}-AccessPolicy'
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - s3:AbortMultipartUpload
                  - s3:GetBucketLocation
                  - s3:GetObject
                  - s3:ListBucket
                  - s3:ListBucketMultipartUploads
                  - s3:PutObject
                Resource:
                  - !Sub 'arn:aws:s3:::${DataLocation}'
                  - !Sub 'arn:aws:s3:::${DataLocation}/*'
                  # - 'arn:aws:s3:::data.staging.aws' # replace with your S3 datalake bucket
                  # - 'arn:aws:s3:::data.staging.aws/*'
              - Effect: Allow
                Action:
                  - kinesis:DescribeStream
                  - kinesis:GetShardIterator
                  - kinesis:GetRecords
                Resource:
                  - !Sub 'arn:aws:kinesis:${AWS::Region}:${AWS::AccountId}:stream/my-event-${Environment}'

It can be deployed in AWS using the AWS CLI tool. We need to run this on our command line (replace with unique bucket names in your account):

./deploy-firehose-staging.sh s3-lambda-bucket s3-data-lake-bucket

Our shell script would look like this:

#!/usr/bin/env bash
# chmod +x ./deploy-firehose-staging.sh
# Run ./deploy-firehose-staging.sh s3-lambda-bucket s3-data-lake-bucket
STACK_NAME=FirehoseStackStaging
LAMBDA_BUCKET=$1 #datalake-lambdas.aws # Replace with unique bucket name in your account
S3_DATA_LOCATION=$2 #data.staging.aws # S3 bucket to save data, i.e. datalake
# Deploy using AWS CLI:
aws cloudformation deploy \
    --template-file firehose_stack.yaml \
    --stack-name $STACK_NAME \
    --capabilities CAPABILITY_IAM \
    --parameter-overrides \
        "Environment"="staging" \
        "DataLocation"=$S3_DATA_LOCATION # "data.staging.aws"

Firehose resources created. Image by author

Now we would want to create an event producer. We can do it in Python and the code for our app.py would look like this:

import boto3

# Firehose client in the region where our delivery stream lives
kinesis_client = boto3.client('firehose', region_name='eu-west-1')
...
response = kinesis_client.put_record_batch(
    DeliveryStreamName='my-event-staging',  # the delivery stream created by our CloudFormation stack
    Records=[
        {
            'Data': b'bytes'
        },
    ]
)

The put_record_batch method writes multiple data records into a delivery stream in a single call, allowing for better throughput per producer than writing records one at a time. put_record is used to write a single data record into a delivery stream. Which one to use in this tutorial is up to you; a batched version with basic failure checking is sketched below.
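
For example, here is a hedged sketch of sending a small batch with put_record_batch and checking for partial failures (Firehose reports these via FailedPutCount in the response rather than raising an exception):

import json
import boto3

firehose = boto3.client('firehose', region_name='eu-west-1')

# A small batch of events; in a real producer these would come from get_data().
# The trailing '\n' newline-delimits records so they are easy to parse once delivered to S3.
records = [{'Data': json.dumps({'event_name': 'JOIN', 'user': i}) + '\n'} for i in range(5)]

response = firehose.put_record_batch(
    DeliveryStreamName='my-event-staging',   # delivery stream from the CloudFormation template
    Records=records,
)

# Firehose may accept only part of the batch, so always check FailedPutCount.
if response['FailedPutCount'] > 0:
    failed = [r for r in response['RequestResponses'] if 'ErrorCode' in r]
    print(f"{len(failed)} records failed, e.g. {failed[0].get('ErrorMessage')}")
else:
    print('All records delivered to Firehose.')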

We can generate some synthetic data in our app.py using this helper function below.

from datetime import datetime
import random

def get_data():
    '''Generate a random event record for the Firehose stream.'''
    return {
        'event_time': datetime.now().isoformat(),
        'event_name': random.choice(['JOIN', 'LEAVE', 'OPEN_CHAT', 'SUBSCRIBE', 'SEND_MESSAGE']),
        'user': round(random.random() * 100)}

Now this data can be sent by our event producer to the delivery stream like so:

    # STREAM_NAME and S3_DATA are defined at the top of app.py,
    # e.g. STREAM_NAME = 'my-event-staging' and S3_DATA = 'data.staging.aws';
    # app.py also needs `import json` for serialising events.
    processed = 0
    try:
        print('Sending events to Firehose...')
        for i in range(0, 5):
            data = get_data()
            print(i, " : ", data)
            kinesis_client.put_record(
                DeliveryStreamName=STREAM_NAME,
                Record={
                    "Data": json.dumps(data)
                }
            )
            processed += 1
        print('Wait for 5 minutes and run to download: aws s3 cp s3://{}/events/ ./ --recursive'.format(S3_DATA))
        # For example: aws s3 cp s3://data.staging.aws/events/ ./ --recursive
    except Exception as e:
        print(e)

Done! We have created a simple streaming data pipeline that delivers buffered event data into cloud storage (AWS S3).
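
As an optional check, here is a sketch (assuming the bucket and prefix from the CloudFormation template above) that lists and reads the objects Firehose has delivered to S3 with boto3:

import boto3

# Bucket and prefix from the CloudFormation template above (DataLocation and Prefix).
BUCKET = 'data.staging.aws'
PREFIX = 'events/my-event-staging/'

s3 = boto3.client('s3', region_name='eu-west-1')

# List objects Firehose has delivered so far (it buffers for up to 5 minutes / 30 MB).
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)

for obj in listing.get('Contents', []):
    body = s3.get_object(Bucket=BUCKET, Key=obj['Key'])['Body'].read().decode('utf-8')
    # Firehose concatenates records as-is; parsing is easiest if the producer
    # appended a newline after each event.
    print(obj['Key'], body[:200])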

Run python app.py in your command line:

Events connector example. Image by author

Check my tutorial below for a more advanced data pipeline example [2].

Building a Streaming Data Pipeline with Redshift Serverless and Kinesis

Conclusion

There is no single ideal streaming data platform for every project. The streaming design has its benefits, but it also comes with some obvious challenges. Choosing a streaming tool is not easy; it depends on your business goals and functional data requirements. You will want to try and compare multiple streaming platforms on characteristics such as functionality, performance, cost, simplicity of use, and compatibility. Is it going to be a machine learning pipeline? Do we need to work with partitions, windows and joins? Do we need high throughput, fault tolerance or low latency?

Different streaming frameworks have different capabilities; for example, Kafka Streams offers handy session windowing that can be easily integrated into your analytics pipeline.

What frequency of data delivery and consumption do we need in our pipeline? Is it going to be a delivery into a DWH solution or into a data lake? Some platforms can offer better integration features than others.

Another essential element to consider is the type and complexity of data processing and analysis that must be performed on your streaming data.

I would recommend creating a prototype based on your own data pipeline scenario and requirements gathered from the main stakeholders inside the company. The optimal streaming data pipeline would be the one that adds value to the business and also meets your data engineering goals.

Recommended read:

[1] https://towardsdatascience.com/data-pipeline-design-patterns-100afa4b93e3

[2] https://towardsdatascience.com/building-a-streaming-data-pipeline-with-redshift-serverless-and-kinesis-04e09d7e85b2

[3] https://medium.com/towards-data-science/data-platform-architecture-types-f255ac6e0b7

[4] https://medium.com/towards-data-science/data-modelling-for-data-engineers-93d058efa302
