What Does it Take to Get into Data Engineering in 2024?

AI-generated image using Kandinsky

If you are reading this, you have probably been considering a career change lately. I am assuming that you want to learn something close to software engineering and database design. It doesn't matter what your background is: marketing, analytics or finance, you can do this! This story is meant to help you find the fastest way into the data space. Many years ago I did the same and have never regretted it since. The technology space, and especially data, is full of wonders and perks: remote working, generous benefit packages from the leading IT companies, and the feeling of being capable of doing magic with files and numbers. In this story, I'll try to summarise a set of skills and possible projects that could be accomplished within a two- to three-month timeframe. Imagine: just a few months of active learning and you are ready for your first job interview.


Any sufficiently advanced technology is indistinguishable from magic.

Why data engineering and not data science?

Indeed, why not Data Analytics or Data Science? I think the answer resides in the nature of the role, as it combines the most difficult parts of both worlds. To become a data engineer you need to learn software engineering and database design, understand Machine Learning (ML) models, and be comfortable with data modelling and Business Intelligence (BI) development.

Data Engineering is the fastest-growing job according to DICE. Their research shows there is a clear skills gap, so be quick.

Source: DICE

While Data Scientist has been considered the "sexiest" job on the market for a long time, it now seems there is a certain shortage of Data Engineers. I can see massive demand in this area, and not only for experienced and highly qualified engineers but also for entry-level roles. Data engineering has been one of the fastest-growing careers in the UK over the last five years, ranking 13th on LinkedIn's list of the most in-demand jobs in 2023 [1]. On average I get around four job interview invites every week, and entry-level data engineers would be invited even more often.

Because Data Engineering is so complex, salaries and benefit packages look a lot better than in other technology areas. I know a lot of great software engineers who prefer to stay away from data because it looks like a set of tedious and boring data manipulation tasks. That makes data engineering a lucrative goal for those who are eager to learn data platforms and data pipeline design patterns. Data engineering is all about data manipulation and how we orchestrate the process: data must be cleansed, tested, approved and delivered to end users in a timely manner. This is why ML and BI rely on it so heavily.

This is why it pays very well and it won't be boring


Data Engineering might seem overwhelming

Getting into Data Engineering without a STEM background can be very challenging. Coding is not a trivial task per se, and in my experience database design and data pipeline orchestration are even harder to comprehend at first. Many years ago I graduated with a master's degree in Quantitative Finance and worked as an Analytics Manager. I remember the day I decided to learn coding properly: not just to the level I reached at university, but well enough to apply my skills to solve real-world problems.

I remember it was a real struggle, as I had to do my day job and could learn software engineering only in my spare time.

My transition from Analytics Manager to Software Engineer was a real struggle.

I remember hunting for projects on Fiverr and PeoplePerHour just to see what companies were after in terms of data. Looking back, it helped me a lot to understand the real pain points of many customers, and it was probably the most efficient way to learn.

So my first piece of advice to all aspiring data practitioners is to believe.

Believe in yourself!

Getting into the data engineering space might seem overwhelming, but it's worth it. Don't be shy about asking people who write; Medium is a great place for that. Why not check the topic you are after and see who to follow?

https://medium.com/tag/data-engineering/who-to-follow

The plan

All you need is to take a break and think a little about whether you really want this or not. If the answer is YES, then all you need is a plan. Speed is not the goal here; what we want now is to write down a feasible plan for how to get into Data Engineering, without any pain if possible.

Enter the Data Engineering in 2024

For now, let's just focus on the next couple of months and things we need to learn or recap.

The Habit of Data Engineering

We want to acquire this habit by actually learning during the first two weeks. Do it in little steps, but do it consistently, so it becomes a habit. For instance, this is what I did while getting ready for my Google Professional Data Engineer exam [2]. It was difficult, but I was reading every day while cycling in the gym. I did it in the morning, as that was the most productive time of day for me. The article from 2020 is still valid, as not that much has actually changed and the preparation was mostly about the basic principles of data engineering. Of course, there were a lot of product-specific questions, but the article is a guide on how to learn fast.

Consistency is key

How I passed the Google Professional Data Engineer Exam in 2020

Data Engineering is all about these areas of technology:

  • ETL and data extraction
  • Data manipulation and data modelling (often with SQL)
  • Testing the pipelines
  • Testing the data
  • Reporting and BI
  • MLOps and ML pipelines
  • Orchestrating all this madness

The first 1–2 weeks: SQL

Let's focus on SQL first. Even though it is not the first item on our list, it is the most universal one in my opinion. SQL has been so widely used in data modelling that it can now be considered the standard for data manipulation. All we need to do during the first two weeks is run different SQL queries and try to imagine which data pipelines they could be used in. Things we would want to recap here might be the following:

  • How to create a table using SQL
  • How to use common table expressions
  • How to mock data using SQL
  • How to update the table using incremental strategies
  • How to test data quality and cleanse the data

These questions might seem overwhelming, but there are plenty of great and simple examples which, coupled with some free data warehouse solutions, can help us create a fairly simple and productive sandbox. I wrote about it in one of my previous stories; SQL-wise, it's all you really need in your day-to-day data engineering [3]:

Advanced SQL techniques for beginners

Even the most difficult topics, like MERGE, can be easily explained when the SQL mocks some data in a temp table or CTE:

create temp table last_online as (
    select 1 as user_id
    , timestamp('2000-10-01 00:00:01') as last_online
)
;
create temp table connection_data  (
  user_id int64
  ,timestamp timestamp
)
PARTITION BY DATE(_PARTITIONTIME)
;
insert connection_data (user_id, timestamp)
    select 2 as user_id
    , timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
    select 1 as user_id
        , timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
    select 1 as user_id
        , timestamp_sub(current_timestamp(),interval 20 hour) as timestamp
union all
    select 1 as user_id
    , timestamp_sub(current_timestamp(),interval 1 hour) as timestamp
;

merge last_online t
using (
  select
      user_id
    , last_online
  from
    (
        select
            user_id
        ,   max(timestamp) as last_online

        from 
            connection_data
        where
            date(_partitiontime) >= date_sub(current_date(), interval 1 day)
        group by
            user_id

    ) y

) s
on t.user_id = s.user_id
when matched then
  update set last_online = s.last_online, user_id = s.user_id
when not matched then
  insert (last_online, user_id) values (last_online, user_id)
;
select * from last_online
;

Weeks 3–4: Modern data stack

I would recommend reading a few stories about the modern data stack and data platform architecture types [4], as this gives a good strategic overview of the various tools and frameworks used in data engineering. This knowledge also becomes useful during a job interview, as it tells the recruiter that you are technologically savvy. You might not know all the tools, and you don't need to, but "Are you in the space?" is the most fascinating question I can remember from one of my first meetings with a potential employer. Here we want to demonstrate our awareness of the most recent events in tech (IPOs, mergers and acquisitions), developments and new tools. Simply mentioning that you have heard about DuckDB or Polars tells people that you are curious and passionate.

Data Platform Architecture Types

Are you in the space?

It is easy to get lost in the abundance of data tools available on the market right now. I remember we talked about Snowflake back then and how successful that IPO was. I think that helped a lot, or at least the interviewer and I were on the same page. We discussed the Modern Data Stack and the things that make it modern, robust and cost-effective. To put it simply, it is a collection of tools used to work with data. Depending on what we are going to do with the data, these tools might include the following:

  • a managed ETL/ELT data pipeline service
  • a cloud-based managed data warehouse / data lake as a destination for the data
  • a data transformation tool
  • a business intelligence or data visualization platform
  • machine learning and data science capabilities

We already learned a few things about how to transform and manipulate data using SQL during the first two weeks. Now we can wrap that up with this strategic knowledge of how and where to apply it.

Modern Data Stack. Image by author

Weeks 5–6: Python basics

Let's recap or learn a bit of actual coding. Python is definitely the easiest way in: it is a scripting language, has a lot of useful libraries and is fairly easy to read. All of this makes it a popular choice of programming language in Data Engineering. We want to focus on basic programming concepts like loops, functions, conditionals, error handling and data structures; in data engineering you will use these a lot.
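
As a warm-up, here is a tiny, hypothetical snippet (the data and names are made up) that touches a loop, a function, a conditional, error handling and a basic data structure in one go:

# basics.py - a made-up warm-up example covering core Python concepts
raw_rows = ['10', '25', 'oops', '42']           # a data structure: a list of strings

def parse_row(value):
    '''Convert a string to an integer, returning None for bad records.'''
    try:                                        # error handling
        return int(value)
    except ValueError:
        return None

clean_rows = []
for row in raw_rows:                            # a loop
    parsed = parse_row(row)                     # a function call
    if parsed is not None:                      # a conditional
        clean_rows.append(parsed)

print(clean_rows)                               # prints [10, 25, 42]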

I would recommend starting with data APIs and requests. Combining this knowledge with Cloud services gives us a very good foundation for any ETL processes we might need in the future.

A typical data pipeline [5] is a chain of Python functions (or operators) and would look like this:

Data pipeline. Image by author

There is always a data pipeline when data is processed between source and destination.

We process data using Python functions, and this can finally result in a pipeline like this, for example:

ETL pipeline example. Image by author
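
In code, a minimal sketch of such a chain might look like the snippet below. The function and field names are illustrative; a real pipeline would call an API and write to a warehouse instead of printing:

# pipeline.py - an illustrative extract -> transform -> load chain
def extract():
    # Stand-in for an API call or a read from cloud storage
    return [{'user_id': 1, 'name': 'ALICE'}, {'user_id': 2, 'name': 'BOB'}]

def transform(rows):
    # A simple cleansing step: lowercase the names
    return [{**row, 'name': row['name'].lower()} for row in rows]

def load(rows):
    # Stand-in for writing to a data warehouse or data lake
    for row in rows:
        print('loading', row)

def run_pipeline():
    load(transform(extract()))

if __name__ == '__main__':
    run_pipeline()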

More examples can be found here:

Data pipeline design patterns

API requests

Understanding how API requests work is important because this is the main way any ETL service interacts with other services to extract data. Data engineers use this a lot: they hit an API service with a request asking for data, then paginate or stream the response to actually transform (ETL) the data. Consider the example below. It explains how to extract data from the NASA Asteroids API. It is a very simple example, and that is why it is so good for learning.

# nasa.py
import requests
session = requests.Session()

url="https://api.nasa.gov/neo/rest/v1/feed"
apiKey="your_api_key"
requestParams = {
    'api_key': apiKey,
    'start_date': '2023-04-20',
    'end_date': '2023-04-21'
}
response = session.get(url, params = requestParams, stream=True)
print(response.status_code)

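Many APIs return data page by page, so a data engineer often loops over pages until the response comes back empty. The sketch below is hypothetical: the page and per_page parameters are assumptions, and a real API will document its own pagination scheme:

# pagination.py - a hypothetical pagination loop (parameter names are assumed)
import requests

session = requests.Session()

def fetch_all_pages(url, page_size=100):
    page = 1
    while True:
        response = session.get(url, params={'page': page, 'per_page': page_size})
        response.raise_for_status()
        batch = response.json()
        if not batch:                 # an empty page means we reached the end
            break
        yield from batch              # hand records to the next pipeline step
        page += 1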
More advanced and fairly runnable examples can be found in this story [6]:

Python for Data Engineers

Weeks 7–8: Extract – Load

Once we have learned a bit of Python and SQL, we can actually extract data and save it somewhere in the cloud. Cloud service providers like AWS, GCP and Azure are the market leaders, and familiarising yourself with at least one of them is a must. So now we want to build our first data pipeline. It can be a simple function that extracts NASA Asteroids data and saves it in AWS S3. That's it! Very simple, but this is our first data pipeline, and it can be scheduled to run daily, hourly, etc. It can be deployed as a serverless microservice and will run practically for free, extracting and preserving the data in cloud storage. We can easily deploy it using the AWS web UI. However, the preferred way of deploying any service is Infrastructure as Code. That topic is difficult to comprehend per se, and I wouldn't recommend deep diving into it if you are a beginner.
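
A minimal sketch of such a function, written as an AWS Lambda handler, could look like the one below. The bucket name and API key are placeholders, and the requests library would have to be packaged with the Lambda:

# extract_to_s3.py - a sketch of an extract-and-save Lambda (names are placeholders)
import json
import boto3
import requests

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Extract: call the NASA Asteroids API
    response = requests.get(
        'https://api.nasa.gov/neo/rest/v1/feed',
        params={'api_key': 'your_api_key',
                'start_date': '2023-04-20',
                'end_date': '2023-04-21'},
    )
    response.raise_for_status()

    # Load: save the raw JSON to an S3 bucket (placeholder name)
    s3.put_object(
        Bucket='your-nasa-asteroids-bucket',
        Key='nasa/asteroids_2023-04-20.json',
        Body=json.dumps(response.json()),
    )
    return {'statusCode': 200}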

Just keep in mind that anything we deploy in the cloud using a web user interface can also be deployed using shell scripting.

During the next couple of weeks, I would recommend focusing on command-line tools and learning a few tricks to deploy cloud functions and provision resources in the cloud.

Scripts help to avoid human errors and save a lot of time.

We can deploy such an ETL service as a simple AWS Lambda like so:

# Deploy packaged Lambda using AWS CLI:
aws \
lambda create-function \
--function-name etl-service-lambda \
--zip-file fileb://stack.zip \
--handler app.lambda_handler \
--runtime python3.12 \
--role arn:aws:iam:::role/my-lambda-role

# # If already deployed then use this to update:
# aws --profile mds lambda update-function-code \
# --function-name mysql-lambda \
# --zip-file fileb://stack.zip;

For instance, we can use the AWS CLI to invoke our ETL service like so:

aws lambda invoke \
    --function-name etl-service-lambda \
    --payload '{ "data": "value" }' \
    response.json

Some decent code examples with a packaged Lambda can be found here [7]:

Building a Batch Data Pipeline with Athena and MySQL

Now that the data is in the cloud, we can load it into a data warehouse tool. I recommend BigQuery because it charges around $5 per TB of data scanned and the first terabyte of queries each month is free, so working with test data will cost us nothing. We can load the files from Cloud Storage into a table like so [8]:

# Assuming the data is stored in Google Cloud Storage
LOAD DATA INTO source.nasa_asteroids
FROM FILES(
  format='JSON',
  uris = ['gs://nasa-asteroids-data/*']
)

You might also want to try this article to learn the basics of the AWS CLI, for example [9]:

Mastering AWS CLI

Weeks 9–10: Testing and orchestration

Here we want to learn the basics of unit testing in Python and how to orchestrate our ETL service. We have already learned how to create Python functions to perform API calls and extract data, and maybe even how to transform it in Python using examples from Python for Data Engineers. Now we want to test the data transformation logic we apply. Unit testing is an essential skill in software engineering and saves a lot of time in the long run, as it helps keep our code tested and maintained. Long story short, we want to learn the basics of the Pytest module. For example, we need to be able to test the logic of an ETL function like the one below:

# etl.py
def etl(item):
    # Do some data transformation here
    return item.lower()

We can do this simply by saving the test below next to etl.py and running the pytest command [10]:

# etl_test.py
from etl import etl

def test_etl_returns_lowercase():
    assert etl('SOME_UPPERCASE') == 'some_uppercase'

A Guide To Data Pipeline Testing with Python

Consider another example with the unittest library:

# ./prime.py
import math

def is_prime(num):
    '''Check if num is prime or not.
    '''
    if num < 2:
        return False
    for i in range(2, int(math.sqrt(num)) + 1):
        if num % i == 0:
            return False
    return True

How do we test the logic inside this function?

unittest makes it simple. The test would be the following:

# ./test.py
import unittest
from prime import is_prime

class TestPrime(unittest.TestCase):

    def test_thirteen(self):
        self.assertTrue(is_prime(13))

if __name__ == '__main__':
    unittest.main()

Now that we know the basics of testing, we can be confident that our functions return the data they should. The next step is to deploy an extra microservice to orchestrate the ETL process. In a nutshell, it can be a simple AWS Lambda function, or any other serverless application, able to invoke our ETL service on a schedule. This is very simple and we don't want to overcomplicate things here. Let's deploy another Python Lambda function and schedule it to run daily or hourly. We can use an AWS EventBridge rule with a cron schedule for that. Our orchestrator Lambda code can look like the one below.

The simplest data pipeline orchestration example

import json
import boto3

# AWS Lambda Client to invoke another Lambda
client = boto3.client('lambda')

def lambda_handler(event,context):

    # Some data to pass to another Lambda
    data = {
        "ProductName"   : "iPhone SE"
    }

    response = client.invoke(
        FunctionName = 'arn:aws:lambda:eu-west-1:12345678:function:etl-service-lambda',
        InvocationType = 'RequestResponse',
        Payload = json.dumps(data)
    )

    response = json.load(response['Payload'])

    print('\n')
    print(response)

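The schedule itself can also be created programmatically. Below is a rough boto3 sketch using an EventBridge cron rule; the rule name, cron expression and Lambda ARN are placeholders, and the Lambda additionally needs a resource-based permission allowing EventBridge to invoke it:

# schedule.py - a sketch of scheduling the orchestrator with EventBridge (placeholders)
import boto3

events = boto3.client('events')

# Run the orchestrator every day at 06:00 UTC
events.put_rule(
    Name='daily-etl-trigger',
    ScheduleExpression='cron(0 6 * * ? *)',
    State='ENABLED',
)

# Point the rule at the orchestrator Lambda
events.put_targets(
    Rule='daily-etl-trigger',
    Targets=[{
        'Id': 'orchestrator-lambda',
        'Arn': 'arn:aws:lambda:eu-west-1:12345678:function:orchestrator-lambda',
    }],
)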
If you are keen to go further, there is an advanced tutorial with AWS Step Functions and Infrastructure as Code [11]:

Data Pipeline Orchestration

Weeks 11–12: ML basics

Learn some ML basics. We know how to extract data and perform ETL, and we know how to load it into the data warehouse.

Yes, it's time to proceed to ML.

Consider the tutorial below. It explains how to work with user churn and predict users' propensity to churn using behavioural data. It can be completed in just a few hours, but if you want to go deeper into the nitty-gritty of retention it might take longer. We don't need to know every machine learning model, and we can't compete with cloud service providers such as Amazon and Google in machine learning and data science, but we do need to know how to use their services. There are numerous managed ML services provided by cloud vendors, and we want to familiarise ourselves with them. Data engineers prepare datasets for these services, so it will definitely be useful to do a couple of tutorials on this.
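
Just to get a feel for the basics, here is a minimal, hypothetical propensity-to-churn baseline with scikit-learn. The behavioural features and labels are made up purely for illustration; the tutorial linked below goes much deeper:

# churn_sketch.py - a made-up churn baseline, for intuition only
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Each row: [sessions_last_30d, days_since_last_visit]; label 1 means churned
X = np.array([[20, 1], [15, 2], [2, 25], [1, 40], [12, 3], [0, 60]])
y = np.array([0, 0, 1, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Propensity to churn for the held-out users
print(model.predict_proba(X_test)[:, 1])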

I think it's a good example of built-in Machine Learning capabilities

User Churn Prediction

Conclusion

Don't feel like your goal is to learn Data Engineering as quickly as possible or to become an expert in the area. For many people it takes years to master a field, so my advice is to focus on what can be accomplished in a couple of months while learning over the weekends. Data engineers must have a good knowledge of ETL/ELT techniques and data modelling, and must also be able to code, at least in Python. In this story, I tried to outline a 12-week plan for learning data engineering in the most efficient way possible. I hope you enjoyed it.

How to Become a Data Engineer

Recommended read

[1] https://www.linkedin.com/pulse/linkedin-jobs-rise-2023-25-uk-roles-growing-demand-linkedin-news-uk/

[2] https://towardsdatascience.com/how-i-passed-google-professional-data-engineer-exam-in-2020-2830e10658b6

[3] https://towardsdatascience.com/advanced-sql-techniques-for-beginners-211851a28488

[4] https://towardsdatascience.com/data-platform-architecture-types-f255ac6e0b7

[5] https://towardsdatascience.com/data-pipeline-design-patterns-100afa4b93e3

[6] https://towardsdatascience.com/python-for-data-engineers-f3d5db59b6dd

[7] https://towardsdatascience.com/building-a-batch-data-pipeline-with-athena-and-mysql-7e60575ff39c

[8] https://medium.com/towards-artificial-intelligence/when-your-stack-is-a-lake-house-6bcb17f9bff6

[9] https://medium.com/geekculture/mastering-aws-cli-5454ad5e685c

[10] https://towardsdatascience.com/a-guide-to-data-pipeline-testing-with-python-a85e3d37d361

[11] https://towardsdatascience.com/data-pipeline-orchestration-9887e1b5eb7a
