Makefile Tutorial

Author: Murphy  |  Time: 2025-03-23 13:04:47
Photo by Nubelson Fernandes on Unsplash

Background

Data Scientists are now expected to write production code to deploy their Machine Learning algorithms. Therefore, we need to be aware of software engineering standards and methods to ensure our models are deployed robustly and effectively. One such tool that is very well known in the developer community is make. It is a powerful command-line tool that developers have relied on for decades, and in this article I want to show how it can be used to build efficient machine learning pipelines.

What Is Make?

make is a terminal command/executable, just like ls or cd, that is found in most Unix-like operating systems such as macOS and Linux.

The purpose of make is to simplify your workflow by breaking it down into logical groupings of shell commands.

It is used widely by developers and is also being adopted by Data Scientists as it simplifies the machine learning pipeline and enables more robust production deployment.

Why Make For Data Science?

make is a powerful tool that Data Scientists should be utilising for the following reasons:

  • Automate the setup of machine learning environments
  • Clearer end-to-end pipeline documentation
  • Easier to test models with different parameters
  • Obvious structure and execution of your project

What Is A Makefile?

A Makefile is the file that the make command reads and executes from. It has three components:

  • Targets: the files you are trying to build. A target that doesn't produce a file is declared PHONY and simply runs its commands.
  • Dependencies: targets or files that must be up to date before this target is executed.
  • Commands: as it says on the tin, the list of shell steps that produce the target.
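
In a Makefile, these three components fit together as follows (note that each command line must start with a tab character, not spaces):

```makefile
# General shape of a rule in a Makefile
target: dependency1 dependency2
	command1
	command2
```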

Basic Example

Let's run through a very simple example to make this theory concrete.

Below is a Makefile that has the target hello with the command echo to print 'Hello World' to the screen and it has no dependencies:

# Define our target as PHONY as it does not generate files
.PHONY: hello

# Define our target (the echo line must be indented with a tab)
hello:
	echo "Hello World!"

We can run this by simply executing make hello in the terminal which will give the following output:

echo "Hello World!"
Hello World!

It simply printed the command and then carried it out. This is the essence of make: there is nothing too complicated going on.

Notice that we declared the target hello as .PHONY because it doesn't produce a file. That is the meaning of .PHONY: only use it for targets that don't output a file.

We can add an @ symbol before the echo command if we don't want make to print the command itself to the screen; the command's output will still be shown.
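
For example, with the @ prefix on the recipe line, only the output of echo appears, not the command itself:

```makefile
.PHONY: hello
hello:
	@echo "Hello World!"
```

Running make hello now prints just Hello World! without the echo line above it.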

We can add another target in the Makefile to generate a file:

# Define some targets as PHONY as they do not generate files
.PHONY: hello

# Define our target
hello:
	echo "Hello World!"

# Define our target to generate a file
data.csv:
	touch data.csv

To run the data.csv target, we execute make data.csv:

touch data.csv

And you should notice a data.csv file in your local directory. Because data.csv is a real file target rather than a PHONY one, running make data.csv a second time does nothing: make sees the file already exists and reports that it is up to date.

Machine Learning Pipeline

Overview of a Pipeline

Below is an example pipeline for a machine learning project we will construct using make and a Makefile. It is based on a previous project where I built an ARIMA model to forecast US airline passenger numbers. You can read more about it here:

How To Forecast With ARIMA

Diagram by author.

So, the read_clean_data.py file will load in and make the time series data stationary. The model.py file will fit an ARIMA model to the cleaned data. Finally, the analysis.py file will compute the performance of our forecast.

Another key thing to notice here is the dependency between the files: analysis.py can't run unless model.py has been executed, and model.py in turn needs the output of read_clean_data.py. This is where the dependencies in the Makefile become useful.

Walkthrough

Below is our first file read_clean_data.py:

Data from Kaggle with a CC0 licence.

Here we read our US airline data and make it stationary through differencing and the Box-Cox transform and save it to a file in the local directory called clean_data.csv.
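
The original script is embedded as a gist; a minimal sketch of what read_clean_data.py might look like is below. The filename AirPassengers.csv and the column layout are assumptions for illustration, not confirmed by the article:

```python
# read_clean_data.py (sketch): load the passenger series, stabilise the
# variance with a Box-Cox transform, then difference to remove the trend.
import pickle

import pandas as pd
from scipy.stats import boxcox


def make_stationary(series: pd.Series):
    """Box-Cox transform then first-difference; returns (clean, lam)."""
    transformed, lam = boxcox(series)
    clean = pd.Series(transformed, index=series.index).diff().dropna()
    return clean, lam


if __name__ == "__main__":
    # Assumed file/column names for the Kaggle US airline passengers data
    data = pd.read_csv("AirPassengers.csv", index_col=0)
    clean, lam = make_stationary(data.iloc[:, 0])
    clean.to_csv("clean_data.csv")       # consumed by model.py
    with open("lam.pickle", "wb") as f:  # needed later to invert the transform
        pickle.dump(lam, f)
```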

Then, we have the model.py file:

And finally, we have our analysis file, analysis.py:
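
The analysis script is also a gist; below is a minimal sketch that scores the forecasts against the held-out data. Mean absolute error is an illustrative choice here, as the article doesn't state which metric it uses:

```python
# analysis.py (sketch): compare the forecasts against the held-out test data.
import numpy as np
import pandas as pd


def mean_absolute_error(actual, predicted) -> float:
    """Average absolute gap between the test series and the forecasts."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))


if __name__ == "__main__":
    test = pd.read_csv("test_data.csv", index_col=0).iloc[:, 0]
    forecasts = pd.read_csv("forecasts.csv", index_col=0).iloc[:, 0]
    print(f"MAE on the transformed scale: {mean_absolute_error(test, forecasts):.3f}")
    # the article's version also plots forecasts against actuals; a simple
    # matplotlib line plot of both series would reproduce that figure
```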

We can then code the following Makefile for our three stage pipeline:

.PHONY: all read_clean_data model analysis

all: analysis

read_clean_data:
	python read_clean_data.py

model: read_clean_data
	python model.py

analysis: model
	python analysis.py

.PHONY: clean
clean:
	rm -f clean_data.csv lam.pickle train_data.csv test_data.csv forecasts.csv

Notice how we have declared each step's dependency on the previous one to ensure the correct files exist before every step runs. We have also added the clean target to remove the generated files if needed.

The whole pipeline can run through the make all command, and the output will look like this:

Output:

python read_clean_data.py
python model.py
python analysis.py

And will generate the following plot:

Plot generated by author in Python.

As you can see, the Makefile pipeline worked and the forecasts look pretty good!

Summary & Further Thoughts

That's it! I hope you enjoyed this short tutorial on make and Makefile. Of course, there is more complexity and fancy things you can do with these tools, but this post can serve as your starting point. The key things to remember are:

  • make is a UNIX command that automates the running of certain workflows
  • A Makefile allows us to write several make commands and sequences to automate the machine learning pipeline

The full code used in this article is available on my GitHub here:

Medium-Articles/Software Engineering /make-example at main · egorhowell/Medium-Articles


Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.

Dishing The Data | Egor Howell | Substack

Connect With Me!

Tags: Coding Data Science Machine Learning Makefile Programming
