How to Write Clean Code in Python


Writing clean code is not just a nice-to-have. It is a requirement whenever you write production-ready code.

As a data scientist, I primarily worked with Jupyter Notebooks, aiming to develop a model that works with the available data. In the beginning, it was all about showing that, in general, AI can provide value given the data.

But as soon as that was proven, the model needed to be put into production. And this is where the pain started.

Most of the code was ugly, hard to read, and hard to maintain. As a data scientist, I honestly didn't care much.

But now that I work as a machine learning engineer, nothing is more important than writing clean code whenever the code is meant to be reused and put into production.

This is why I read the book Clean Code: A Handbook of Agile Software Craftsmanship. This book is the manifesto for writing clean code. Its principles are applicable to all programming languages, even though the book's examples are written in Java.

In this article, I highlight the most important clean code rules and adapt them to Python so that you can relate them directly to your day-to-day coding.

For each principle, I will provide small code snippets to explain the principles better and show you how to do things and how not to do things.

I hope this article provides value to everyone working with Python, and especially motivates other data scientists to start writing clean code. It eases their own life and the life of the machine learning engineer, because putting the models and the provided code into production becomes much easier for both.

Meaningful Names

This part should be obvious, but many developers still don't follow it.

Create meaningful names!

Everyone should immediately understand what is happening when reading your code. Inline comments describing what the code does or what a variable stands for should not be necessary.

If the name is descriptive, it should be more than clear what the function is doing.

Let's look at a typical example in machine learning: Loading a dataset and splitting it into train and test splits:

import pandas as pd
from sklearn.model_selection import train_test_split

def load_and_split(d):
    df = pd.read_csv(d)
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(X, 
                                                        y, 
                                                        test_size=0.2, 
                                                        random_state=42)
    return X_train, X_test, y_train, y_test

Most people coming from Data Science know what is happening here, and they also know what X is and what y is. But what about people new to the field?

Is it good practice to name the path to the CSV file just with d?

And is it good practice to name the features X and the target y?

Let's look at an example with more meaningful names:

import pandas as pd
from sklearn.model_selection import train_test_split

def load_data_and_split_into_train_test(dataset_path):
    data_frame = pd.read_csv(dataset_path)
    features = data_frame.iloc[:, :-1]
    target = data_frame.iloc[:, -1]
    features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=42)
    return features_train, features_test, target_train, target_test

This is much easier to understand. Even without prior experience with pandas or train_test_split, it is now clear that this function loads data from the CSV file located at dataset_path, extracts the features and the target from the data frame, and then splits them into a training set and a test set.

These changes make the code more readable and understandable, especially for someone who might not be familiar with the conventions of machine learning code, where features are usually named X and targets y.

But please don't exaggerate either, with names that don't add any information.

Let's look at another example code snippet:

import pandas as pd
from sklearn.model_selection import train_test_split

def load_data_from_csv_and_split_into_training_and_testing_sets(dataset_path_csv):
    data_frame_from_csv = pd.read_csv(dataset_path_csv)
    features_columns_data_frame = data_frame_from_csv.iloc[:, :-1]
    target_column_data_frame = data_frame_from_csv.iloc[:, -1]
    features_columns_data_frame_for_training, features_columns_data_frame_for_testing, target_column_data_frame_for_training, target_column_data_frame_for_testing = train_test_split(features_columns_data_frame, target_column_data_frame, test_size=0.2, random_state=42)
    return features_columns_data_frame_for_training, features_columns_data_frame_for_testing, target_column_data_frame_for_training, target_column_data_frame_for_testing

What do you feel when looking at that code snippet?

Is it required to include that the function loads from CSV? And that the dataset path is a path to a CSV file?

This code snippet contains far too much detail without delivering any additional information. It only distracts the reader.

So, choosing meaningful names is an act of balancing descriptiveness with brevity.

Functions

Let's now come to functions.

The first rule of functions is that they should be small. The second rule of functions is that they should be smaller than that [1].

This is really important! Functions should be small, ideally no longer than about 20 lines. If a function contains large blocks that consume a lot of space, extract them into a new function.

Another essential principle is that a function should do ONE thing, and not more. If it does more, extract the second thing into a new function.

Let's again take a look into a small example:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_clean_feature_engineer_and_split(data_path):
    # Load data
    df = pd.read_csv(data_path)

    # Clean data
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]

    # Feature engineering
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18

    # Data preprocessing
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

    # Split data
    features = df.drop('Survived', axis=1)
    target = df['Survived']
    features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=42)
    return features_train, features_test, target_train, target_test

Can you already spot violations of the above-mentioned rule?

This function is not long but clearly violates the rule that a function should do one thing.

Also, the comments indicate that these blocks of code belong in separate functions; the one-line comments would then not be required at all (more on this in the next section).

So, let's take a look at a refactored example:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data(data_path):
    return pd.read_csv(data_path)

def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df

def feature_engineering(df):
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df

def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df

def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)

if __name__ == "__main__":
  data_path = 'data.csv'
  df = load_data(data_path)
  df = clean_data(df)
  df = feature_engineering(df)
  df = preprocess_features(df)
  X_train, X_test, y_train, y_test = split_data(df)

In this refactored code snippet, each function does only one thing, which makes the code easier to read. Testing also becomes easier, as each function can be tested in isolation.

And even the comments are not required anymore, as the function names now act as comments themselves.

Comments

Comments are helpful sometimes. But sometimes, they are just an indication of bad code.

The proper use of comments is to compensate for our failure to express ourselves in code [1].

Whenever you are about to add a comment, ask yourself whether it is really required, or whether you could instead extract the code into a new function whose name makes clear what is happening, so that the comment becomes unnecessary.

Let's revise the code example from the functions chapter before:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_clean_feature_engineer_and_split(data_path):
    # Load data
    df = pd.read_csv(data_path)

    # Clean data
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]

    # Feature engineering
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18

    # Data preprocessing
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

    # Split data
    features = df.drop('Survived', axis=1)
    target = df['Survived']
    features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=42)
    return features_train, features_test, target_train, target_test

As you can see, it contains a few comments describing what happens in each code block.

However, the comments here are just an indicator of bad code! As described in the chapter before, putting these code blocks into separate functions and giving each function a descriptive name leads to better readability and makes the comments unnecessary:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data(data_path):
    return pd.read_csv(data_path)

def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df

def feature_engineering(df):
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df

def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df

def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)

if __name__ == "__main__":
  data_path = 'data.csv'
  df = load_data(data_path)
  df = clean_data(df)
  df = feature_engineering(df)
  df = preprocess_features(df)
  X_train, X_test, y_train, y_test = split_data(df)

The main function now reads like a small story, and it is clear what is happening without needing comments.

But there is now one last piece missing: Docstrings

Docstrings are the Python standard for documenting code with the aim of making it readable and understandable.

Every function in production code should contain a docstring describing its intent, its input arguments, and its return value.

Docstrings are also consumed directly by tools like Sphinx, which generate documentation from the code.

Let's now add the docstrings to the above code snippet:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data(data_path):
    """
    Load data from a CSV file into a pandas DataFrame.

    Args:
      data_path (str): The file path to the dataset.

    Returns:
      DataFrame: The loaded dataset.
    """
    return pd.read_csv(data_path)

def clean_data(df):
    """
    Clean the DataFrame by removing rows with missing values and 
    filtering out non-positive ages.

    Args:
      df (DataFrame): The input dataset.

    Returns:
      DataFrame: The cleaned dataset.
    """
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df

def feature_engineering(df):
    """
    Perform feature engineering on the DataFrame, including age 
    grouping and adult identification.

    Args:
      df (DataFrame): The input dataset.

    Returns:
      DataFrame: The dataset with new features added.
    """
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df

def preprocess_features(df):
    """
    Preprocess features by standardizing the 'Age' and 'Fare' 
    columns using StandardScaler.

    Args:
      df (DataFrame): The input dataset.

    Returns:
      DataFrame: The dataset with standardized features.
    """
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df

def split_data(df, target_name='Survived'):
    """
    Split the dataset into training and testing sets.

    Args:
      df (DataFrame): The input dataset.
      target_name (str): The name of the target variable column.

    Returns:
      tuple: Contains the training features, testing features, 
             training target, and testing target datasets.
    """
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)

if __name__ == "__main__":
    data_path = 'data.csv'
    df = load_data(data_path)
    df = clean_data(df)
    df = feature_engineering(df)
    df = preprocess_features(df)
    X_train, X_test, y_train, y_test = split_data(df)

IDEs like VSCode typically offer extensions that generate a docstring template automatically as soon as you add a multiline string below a function definition.

This helps you quickly get the correct format of your choice.

Formatting

Formatting is another very important concept.

Code is read far more often than it is written. And no one wants to read code that is poorly formatted and therefore hard to grasp.

In Python, the PEP 8 style guide defines conventions you can follow to make your code more readable.

Some important formatting rules available in this style guide:

  • Use four spaces for code indentation
  • Limit all lines to a maximum of 79 characters
  • Avoid extraneous whitespace in certain situations (e.g., directly inside brackets, or between a trailing comma and a closing parenthesis)
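To make the whitespace rule concrete, here is a small before-and-after sketch (scale_column is a made-up function used only for illustration):

# Avoid extraneous whitespace inside brackets and around keyword arguments:
ages = scale_column( df[ 'Age' ] , with_mean = True )

# Preferred PEP 8 style:
ages = scale_column(df['Age'], with_mean=True)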

But remember: the formatting rules exist to make the code more readable. If applying a rule would make the code less readable in a specific case, it is fine to deviate from it.

Other important formatting rules mentioned in the book Clean Code:

  • Keep files small (around 200 to 500 lines) to keep them easier to understand
  • Use blank lines between different groupings to indicate different concepts (i.e. use a blank line between a code block initializing the ML model and the one running the training)
  • Define the caller function above the callee to give your program a natural reading flow
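To illustrate the last two points, here is a small sketch (the function names are made up for this example): the high-level caller run_training is defined above the helpers it calls, and a blank line separates setting up the model from training it.

def run_training(df):
    model = build_model()

    train_model(model, df)
    return model

def build_model():
    ...

def train_model(model, df):
    ...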

So, decide with your team on the set of rules that you want to follow and then stick to them!

You can use extensions in your IDE to support sticking to your guidelines. VSCode, for example, offers several extensions for that purpose.

You can use Python packages like Pylint and autopep8 to support formatting your Python scripts.

Pylint is a static code analyzer that rates your code with a score out of 10, while autopep8 can automatically format your code to comply with the PEP 8 standard.
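If you don't have them installed yet, both tools are available on PyPI:

pip install pylint autopep8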

Let's take a look into that using the code snippet from earlier in this article:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data(data_path):
    return pd.read_csv(data_path)

def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df

def feature_engineering(df):
    df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 18, 65, 99], labels=['child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df

def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df

def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)

if __name__ == "__main__":
  data_path = 'data.csv'
  df = load_data(data_path)
  df = clean_data(df)
  df = feature_engineering(df)
  df = preprocess_features(df)
  X_train, X_test, y_train, y_test = split_data(df)

Let's now save that to a file called train.py and run Pylint to check the score that we would get for that code snippet:

pylint train.py

This results in the following output:

************* Module train
train.py:29:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:30:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:31:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:32:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:33:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:34:0: C0304: Final newline missing (missing-final-newline)
train.py:34:0: W0311: Bad indentation. Found 2 spaces, expected 4 (bad-indentation)
train.py:1:0: C0114: Missing module docstring (missing-module-docstring)
train.py:5:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:5:14: W0621: Redefining name 'data_path' from outer scope (line 29) (redefined-outer-name)
train.py:8:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:8:15: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:13:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:13:24: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:18:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:18:24: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:23:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:23:15: W0621: Redefining name 'df' from outer scope (line 30) (redefined-outer-name)
train.py:29:2: C0103: Constant name "data_path" doesn't conform to UPPER_CASE naming style (invalid-name)

------------------------------------------------------------------
Your code has been rated at 3.21/10

Wow, only a score of 3.21 out of 10.

You could now go and fix these issues manually and then rerun it. Or you can use the autopep8 package to resolve some of these issues automatically.

Let's go with the second approach:

autopep8 --in-place --aggressive --aggressive train.py

The train.py script now looks like the following:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data(data_path):
    return pd.read_csv(data_path)

def clean_data(df):
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df

def feature_engineering(df):
    df['AgeGroup'] = pd.cut(
        df['Age'], bins=[
            0, 18, 65, 99], labels=[
            'child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df

def preprocess_features(df):
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df

def split_data(df, target_name='Survived'):
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)

if __name__ == "__main__":
    data_path = 'data.csv'
    df = load_data(data_path)
    df = clean_data(df)
    df = feature_engineering(df)
    df = preprocess_features(df)
    X_train, X_test, y_train, y_test = split_data(df)

After another run of Pylint, we get a score of 5.71 out of 10; the remaining deductions are mainly due to the missing function docstrings:

************* Module train
train.py:1:0: C0114: Missing module docstring (missing-module-docstring)
train.py:6:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:6:14: W0621: Redefining name 'data_path' from outer scope (line 38) (redefined-outer-name)
train.py:10:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:10:15: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:16:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:16:24: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:25:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:25:24: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:31:0: C0116: Missing function or method docstring (missing-function-docstring)
train.py:31:15: W0621: Redefining name 'df' from outer scope (line 39) (redefined-outer-name)
train.py:38:4: C0103: Constant name "data_path" doesn't conform to UPPER_CASE naming style (invalid-name)

------------------------------------------------------------------
Your code has been rated at 5.71/10 (previous run: 3.21/10, +2.50)

I've now added the docstrings and fixed the last remaining issues.

The final code now looks like this:

"""
This script aims at providing an end-to-end training pipeline.

Author: Patrick 

Date: 2/14/2024
"""

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.

    Args:
        data_path (str): The file path to the dataset.

    Returns:
        DataFrame: The loaded dataset.
    """
    return pd.read_csv(data_path)

def clean_data(df):
    """
    Clean the input DataFrame by removing rows with 
    missing values and filtering out entries with non-positive ages.

    Args:
        df (DataFrame): The input dataset.

    Returns:
       DataFrame: The cleaned dataset.
    """
    df.dropna(inplace=True)
    df = df[df['Age'] > 0]
    return df

def feature_engineering(df):
    """
    Perform feature engineering on the DataFrame, 
    including creating age groups and determining if the individual is an adult.

    Args:
        df (DataFrame): The input dataset.

    Returns:
        DataFrame: The dataset with new features added.
    """
    df['AgeGroup'] = pd.cut(
        df['Age'], bins=[
            0, 18, 65, 99], labels=[
            'child', 'adult', 'senior'])
    df['IsAdult'] = df['Age'] > 18
    return df

def preprocess_features(df):
    """
    Preprocess the 'Age' and 'Fare' features of the 
    DataFrame using StandardScaler to standardize the features.

    Args:
        df (DataFrame): The input dataset.

    Returns:
        DataFrame: The dataset with standardized features.
    """
    scaler = StandardScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
    return df

def split_data(df, target_name='Survived'):
    """
    Split the DataFrame into training and testing sets.

    Args:
        df (DataFrame): The dataset to split.
        target_name (str, optional): The name of the target variable column. Defaults to 'Survived'.

    Returns:
        tuple: The training and testing features and target datasets.
    """
    features = df.drop(target_name, axis=1)
    target = df[target_name]
    return train_test_split(features, target, test_size=0.2, random_state=42)

if __name__ == "__main__":
    data = load_data("data.csv")
    data = clean_data(data)
    data = feature_engineering(data)
    data = preprocess_features(data)
    X_train, X_test, y_train, y_test = split_data(data)

Running Pylint now returns a score of 10:

pylint train.py

-------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 7.50/10, +2.50)

Awesome!

This really shows how Pylint and autopep8 help you clean up your code and quickly conform to the PEP 8 standard.

Error Handling

Error handling is another critical concept.

Error handling ensures your code can handle unexpected situations without crashing or producing incorrect results.

Just imagine you have a model deployed behind an API, and users can send data to it. If a user sends malformed data, the application could crash, which would not make a good impression: the user would most probably blame your application and claim it is not well developed.

It would be better if the users got back a specific error code and a message that clearly tells them what they did wrong.

And this is where Python Exceptions come into play.

Let's say users can upload a CSV file to your application, which loads it into a pandas data frame and then forwards it to your model to make predictions.

You would then have a function like the following:

import pandas as pd

def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.

    Args:
        data_path (str): The file path to the dataset.

    Returns:
        DataFrame: The loaded dataset.
    """
    return pd.read_csv(data_path)

So far, so good.

But what happens when the user does not provide the CSV file?

Your program will crash with the following error message:

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

As you are running an API, it would simply return an HTTP 500 code to the user, saying that there was an "internal server error."

The user would probably blame your application for that, as they can't see that they are responsible for the error.

What is a better way of handling it?

Add a try-except block and catch the FileNotFoundError to handle that case properly:

import pandas as pd
import logging

def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.

    Args:
        data_path (str): The file path to the dataset.

    Returns:
        DataFrame: The loaded dataset.
    """
    try:
        return pd.read_csv(data_path)
    except FileNotFoundError:
        logging.error("The file at path %s does not exist. Please ensure that you have uploaded the file properly.", data_path)

But now we are only logging the error message. It would be better to define a custom exception that our API can handle in order to return a specific error code to the user:

import pandas as pd
import logging

class DataLoadError(Exception):
    """Exception raised when the data cannot be loaded."""
    def __init__(self, message="Data could not be loaded"):
        self.message = message
        super().__init__(self.message)

def load_data(data_path):
    """
    Load dataset from a specified CSV file into a pandas DataFrame.

    Args:
        data_path (str): The file path to the dataset.

    Returns:
        DataFrame: The loaded dataset.
    """
    try:
        return pd.read_csv(data_path)
    except FileNotFoundError as exc:
        logging.error("The file at path %s does not exist. Please ensure that you have uploaded the file properly.", data_path)
        raise DataLoadError(f"The file at path {data_path} does not exist. Please ensure that you have uploaded the file properly.") from exc

And then, in the primary function of your API:

try:
    df = load_data('path/to/data.csv')
    # Further processing and model prediction
except DataLoadError as error:
    # Return a response to the user with the error message,
    # for example: return Response({"error": str(error)}, status=400)
    pass

Now, the user would get back an error code 400 (Bad Request) with an error message telling them what went wrong.

They would now know what to do and would no longer blame your program for not working correctly.
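As a hedged sketch of how this could look, here is a minimal FastAPI endpoint (FastAPI, the /predict route, and the model object are assumptions for illustration, not part of the original code):

from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.post("/predict")
def predict(data_path: str):
    """Hypothetical endpoint: the client sends the path to an uploaded CSV file."""
    try:
        df = load_data(data_path)  # load_data and DataLoadError as defined above
        predictions = model.predict(df)  # 'model' is assumed to be loaded elsewhere
        return {"predictions": predictions.tolist()}
    except DataLoadError as error:
        # HTTP 400 (Bad Request) with a message that tells the user what went wrong
        raise HTTPException(status_code=400, detail=str(error)) from error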

Object-Oriented Programming

Let's now come to another crucial concept that probably everyone working with Python is at least somewhat aware of (people who started learning to program with Python sometimes take it for granted): Object-Oriented Programming (OOP).

What is OOP?

Object-oriented programming is a programming paradigm that provides a means of structuring programs so that properties and behaviors are bundled into individual objects [2].

You can read more about OOP in [2], and plenty of other resources are available as well.

The main benefits of using OOP:

  • Objects hide data by encapsulation.
  • Code can be reused through inheritance.
  • Complex problems can be broken down into small objects, and developers can focus on one object at a time.
  • Enhanced readability.

And there are many more advantages. I highlighted the most important ones (at least to me).

Let's now look at a small example in which a class called "TrainingPipeline" is defined with a few base methods:

from abc import ABC, abstractmethod

class TrainingPipeline(ABC):
    def __init__(self, data_path, target_name):
        """
        Initialize the TrainingPipeline.

        Args:
            data_path (str): The file path to the dataset.
            target_name (str): Name of the target column.
        """
        self.data_path = data_path
        self.target_name = target_name
        self.data = None
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None

    @abstractmethod
    def load_data(self):
        """Load dataset from data path."""
        pass

    @abstractmethod
    def clean_data(self):
        """Clean the data."""
        pass

    @abstractmethod
    def feature_engineering(self):
        """Perform feature engineering."""
        pass

    @abstractmethod
    def preprocess_features(self):
        """Preprocess features."""
        pass

    @abstractmethod
    def split_data(self):
        """Split data into training and testing sets."""
        pass

    def run(self):
        """Run the training pipeline."""
        self.load_data()
        self.clean_data()
        self.feature_engineering()
        self.preprocess_features()
        self.split_data()

This is an abstract base class that only declares the abstract methods every class derived from it must implement.

This is really useful for defining a blueprint or template that all subclasses have to follow.

One example subclass could then look like the following:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class ChurnPredictionTrainPipeline(TrainingPipeline):
    def load_data(self):
        """Load dataset from data path."""
        self.data = pd.read_csv(self.data_path)

    def clean_data(self):
        """Clean the data."""
        self.data.dropna(inplace=True)

    def feature_engineering(self):
        """Perform feature engineering."""
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns
        self.data = pd.get_dummies(self.data, columns=categorical_cols, drop_first=True)

    def preprocess_features(self):
        """Preprocess features."""
        numerical_cols = self.data.select_dtypes(include=['int64', 'float64']).columns
        scaler = StandardScaler()
        self.data[numerical_cols] = scaler.fit_transform(self.data[numerical_cols])

    def split_data(self):
        """Split data into training and testing sets."""
        features = self.data.drop(self.target_name, axis=1)
        target = self.data[self.target_name]
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            features, target, test_size=0.2, random_state=42)

This has the advantage that you can build an application that automatically calls the methods of the training pipeline, and different training pipeline classes can be created. They remain compatible with one another because they all follow the blueprint defined in the abstract base class.
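A minimal usage sketch (the file name churn_data.csv and the target column Churn are assumptions for illustration):

if __name__ == "__main__":
    pipeline = ChurnPredictionTrainPipeline("churn_data.csv", target_name="Churn")
    # run() is inherited from the base class and calls the steps in order:
    # load_data -> clean_data -> feature_engineering -> preprocess_features -> split_data
    pipeline.run()

Because run() lives in the abstract base class, every other pipeline subclass can be executed in exactly the same way.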

You can read more about abstract classes and their benefits in this article.

Testing

This chapter is one of the most important ones.

Having tests can decide the success or failure of a whole project.

Writing code without tests is faster at first, as creating unit tests for every function can feel like a "waste of time." The code for the unit tests can even quickly outgrow the code of the function itself.

But believe me, it is worth the effort!

If you skip unit tests, you will feel the pain eventually, although sometimes not right at the beginning.

But after your code base grows and you add more features, you will definitely feel the pain. Suddenly, adapting one function's code can lead to the failure of other functions. New releases require a lot of hotfixes. Customers get pissed. And the developers in the team fear adapting anything in the code base, leading to a very low velocity in releasing new features.

Therefore, always follow the Test Driven Development (TDD) principle whenever you work on code that should be productionized later!

The book Clean Code introduces three laws of TDD [1]:

  1. Write a failing unit test first before you start writing production code.
  2. Write not more of a unit test than is sufficient to fail.
  3. Write not more production code than is sufficient to pass the currently failing test.

So, the tests are written before the production code, which forces the developer to think about what the function should do before implementing it.
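As a small, hedged illustration of these laws (is_adult is a made-up function): the test is written first and fails because the function does not exist yet; only then is just enough production code added to make it pass.

# Step 1: write a failing unit test first
def test_is_adult_returns_true_from_age_18():
    assert is_adult(18) is True
    assert is_adult(17) is False

# Step 2: write just enough production code to make the test pass
def is_adult(age):
    return age >= 18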

In Python, libraries like unittest or pytest can be used to test your functions.

I personally prefer pytest.

You can read more about testing in Python in this article. The article also focuses on integration tests, which is another vital aspect of testing to ensure that your system works end-to-end.

Let's again take a look into the ChurnPredictionTrainPipeline class from the chapter before:

import pandas as pd
from sklearn.preprocessing import StandardScaler

class ChurnPredictionTrainPipeline(TrainingPipeline):
    def load_data(self):
        """Load dataset from data path."""
        self.data = pd.read_csv(self.data_path)

    ...

And let's now add unit tests for loading the data using pytest:

from unittest.mock import patch

import pandas as pd
import pytest

# Import the pipeline class defined in the previous chapter
# (adjust the module name to wherever the class lives in your project).
from churn_library import ChurnPredictionTrainPipeline

@pytest.fixture
def path():
    """
    Return the path to the test csv data file.
    """
    return r"./data/bank_data.csv"

def test_import_data_returns_dataframe(path):
    """
    Test that import data can load the CSV file into a pandas dataframe.
    """
    churn_predictor = ChurnPredictionTrainPipeline(path, "Churn")
    churn_predictor.load_data()

    assert isinstance(churn_predictor.data, pd.DataFrame)

def test_import_data_raises_exception():
    """
    Test that exception of "FileNotFoundError" gets raised in case the CSV
    file does not exist.
    """
    with pytest.raises(FileNotFoundError):
        churn_predictor = ChurnPredictionTrainPipeline("non_existent_file.csv",
                                                       "Churn")
        churn_predictor.load_data()

def test_import_data_reads_csv(path):
    """
    Test that the pandas.read_csv function gets called.
    """
    with patch("pandas.read_csv") as mock_csv:
        churn_predictor = ChurnPredictionTrainPipeline(path, "Churn")
        churn_predictor.load_data()
        mock_csv.assert_called_once_with(path)

These unit tests are:

  1. Testing whether the CSV file can be loaded into a pandas data frame.
  2. Testing that a FileNotFoundError exception gets thrown in case the CSV file does not exist.
  3. Testing that the pandas "read_csv" function gets called.

The process was not entirely TDD, as I had already developed the code before adding the unit tests. But in the ideal case, you would write these unit tests even before implementing the load_data function.

Systems

Would you build a city all at once? Probably not.

The same applies to software.

Building a clean system is all about breaking it down into smaller components. Each component is built using clean code principles and is well-tested.

The most important part of this chapter is the separation of concerns:

  • Separate the startup process, where dependencies are constructed, from the runtime logic.
  • Instantiate all objects in your main function and pass them into the classes that depend on them (dependency injection), as sketched below.

This approach helps build the system incrementally, making it easy to expand and add more functionality later.
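Here is a minimal sketch of this idea in Python (the ModelTrainer class, the 'Survived' column, and the file name are made up for this example; load_data is the function defined earlier in this article): the dependencies are constructed in the main block and injected into the class that needs them, instead of the class creating them itself.

from sklearn.linear_model import LogisticRegression

class ModelTrainer:
    def __init__(self, data_loader, model):
        # Dependencies are passed in (injected), not constructed inside the class
        self.data_loader = data_loader
        self.model = model

    def train(self, data_path):
        data = self.data_loader(data_path)
        features = data.drop("Survived", axis=1)
        target = data["Survived"]
        self.model.fit(features, target)
        return self.model

if __name__ == "__main__":
    # Startup: construct all dependencies in one place ...
    trainer = ModelTrainer(data_loader=load_data, model=LogisticRegression())
    # ... runtime logic: use them
    trainer.train("data.csv")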

Emergence

In this chapter, the author presents four rules of Simple Design:

  1. Run all tests.
  2. Remove duplications.
  3. Express the intent of the programmer.
  4. Minimize the number of classes and methods.

The first rule is the most important, and the fourth is the least important.

Rules 2–4 focus on the refactoring aspect. Don't focus on them when you start coding. Don't try to get it right from the beginning.

Start simple (even with ugly code) and then refactor to adhere to the rules mentioned above.
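For rule 2, here is a tiny hedged example (the function names are made up): the duplicated cleaning logic is extracted into one helper, so there is only one place to change later.

# Before: the same cleaning steps are duplicated in two functions
def prepare_train_data(df):
    df = df.dropna()
    return df[df['Age'] > 0]

def prepare_scoring_data(df):
    df = df.dropna()
    return df[df['Age'] > 0]

# After: the duplication is removed by extracting a single helper
def drop_invalid_rows(df):
    df = df.dropna()
    return df[df['Age'] > 0]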

Concurrency

Concurrency is sometimes helpful to speed up a program by cleverly switching between tasks.

Concurrency can also be seen as a decoupling strategy, as the different parts need to be independent for concurrency to improve the overall runtime.

Concurrency also incurs some overhead and makes the program more complex, so decide wisely whether it is worth the effort.

For example, you need to deal with shared resources and synchronize access to them.

In Python, you can make use of the module asyncio. Read more about concurrency in Python in this article.
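Here is a minimal sketch using asyncio (the task names and the one-second sleep only simulate I/O-bound work such as HTTP requests): both downloads run concurrently instead of one after the other.

import asyncio

async def download(name):
    # Simulate an I/O-bound task, e.g. an HTTP request
    await asyncio.sleep(1)
    return f"{name} finished"

async def main():
    # Both tasks run concurrently, so this takes about one second instead of two
    results = await asyncio.gather(download("dataset_a"), download("dataset_b"))
    print(results)

if __name__ == "__main__":
    asyncio.run(main())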

Refactoring

Refactoring your code improves readability and maintainability.

Always start simple, even with ugly code first. Get it to run. Then refactor. Remove duplications, improve namings, and reduce complexity.

But keep in mind to have your tests available before you start refactoring. This ensures that you are not breaking things when you refactor.

You should always refactor your code to make it cleaner. Too many developers, myself included when I started, have the mindset: "My code runs now, so I push it and move on to the next task."

Kill that mindset! Otherwise, you will run into a lot of issues as your codebase grows and you have to deal with ugly code that is hard to maintain.

Conclusion

Writing clean code is an art. It requires discipline and is not practiced often enough.

But it is extremely important for the success of software projects.

As a data scientist, you tend not to write clean code, as you mainly focus on finding good models and getting the metrics you aim for while running code in Jupyter Notebooks.

I also never cared about writing clean code when mainly working on data science projects.

However, it is essential for data scientists, too, to write clean code so that their models make it to production faster.

As a rule of thumb, stick to clean code whenever you write code that is supposed to be reused.

You can always start simple. Don't overthink from the beginning onwards. Instead, refine your code iteratively.

And please never forget to write unit tests for your functions to ensure they work properly!

Not having tests can cause immense issues when you want to expand your functions. It can even lead to you not wanting to touch your codebase anymore.

Have you ever told yourself, "Never change a running system"? You will never say that again if you stick to the principles introduced in the book Clean Code and summarized in this article.

I'm still learning and adapting these principles to my daily life. And I can say they really help!

But I still have a long way to go as well, as you can't expect to master all the rules overnight. It takes time.

Just stick to them and try to invest the effort. It will pay off!

One last note: Automate as much as possible!

Most IDEs offer an extensive set of extensions you can use to stick to the Clean Code rules.

You can, for example, refer to this LinkedIn article, which shows how to set up VSCode in a way that allows you to stick to clean formatting rules and follow naming conventions.


Thank you for reading my article to the end! I hope you enjoyed this article. If you want to read more articles like this in the future, follow me to stay updated.

Join my email list if you want to learn more about machine learning and the cloud.

Contact

LinkedIn | GitHub


References

[1] Martin, R. C. (2008). Clean Code: A Handbook of Agile Software Craftsmanship. Prentice Hall.
[2] Amos, D. (2023). Object-Oriented Programming (OOP) in Python 3. Real Python. (Accessed 2/16/2024).
