Encapsulation: A Software Engineering Concept Data Scientists Must Know To Succeed

Author: Murphy

Why You Should Read This Article

Data Science attracts a variety of different backgrounds. From my professional experience, I've worked with colleagues who were once:

  • Nuclear Physicists
  • Post-docs researching Gravitational Waves
  • PhDs in Computational Biology
  • Linguists

just to name a few.

It is wonderful to meet such a diverse set of backgrounds, and I have seen this variety of minds lead to the growth of a creative and effective Data Science function.

However, I have also seen one big downside to this variety:

Everyone has had different levels of exposure to key Software Engineering concepts, resulting in a patchwork of coding skills.

As a result, I have seen work done by some data scientists that is brilliant, but is:

  • Unreadable – you have no idea what they are trying to do.
  • Flaky – it breaks the moment someone else tries to run it.
  • Unmaintainable – code quickly becomes obsolete or breaks easily.
  • Un-extensible – code is single use and its behaviour cannot be extended.

…which ultimately dampens the impact their work can have and creates all sorts of issues down the line.

  • unreadable code means bugs go unnoticed,
  • unnoticed bugs break models and/or data pipelines,
  • meanwhile, it takes days, if it happens at all, to even find the bug because the code is unreadable,
  • whilst the original writer of the code has since left the company, so you can't get help,
  • leading to models being re-trained from scratch, data pipelines being re-written, and entire projects being redone.

So, in a series of articles, I plan to outline some core software engineering concepts that I have tailored to be necessities for data scientists.

They are simple concepts, but the difference between knowing them vs not knowing them clearly draws the line between amateur and professional.

Whether you are planning to go into data science as a graduate, are a professional looking for a career change, or are a manager in charge of establishing best practices, this article is for you.

Today's Concept: Encapsulation


There are a lot of confusing and abstract definitions on the internet which overcomplicate a simple concept.

In essence, encapsulation is all about hiding complex details that don't matter to someone who is going to read or use your code.

You only want to reveal what it does and how someone can use it. An analogy I remember when learning this myself is:

All drivers know how to use a steering wheel. But most have no idea how or why it works, and they don't need to.

Encapsulation is also synonymous with information/data hiding.

The effective use of encapsulation should lead to your code being:

  • Easier to read.
  • Easier to maintain (i.e. diagnose and fix bugs).
  • Easier to extend (i.e. add or change functionality).

You've Seen Encapsulation Already

The scikit-learn suite of estimators is an ideal example of encapsulation.

KMeans

We all know how KMeans works and what it is supposed to do. We also know how to use the scikit-learn implementation of KMeans – it is all documented as docstrings.

from sklearn.cluster import KMeans
model = KMeans() # initialise a model
model.fit(X_train) # fit the model
model.predict(X_test) # predict on new data

But do we need to know exactly how the fit and predict methods are implemented? Do I need to go through all 126 lines of code in fit to know how to use it?

No. All we need to know is that it outputs some clusters for the data we give it.

Even a newbie data scientist will be able to use this interface and understand it to some degree. This is the point of Encapsulation.


We, as users, don't need to know the intricacies of what's happening under the hood. All the details are tucked away nice and neatly under a meaningful function called fit.

All we need to know is what the expected inputs and outputs of fit are, and how to use it appropriately.
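
In practice, the documented interface is all you ever need to consult. A quick way to convince yourself of this in a notebook (a minimal sketch; the toy arrays below are my own illustration, not taken from scikit-learn's documentation):

import numpy as np
from sklearn.cluster import KMeans

# Read the contract of `fit` (expected inputs and outputs) without opening its source.
help(KMeans.fit)

# Use the interface on some illustrative data; the internals stay hidden.
X_train = np.random.rand(100, 2)  # hypothetical training data
X_test = np.random.rand(10, 2)    # hypothetical new data

model = KMeans(n_clusters=3)
model.fit(X_train)
print(model.predict(X_test))      # one cluster label per new data point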

Now, let's see how this applies to a typical data scientist's day-to-day work with a practical example in a Jupyter Notebook.

Practical Example

Let's start with an example task:

Build a model to predict Medium blog article earnings.

First things first – you need data to do this. For simplicity, let's assume some data is already available to you in the form of the below CSV file.

Article Lifetime Stats

This dataset identifies each article by a unique ID column ‘article', alongside its various statistics and its earnings as of December 2024.

Image and Data by author. Example contents of the CSV file, ‘Article Lifetime Stats'.

As you can see, there are various issues with this data (a small, illustrative mock-up follows the list below):

  1. The published date column has poorly formatted cells and invalid dates.
  2. The claps column has null values.
  3. The comments column (number of comments on the article) has negative values, which cannot happen and are therefore invalid.
  4. The earnings column has a cell with an invalid value of 9999999999.
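
To make these issues concrete, here is a small mock-up of what such a CSV could look like (the column names follow the cleaning code later in the article; the rows and values are made up purely for illustration, not the real data):

import io
import pandas as pd

# Hypothetical mock-up of the described CSV; values are illustrative only.
mock_csv = io.StringIO(
    "article,published date,claps,comments,earnings\n"
    "a1,2024-07-32,120,3,54.20\n"      # out-of-range date
    "a2,01/07/2024,,-2,38.75\n"        # mixed date format, null claps, negative comments
    "a3,2024-06-15,87,5,9999999999\n"  # implausibly large earnings value
)
df = pd.read_csv(mock_csv)
print(df)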

To clean this up, let's say you open up a new Jupyter notebook and write the following code in a cell:

# can you figure out what's happening? How long did it take? 
# I bet most of you didn't even bother to read the code. 
# Why do you think that is?
import numpy as np
import pandas as pd

df = pd.read_csv('~/Downloads/test.csv')
df.loc[df['published date'] == '2024-07-32', 'published date'] = '2024-08-01'
df['published date'] = pd.to_datetime(df['published date'], format='mixed')
df['claps'] = df['claps'].fillna(0)
df.loc[np.log10(df['earnings']) > 10, 'earnings'] = 0

This code will make perfect sense to the data scientist who wrote it. But what about your colleague or manager who is reading your notebook in the future to understand how you built the prediction model?

What if you were off on holiday and your work needed to be re-run to generate the same output on a different set of data?

What if you, the writer of this code, were forced to revisit this code a year later? Are you sure you can remember and understand every detail of why you wrote this code?

If I were reading this code, I would be thinking:

What on earth is going on here? What's important? Why were these operations done? Do I need to run the entire notebook and go through this code line-by-line to understand it?! FOR EVERY CELL?

Encapsulation At Work

To use encapsulation effectively in Data Science, you will need to think of the following:

  • Who is your audience?
  • Who might need to read this code in the future?
  • Why would they need to read this code?

Your code may need to be read by someone else for many reasons:

  • To replicate results for auditing purposes.
  • To rerun code using different parameters.
  • To debug issues with the outputs or conclusions from this notebook that are impacting a downstream process.

Based on your judgement, you need to decide:

  • Which parts of your code are important and are intended to be used directly by the reader? (Parameters, File Paths, Constants)
  • Which parts of your code can be hidden away?

And the hiding away should make your code easier to read and understand, by packaging the details into manageable and meaningful chunks.

Now, consider the below code, rewritten using the principles of encapsulation.

In Cell 1 of your Jupyter Notebook:

import numpy as np
import pandas as pd

def clean_published_date_column(col: pd.Series) -> pd.Series:
    """Cleaning steps for `published date` column:

    1. One entry of '2024-07-32' which is a date that is out of range
    for July. We replace this with 1st Aug.
    2. Dates are of type `str`, and in different formats. This is 
    handled using `pandas.to_datetime` with args `format='mixed'`.

    Parameters
    ----------
    col: pandas.Series
        A Series containing article published date entries.
    Returns
    -------
    col: pandas.Series
        The cleaned `published date` column.
    """
    jul_date_out_of_range_mask = (col == '2024-07-32')
    jul_date_out_of_range_replacement = '2024-08-01'
    col.loc[jul_date_out_of_range_mask] = jul_date_out_of_range_replacement
    return pd.to_datetime(col, format='mixed')


def clean_earnings_column(col: pd.Series) -> pd.Series:
    """Cleaning steps for `earnings` column:

    1. One entry where the value is 999999999999. We replace
    any values exceeding 10 figures with zero.

    Parameters
    ----------
    col: pandas.Series
        A Series containing article lifetime earnings data.
    Returns
    -------
    col: pandas.Series
        The cleaned `earnings` column.
    """
    earnings_too_big_mask = (np.log10(col) > 10)
    earnings_too_big_replacement = 0
    col.loc[earnings_too_big_mask] = earnings_too_big_replacement
    return col


def clean_claps_column(col: pd.Series) -> pd.Series:
    """Cleaning steps for `claps` column:

    1. Fill null values with zero.

    Parameters
    ----------
    col: pandas.Series
        A Series containing number of claps per article.
    Returns
    -------
    pandas.Series
        The cleaned `claps` column.
    """
    return col.fillna(0)

In Cell 2:

df = pd.read_csv('~/Downloads/test.csv')
# I bet you didn't read the functions above.
# Well, that's the point of encapsulation. You shouldn't need to in order
# to understand what's going on here.
df['published date'] = clean_published_date_column(df['published date'])
df['claps'] = clean_claps_column(df['claps'])
df['earnings'] = clean_earnings_column(df['earnings'])

It's pretty simple stuff: all we've done is put our existing code into respective functions, given them meaningful names, and added docstrings to guide the reader on how to use each function and what its expected behaviour is.

You might be thinking

‘what a waste of time, this article is just telling me to put things into functions, big deal.'

Well, yes, partly.

Putting code into any function will always achieve some level of encapsulation. But effective encapsulation is not just about chucking everything into a function.

The difference would be analogous to

  • cleaning your room,

versus

  • sweeping all your clothes and junk under the bed:

def clean(df):
  df.loc[df['published date'] == '2024-07-32', 'published date'] = '2024-08-01'
  df['published date'] = pd.to_datetime(df['published date'], format='mixed')
  df['claps'] = df['claps'].fillna(0)
  df.loc[np.log10(df['earnings']) > 10, 'earnings'] = 0
  return df
df = pd.read_csv('~/Downloads/test.csv')
df = clean(df) # the 'sweeping everything under the bed' equivalent.

Let's first inspect what the effective encapsulation has achieved, and how this was made possible.

What has Encapsulation achieved?

For the sake of brevity, we will go through the two advantages most relevant to data science.

Also, please note that there is no single right answer to how you do this. There are many ways to apply encapsulation, depending on your intentions and assumptions. The below is my attempt, and I will try to justify why I have designed it this way and why it makes sense.

The most important thing is the effect the encapsulation achieves.

Better readability of your code

Firstly, we have created a hierarchy of information in terms of their detail and relevance to the reader.

df = pd.read_csv('~/Downloads/test.csv')
# data cleansing
df['published date'] = clean_published_date_column(df['published date'])
df['claps'] = clean_claps_column(df['claps'])
df['earnings'] = clean_earnings_column(df['earnings'])

I have purposefully encapsulated the nitty-gritty details of ‘cleaning' this dataset into three deliberately designed functions, which in turn serve the following purposes:

  1. They indicate to the reader that some complex operation has been carried out on the input data, and leave it up to the reader whether to pursue the details, given their objective.
  2. If the reader needs to peek under the hood, they are free to peruse each function and its docstring and immediately understand how it works in manageable, meaningful chunks (see the snippet after this list).
  3. If they don't need the details, they can move on quickly to the sections relevant to them, knowing at least at a higher level what the above code is meant to do.
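
Peeking under the hood doesn't even require scrolling back to Cell 1: because the behaviour is captured in docstrings, the reader can pull up each function's contract directly in the notebook. A small illustration, assuming the functions from Cell 1 are defined in the session:

# Pull up the documented contract without reading the function body.
help(clean_claps_column)

# In a Jupyter notebook, the `?` suffix shows the same docstring inline:
# clean_claps_column?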

This approach assumes that there are more important sections such as model building or feature exploration further down in the notebook. Thus, I felt the section on data cleaning is not the main focus of this code.

I have therefore named each function in the format clean_xxx_column, assuming that it suffices for the reader to know ‘some cleaning is done here to create a workable dataset', and that they can skip the details if their interest is the model building/feature exploration.

Ultimately, encapsulation should make your code easier to read, exposing just the right level of information for the reader to get a gist of what's happening whilst avoiding a bombardment of information.

Better readability makes it easier for others to use your code, to understand it, and to maintain it.

Better Extensibility and Separation of Concerns

Assume now that you receive a larger dataset from the same data source. You load it in, using the same code as before:

df = pd.read_csv('~/Downloads/bigger_test.csv')
# data cleansing
df['published date'] = clean_published_date_column(df['published date'])
df['claps'] = clean_claps_column(df['claps'])
df['earnings'] = clean_earnings_column(df['earnings'])

However, this time you find that additional cleaning steps are necessary; the claps column suffers from the same issue as the earnings column, where some values are extremely big: 9999999999.

In this example, encapsulation shows two benefits:

  1. Extensibility: It is easy, even trivial, to extend our existing code to handle the new data issue. Even someone new to the code will easily know which function they need to modify.

def clean_claps_column(col: pd.Series) -> pd.Series:
    """Cleaning steps for `claps` column:

    1. Fill null values with zero.
    2. Replace any values exceeding 10 figures with zero.

    Parameters
    ----------
    col: pandas.Series
        A Series containing number of claps per article.
    Returns
    -------
    pandas.Series
        The cleaned `claps` column.
    """
    # additional functionality added here.
    # Note: since this code is repeated in `clean_earnings_column`, 
    # it could be put into its own function. For brevity we leave it
    # as is for now. 
    too_big_mask = (np.log10(col) > 10)
    too_big_value_replacement = 0
    col.loc[too_big_mask] = too_big_value_replacement
    # existing functionality
    return col.fillna(0)

rather than having to scour every line of code to know where to add the change:

df = pd.read_csv('~/Downloads/bigger_test.csv')
# I bet you won't even bother reading the below because it is so untidy.
df.loc[df['published date'] == '2024-07-32', 'published date'] = '2024-08-01'
df['published date'] = pd.to_datetime(df['published date'], format='mixed')
df['claps'] = df['claps'].fillna(0)
df.loc[np.log10(df['claps']) > 10, 'claps'] = 0
df.loc[np.log10(df['earnings']) > 10, 'earnings'] = 0

  2. Separation of concerns: the code change is neatly contained within the clean_claps_column function, whilst all other code remains identical.

df = pd.read_csv('~/Downloads/bigger_test.csv')
# Nothing changes here. Same as usual. 
df['published date'] = clean_published_date_column(df['published date'])
df['claps'] = clean_claps_column(df['claps'])
df['earnings'] = clean_earnings_column(df['earnings'])

The section of code responsible for running the cleaning doesn't need to worry about what's happening under the hood of any of the cleaning functions as long as the function behaviour does not change.

This enables code changes to be tidier and easier to manage.
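
As the comment inside clean_claps_column hints, the repeated ‘too big' check is itself a candidate for encapsulation. A minimal sketch of one way to do it (the helper name and the max_digits parameter are my own choices, not part of the original code):

import numpy as np
import pandas as pd

def replace_implausibly_large_values(col: pd.Series, max_digits: int = 10) -> pd.Series:
    """Replace any values exceeding `max_digits` figures with zero."""
    too_big_mask = (np.log10(col) > max_digits)
    col.loc[too_big_mask] = 0
    return col

def clean_claps_column(col: pd.Series) -> pd.Series:
    """Cleaning steps for `claps` column: cap implausibly large values, then fill nulls with zero."""
    return replace_implausibly_large_values(col).fillna(0)

def clean_earnings_column(col: pd.Series) -> pd.Series:
    """Cleaning steps for `earnings` column: cap implausibly large values."""
    return replace_implausibly_large_values(col)

Both column cleaners now share a single, well-named building block, so any future change to that rule lives in exactly one place.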

What does this achieve?

The concept of encapsulation has been developed from decades of trial and error experience in the Software Engineering world.

The ultimate goal is to make code easier to maintain.


Poorly written code can quickly bloat into a storm of bugs and maintenance overhead, leading to your precious time being spent on bug-fixing rather than on developing exciting new models.

Data Science has gone past the stage where it was purely research-focused and models could be hacked together in Jupyter notebooks as proofs of concept.

The field has matured enough to make an impact on the real world, and data science teams are only now realising that they actually need to write production-ready code and, even worse, maintain it.

Unfortunately, I still see data scientists solely relying on notebooks and their unorganised code to maintain entire production pipelines. It makes me worry about the future of this industry when I see this.

Conclusion

In summary, encapsulation is a vital software engineering principle that can significantly enhance the readability, maintainability, and extensibility of your code.

By hiding complex details and exposing only what's necessary, you make your code more user-friendly and minimise rework.

For data scientists, mastering encapsulation not only improves the quality of your work but also sets you apart as a professional capable of building robust, production-ready systems.

Tags: Coding Data Science Getting Started Mlops Software Development
