Making the Case for Test-Driven Development in Machine Learning
I bet you've been there – sitting at that table or attending that Machine Learning (ML) project meeting where an ML engineer or data scientist reports on the time consumed writing unit tests for code that may already be in production. But let's pause and think about what unit tests are actually for. Here's a definition: A unit test is a block of code that verifies the behavior of a small, isolated portion of the application code, typically a function or method. It's designed to ensure that this piece of code functions as expected, in line with the developer's initial design, requirements, and logic.
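To make the definition concrete, here is a minimal, purely illustrative example; the function and the value it checks are hypothetical and not taken from any real project:
import unittest

def celsius_to_fahrenheit(celsius):
    # Hypothetical function under test, used only to illustrate the definition
    return celsius * 9 / 5 + 32

class TestCelsiusToFahrenheit(unittest.TestCase):
    def test_freezing_point(self):
        # Verifies one small, isolated behavior against the developer's intent
        self.assertEqual(celsius_to_fahrenheit(0), 32)

if __name__ == '__main__':
    unittest.main()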
The core idea, ensuring that "the block of code runs as expected," is nothing new. Software developers have long embraced unit testing, often implementing these tests even before writing the actual code. However, ML engineers and data scientists often come from a different background and approach the problem differently. Their primary goal is to build mappings from an input X to a target Y, using statistical methods, mathematics, and a range of pre-built libraries like scikit-learn. But consider this: even before they can invoke a model.fit() method, they must handle numerous coding tasks, such as reading, validating, and transforming data, generating features, and ensuring data consistency.
All these tasks require careful coding. Do you see the big deal here?
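To give a feel for what that coding looks like, here is a rough, hypothetical sketch of the pre-model code that accumulates; the file path, column names, and chosen steps are placeholders, but each line is ordinary Python that can, and should, be unit tested:
import pandas as pd

def load_and_prepare(path: str) -> pd.DataFrame:
    # Read raw data (hypothetical CSV and column names, for illustration only)
    df = pd.read_csv(path)

    # Validate: drop rows with missing sensor readings
    df = df.dropna(subset=["sensor_reading"])

    # Transform: standardize the numeric column
    df["sensor_reading"] = (df["sensor_reading"] - df["sensor_reading"].mean()) / df["sensor_reading"].std()

    # Feature generation: a simple derived feature
    df["reading_squared"] = df["sensor_reading"] ** 2

    return df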
Before we dig deeper, let's informally define Test-Driven Development (TDD) as the practice of writing tests before, or during, the development of the actual code. This approach not only ensures that each function is thoroughly tested from the start but also aligns the development process with both immediate and long-term project goals.
The big deal
Going back to the meeting, while you listen to that person explaining that they spent two days writing unit tests for only 10% of the code, you can't help but wonder: "How does this person even know that the rest of the code does what it's supposed to do?" Think about it – without some tests, are we just hoping for the best? Relying on luck that the code not only works now but will continue to work as it scales, as a library is updated, or as new features are integrated? This isn't just about catching bugs. It's about ensuring that the foundation we're building on is solid, reliable, and robust. After all, in the fast-evolving world of ML, who has the time to backtrack and fix preventable errors that could have been caught early on with proper testing?
When you write unit tests post-mortem, like the data scientist from the introduction, you are disconnected from what was in your mind while writing that code. At that point, you are just looking for the tests to pass. That is the best you can do, since the contextual details are completely lost by then, possibly months after the code was written.
But how did this happen in the first place? How does someone end up writing tests only after the code is already serving an application in production?
The root of this issue often traces back to the environments and practices typical in machine learning development. ML projects are uniquely fast-paced and iterative, frequently conducted in interactive environments like Jupyter notebooks (the devil in disguise). These tools are invaluable for rapid prototyping and experimentation – they allow data scientists and ML engineers to quickly test hypotheses, adjust parameters on the fly, and visualize results in real-time. This approach is instrumental in navigating the complex landscape of machine learning where immediate feedback can significantly steer the direction of a project.
Yet this very strength carries its own weaknesses. The flexibility and ease of modifying code in a Jupyter notebook can lead to a casual attitude toward more structured software practices like writing unit tests. When code evolves in a notebook, it's easy to lose track of changes and the reasons behind them – each tweak or addition seems minor and controlled in the moment. As the codebase grows and becomes more complex, the lack of testing becomes a heavy burden.
Without unit tests, each piece of the system is a black box to any other developer – or even to the original author, months down the line (or days for that matter). As stated before, the initial conditions under which the code was created are forgotten, making it nearly impossible to guarantee that the code behaves as expected under all circumstances. This oversight is not just a minor lapse; it introduces significant risks, especially as the project scales or transitions from a research and development phase to production. Errors that could have been identified early through systematic testing might only surface after deployment, potentially leading to costly and time-consuming corrections in a live environment.
The tendency to retrofit unit tests after the fact, once the project is already in production, is an inefficient and often ineffective practice. At this stage, developers are no longer intimately connected with the details of every line of code; instead, they are effectively reverse-engineering their tests from the existing codebase. The primary goal shifts from ensuring that the code meets its design specifications to simply making the tests pass by any means necessary. This approach may validate basic functionality, but it hardly captures the subtleties of data-driven behavior and edge cases that are critical in machine learning applications. Think about this: it is like writing the exam questions after reading the answers.
To understand how helpful TDD can be, I will share two stories featuring two data scientists: Mark and Maria.
Too much talking, show me some ugly code
In the first story, observe Mark as he writes a function to preprocess a pandas DataFrame column. His purpose is simple: "I need to multiply each element of the list by a factor, because the sensor that generated that column of data was perhaps not properly calibrated."
"Mmm… Ok. Sensor is not calibrated but I happen to know the factor that can correct the issue. I just need a function that receives a list, a factor and returns the same list multiplied with a factor. It will be elegant because I will use a list comprehension."
def scale_list(numbers, factor):
    return [x * factor for x in numbers]
"That was easy, this should do it. How do I know if this works? I can just call the function with some lists. Then off for lunch!"
scale_list([1, 2, 3], 2)
# Output:
# [2, 4, 6]
"Ok. That looks good."
scale_list([0, 0, 10, 10000], 5)
# Output:
# [0, 0, 50, 50000]
"Great! It works with zeros. One more."
scale_list([0.5, 1., 10, 10000], 1.65)
# Output:
# [0.825, 1.65, 16.5, 16500.0]
"I need my calculator for this one. Well it works with floating numbers. Good enough for me."
Mark then proceeds to use his function, placing it into a module and importing it into the preprocessing code like this.
# ---------------------
# tools.py
# ---------------------
def scale_list(numbers, factor):
    return [x * factor for x in numbers]
# ---------------------
# preprocessing_pipeline.ipynb (yes a Jupyter notebook)
# ---------------------
from tools import scale_list
# read stuff and create a dataframe
# then use scale_list
factor = 1.5
df['scaled_values'] = df['original_values'].apply(lambda x: scale_list(x, factor))
And that is it. Stop for a moment and realize what has happened, what has been lost, and what the risks are.
…
Back here already?
No. Go back and think again.
…
In this scenario, the critical flaw is that Mark's entire reasoning and testing process, his informal trials with short lists of integers, zeros, and floating numbers, was discarded once the function was deemed satisfactory. Though crucial for his initial confidence in the function, these exploratory steps left no trace in the final code. No records of these tests exist; they evaporated as soon as he moved on. Consequently, if he or anyone else revisits this code even a few days later, the function might superficially appear robust and well-crafted: "It looks good," they'll say.
Now, let's consider the risks involved. Imagine, for any number of conceivable reasons (a miscommunication, a change in upstream data handling, or perhaps one of a myriad of other situations), the 'original_values' column in the DataFrame suddenly looks like this: ['0', 1, 2., 0.5, 1], or even includes problematic entries like [0, 1.5, NaN].
What then?
The result is immediate: the system grinds to a halt, the data processing pipeline breaks down, and the model can no longer be retrained. The entire operation stalls, and the worst part? There's no straightforward way to discern whether this breakdown is due to an oversight in the initial design or a new, unforeseen anomaly.
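To make the failure mode concrete, here is what Mark's function actually does with those inputs; this is plain Python behavior, and the calls are purely illustrative:
from tools import scale_list

scale_list(['0', 1, 2., 0.5, 1], 1.5)
# TypeError: can't multiply sequence by non-int of type 'float'

scale_list([0, 1.5, float('nan')], 1.5)
# Output:
# [0.0, 2.25, nan]   <- no error at all; the NaN silently propagates downstream
The string element crashes the pipeline on the spot, while the NaN sails through and only surfaces later, for example when the model refuses to retrain, which is arguably worse.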
Now show me some better code
Now look at this second, alternate story. Let us introduce Maria, a data scientist who understands the importance of writing tests before the code (or at least alongside it).
"Mmm… Ok. Sensor is not calibrated but I happen to know the factor that can correct the issue. I just need a function that receives a list, a factor and returns the same list multiplied with a factor. It will be elegant because I will use a list comprehension."
def scale_list(numbers: list, factor: float):
    """
    Receives a list of numbers and returns a new list multiplied by a factor.
    """
    return [x * factor for x in numbers]
"Looks good. Will it work? Let's see. First things first. This should work with integers and floating numbers."
Then she proceeds to write some boilerplate tests. She even wrote some type hints and documentation!
import unittest
from tools import scale_list  # assuming scale_list lives in a tools.py module, as in Mark's story

class TestScaleList(unittest.TestCase):
    def test_scaling_by_positive_integer_factor(self):
        factor = 2
        self.assertEqual(scale_list([1, 2, 3], factor), [2, 4, 6])

    def test_scaling_by_positive_floating_factor(self):
        factor = 2.0
        self.assertEqual(scale_list([1., 2., 3.], factor), [2., 4., 6.])

if __name__ == '__main__':
    unittest.main()
Realize that the extra effort here is minimal. Observe that in the previous story, Mark also wrote code to test the function, and then threw it away! The only difference between Mark and Maria is that the latter explicitly stated her thinking in a piece of code.
"Works for integers and floating numbers. What am I missing? It should work as well for negative factors. I should add another test and then I am off for the weekend."
import unittest
from tools import scale_list  # assuming scale_list lives in a tools.py module, as in Mark's story

class TestScaleList(unittest.TestCase):
    def test_scaling_by_positive_integer_factor(self):
        factor = 2
        self.assertEqual(scale_list([1, 2, 3], factor), [2, 4, 6])

    def test_scaling_by_positive_floating_factor(self):
        factor = 2.0
        self.assertEqual(scale_list([1., 2., 3.], factor), [2., 4., 6.])

    def test_scaling_by_negative_factor(self):
        self.assertEqual(scale_list([1, 2, 3], -1), [-1, -2, -3])

if __name__ == '__main__':
    unittest.main()
She then proceeds to push the changes.
After enjoying a relaxing weekend, Maria returns to work and reviews the tests she wrote before leaving. As she goes through them, she's struck by their clarity: they aren't just tests; they're a form of documentation. They swiftly bring her back to the state of mind she had when she wrote the code, making it almost immediate to recall why and how she designed the function. It is clear that the function was intended to handle integers, floating numbers, and negative factors. There is no ambiguity; the functionality is explicitly defined and safeguarded, not just in her memory but in the codebase, committed and pushed for all to see.
When the infamous lists containing strings and NaN values cause the system to crash again, the situation is frustrating, yet informative. Unlike before, there is no doubt about why the failure occurred: the code was explicitly not designed to handle such inputs. The presence of these tests, even if they don't cover every conceivable scenario, provides a significant advantage. They inform Maria, and anyone else who might work on the project, that the code was knowingly written and safeguarded for specific data types.
This realization shifts the narrative from an unexpected failure to a known limitation. Now, instead of scrambling as Mark did to understand what went wrong, Maria can focus on enhancing the function to handle these edge cases, if necessary. Even a test suite that does not anticipate every possible error scenario is invaluable. It sets clear boundaries and expectations for how the code should operate, making any deviations from these expectations straightforward to diagnose and address.
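One possible way she could do that, sketched here as an assumption rather than something Maria actually wrote, is to make the contract explicit in the function and pin it down with tests:
import math
import unittest

def scale_list(numbers: list, factor: float) -> list:
    """
    Receives a list of numbers and returns a new list multiplied by a factor.
    Rejects non-numeric or NaN elements instead of failing unpredictably.
    """
    for x in numbers:
        if not isinstance(x, (int, float)) or (isinstance(x, float) and math.isnan(x)):
            raise ValueError(f"scale_list expects numeric, non-NaN values, got {x!r}")
    return [x * factor for x in numbers]

class TestScaleListEdgeCases(unittest.TestCase):
    def test_rejects_string_elements(self):
        with self.assertRaises(ValueError):
            scale_list(['0', 1, 2.], 1.5)

    def test_rejects_nan_elements(self):
        with self.assertRaises(ValueError):
            scale_list([0, 1.5, float('nan')], 1.5)

if __name__ == '__main__':
    unittest.main()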
Minimal effort, maximum impact: lessons from the two stories on the power of TDD
Reflecting on both scenarios, it's important to recognize that the additional effort required to write formal tests is minimal. In practice, the act of testing, whether formally through unit tests or informally through ad-hoc methods, requires the same preliminary steps: you provide inputs and verify outputs. The critical difference lies not in the amount of work but in the mindset and a slight adjustment in the process.
Transitioning to a testing framework involves a small shift in thinking: from a temporary check to a permanent test case. This change is accompanied by some initial setup or boilerplate code, which might seem like an overhead but quickly pays dividends. Writing these tests doesn't have to be time-consuming; it's about capturing the checks you're already performing as part of the development cycle in a structured format.
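In practice the shift can be as small as this: take the throwaway check Mark typed into the console and keep it as a test case (reusing the scale_list example from the stories):
# Ad-hoc check, typed once and thrown away:
# >>> scale_list([1, 2, 3], 2)
# [2, 4, 6]   <- verified by eyeballing the output

# The same check, captured as a permanent test case:
import unittest
from tools import scale_list  # assuming the function lives in tools.py

class TestScaleList(unittest.TestCase):
    def test_scaling_by_positive_integer_factor(self):
        self.assertEqual(scale_list([1, 2, 3], 2), [2, 4, 6])

if __name__ == '__main__':
    unittest.main()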
Basic unit testing is both quick and declarative. It does not require a considerable amount of time; rather, it integrates seamlessly into the development process. By investing a few extra minutes to write comprehensive, or even basic, unit tests, you embed a robust safety net that guards against unexpected behaviors and simplifies future code maintenance. This practice not only secures your application but also promotes a culture of quality and reliability within your team.
And what if Mark from the first story returns to write tests after writing code?
Even if he returns to add tests after developing his code, the effort is less effective. By this point, the original nuances and explorative thinking are no longer fresh in his mind. Consequently, he might simply write basic tests that ensure the code runs without errors, rather than creating tests that capture the full scope of functionality and any edge cases he considered initially. As mentioned before, this retrofitted testing often results in a superficial validation that the code works under normal conditions but fails to safeguard against more complex scenarios. The tests might pass, but they don't guarantee robustness or a deep understanding, highlighting the critical need for integrating testing as a fundamental part of the development process. Realize that the story's scenario is a simple one; real-world scenarios are far more challenging and require far more focus to understand what might be wrong once the code has already been written.
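As a purely hypothetical illustration of what such retrofitted testing tends to look like, compare this with Maria's tests: it passes, yet it encodes almost none of the original intent.
import unittest
from tools import scale_list

class TestScaleListRetrofitted(unittest.TestCase):
    def test_scale_list_runs(self):
        # Only checks that the call does not raise and returns something;
        # it says nothing about scaling, factors, types, or edge cases.
        result = scale_list([1, 2, 3], 2)
        self.assertIsNotNone(result)

if __name__ == '__main__':
    unittest.main()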
Final thoughts
In conclusion, Test-Driven Development (TDD) and basic unit testing are not just beneficial practices; they are essential for robust, reliable software development, especially in fields like ML where the data and conditions can be highly variable. These practices are fast and yield substantial value by ensuring that every piece of code not only meets its specifications but also maintains its integrity over time against unforeseen challenges.
For both Mark and Maria, embracing TDD could transform their workflow from one of uncertainty and frequent disruptions to one of confidence and efficiency. Instead of being reactive – scrambling to fix bugs after they derail their projects – both data scientists could be proactive, using unit tests to guide their code development from the very beginning. This shift would not only secure their applications but also enhance their credibility and capability as professionals.
By adopting TDD, you, Mark, Maria, and indeed any ML engineer or developer can ensure that the code not only works in the present but is also prepared for the future, making the software resilient and adaptable in the face of change.