The Hierarchy of ML tooling on the Public Cloud

1 ML Services on the Public Cloud
Not all ML services are built the same. As a consultant working in the public cloud, I can tell you that you are spoilt for choice when it comes to Artificial Intelligence (AI) / Machine Learning (ML) tooling on the three big public clouds: Azure, AWS, and GCP.
It can be overwhelming to process and synthesize the wave of information, especially when these services are constantly shipping new features.
Just imagine how much of a nightmare it would be to explain to a layperson which platform to choose, and why a particular tool is the right fit for your machine learning problem.
I'm writing this post to address that problem for others, as well as for myself, so that you walk away with a succinct, distilled understanding of what the public cloud has to offer. For the sake of simplicity, I will use the terms AI and ML interchangeably throughout this post.
2 Building a Custom ML System… should be a Last Resort
Before we jump into tooling comparisons, let's understand why we should even use managed services on the public cloud. It's a valid question to ask: why not build your own custom infrastructure and ML model from scratch? To answer it, let's take a quick look at the ML lifecycle.
The diagram below depicts a typical ML lifecycle (the cycle is iterative):

As you can see, there are many parts to the entire lifecycle that must be considered.
A well-known paper published by Google showed that only a small fraction of the effort that goes into building and maintaining ML models in production is spent writing the model training code.
This phenomenon is known as the hidden technical debt of ML systems in production, and the industry practice of managing that debt has come to be known by the umbrella term Machine Learning Operations (MLOps).
Below is a visual explanation of this point, adapted from Google's paper:

I won't go into a detailed explanation of each stage in the lifecycle, but here's a summarized list of definitions. If you're interested in learning more, I would recommend reading Machine Learning Design Patterns Chapter 9 on ML Lifecycle and AI Readiness for a detailed answer.
ML lifecycle summarized definitions:
- Data pre-processing – prepare data for ML training; data pipeline engineering
- Feature engineering – transform input data into new features that are closely aligned with the ML model learning objective
- Model training – training and initial validation of ML model; iterate through algorithms, train / test splits, perform hyperparameter tuning
- Model evaluation – model performance assessed against predetermined evaluation metrics
- Model versioning – version control of model artifacts, training parameters, and the model pipeline
- Model serving – serving model predictions via batch or real-time inference
- Model deployment – automated build, test, deployment to production, and model retraining
- Model monitoring – monitor infrastructure, input data quality, and model predictions
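As a toy illustration of the first few stages above, here is a sketch in plain Python. This is not any cloud SDK, and the data, threshold model, and variable names are all hypothetical; real systems would use a proper ML framework and far richer data:

```python
# Hypothetical raw data: (messy score string, label) pairs
raw = [(" 4.1", "spam"), ("1.0 ", "ham"), ("3.5", "spam"), ("0.7", "ham")]

# Data pre-processing: clean raw records into (float, label) pairs
records = [(float(x.strip()), label) for x, label in raw]

# Feature engineering: numeric feature and binary target aligned
# with the learning objective (spam = 1, ham = 0)
X = [score for score, _ in records]
y = [1 if label == "spam" else 0 for _, label in records]

# Model "training": a toy threshold model (mean of the scores)
threshold = sum(X) / len(X)

def predict(score: float) -> int:
    """Classify a score as spam (1) or ham (0)."""
    return int(score > threshold)

# Model evaluation: accuracy against a predetermined metric
accuracy = sum(predict(s) == t for s, t in zip(X, y)) / len(y)

# Model versioning: store the artifact with its parameters
model_v1 = {"version": "1.0", "threshold": threshold}

# Model serving: batch inference over the input scores
batch_predictions = [predict(s) for s in X]

print(accuracy)             # 1.0 on this toy data
print(batch_predictions)    # [1, 0, 1, 0]
```

Even in this trivial sketch, the training step is two lines, while everything around it (cleaning, evaluation, versioning, serving) dominates, which is exactly the imbalance the Google paper describes.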
Don't forget about platform infrastructure and security!
The ML lifecycle does not cover the supporting platform infrastructure, which has to be secured from an encryption, networking, and identity and access management (IAM) perspective.
Cloud services provide managed compute infrastructure, development environments, centralized IAM, encryption features, and network protection services that can meet the security compliance requirements of internal IT policies. Rather than building these capabilities yourself, leverage the power of the cloud to add ML capabilities to your product roadmap.
This section illustrates that writing the model training code is a relatively tiny part of the entire ML lifecycle, and that the data preparation, evaluation, deployment, and monitoring of ML models in production are the hard parts.
Naturally, the conclusion is that building your own custom infrastructure and ML models takes considerable time and effort, and the decision to do so should be a last resort.
3 Hierarchy of ML tooling
Here is where public cloud services come in to fill the gap. Broadly, the hyperscalers package two types of offerings for customers, forming an ML tooling hierarchy: