Track Your ML Experiments

Every data scientist is familiar with experimentation.
You know the drill. You get a dataset, load it into a Jupyter notebook, explore it, preprocess the data, fit a baseline model or two, and then train a first version of your final model, such as XGBoost. The first time around, maybe you skip hyperparameter tuning and include 20 features. Then you check your error metrics.
They look okay, but perhaps your model is overfitting a bit. So you decide to tune some regularization parameters (e.g., max depth) to reduce the complexity of the model and run it again.
You see a little improvement from the last run, but perhaps you want to also:
- Add more features
- Perform feature selection and remove some features
- Try a different scaler for your features
- Tune different/more hyperparameters
As the number of tests you want to run grows, it becomes harder to remember which combinations of these "experiments" actually yielded the best results. You can only rerun a notebook so many times, print out the results, and copy/paste them into a Google Doc before you get frustrated.
This is where Experiment Tracking comes in.
As I mentioned in my article about becoming a great data scientist, having a formal way to track your experiments will make your life a lot easier and your results much clearer.
In this article, I'll walk you through how to set up an experiment using Neptune.ai, which lets you run experiments on one project for free so you can get familiar with the process. There are plenty of other great experiment tracking tools out there, but since I'm most familiar with Neptune, that's the one I'll base this guide on. This is not promotional in any way – I just want to showcase what experiment tracking looks like in Python, and Neptune is my tool of choice.
Getting started with Neptune
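If you haven't installed the client yet, it's available on PyPI. Recent releases ship under the package name neptune, while older ones used neptune-client, so check which one your project expects:

pip install neptune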
After you've pip installed Neptune and set up your Jupyter notebook environment, you'll need to link your notebook to the Neptune server, which requires an API token. Once you've set up your account, logged in, and created your first project, click on your username in the bottom left corner and select "Get your API token". Copy and paste this token into the following code to begin a Neptune run:
import neptune

run = neptune.init_run(
    project="your-project-name",
    api_token="your-token",
)
This run object will be your communication point to and from the Neptune UI. When you first log in to Neptune and create a new project, you should be given a code snippet similar to the one above with your credentials filled in. If not, just copy and paste my template.
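As a side note, if you'd rather not hardcode credentials in your notebook, Neptune can also pick them up from environment variables. The variable names below follow Neptune's documented convention, but double-check them against the client version you're running:

import os

# Set these before calling neptune.init_run() so the token never
# appears in your notebook or version control (or export them in your shell)
os.environ["NEPTUNE_API_TOKEN"] = "your-token"
os.environ["NEPTUNE_PROJECT"] = "your-project-name"

run = neptune.init_run()  # reads the project and token from the environment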
Check out Neptune's quickstart guide here for more detailed instructions and troubleshooting information.
Data info
The dataset I used for this example was a time series dataset of energy consumption, with additional features such as temperature and wind speed. The specifics of my dataset aren't too important for this tutorial; the main thing to know is that I did a regression task (as opposed to a classification task) using a time series dataset.
After loading in my dataset, the first things I like to log are dataset-related metrics, such as:
- The mean, median, and mode of the dataset
- The start and end dates of the dataset (if this is a time series dataset)
- The sizes of my train and test sets (number of rows in each)
# Train and test sizes
train_size = int(df.shape[0] * 0.8)
df_train = df.iloc[0:train_size]
df_test = df.iloc[train_size:]
test_size = df_test.shape[0]

run["train/train_size"] = train_size
run["test/test_size"] = test_size

# Train and test set start and end dates
run["train/start_date"] = df_train.index.min()
run["train/end_date"] = df_train.index.max()
run["test/start_date"] = df_test.index.min()
run["test/end_date"] = df_test.index.max()
When logging information in the run, notice how I assigned values to keys the same way I would in a Python dictionary. A forward slash / creates a folder inside of which the field (of any given data type) will live, and you can nest folders as deeply as you like.
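For example, here's how I might log the summary statistics from the list above into a nested data/target namespace. This is just a sketch: the column name energy_consumption is a placeholder for whatever your target column is called.

# Log basic statistics of the target column under a nested namespace
# "energy_consumption" is a placeholder - swap in your own target column
run["data/target/mean"] = df["energy_consumption"].mean()
run["data/target/median"] = df["energy_consumption"].median()
run["data/target/mode"] = df["energy_consumption"].mode()[0]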
Here's what folders in a Neptune experiment will look like. When you click on them, you can see your fields inside.


Scikit learn API + summaries
Neptune can log information about scikit-learn-compatible models such as Random Forest, XGBoost, Decision Tree, and more through its scikit-learn integration API. All you need to do to access the methods this API offers is a separate pip install:
pip install -U neptune-sklearn
One of the most useful functions this integration offers is the summary function. If you're new to Neptune and not yet sure which metrics you want to track during experiments, or you simply want to explore what Neptune can log as metadata, summaries are a great way to get started.
For scikit-learn models, Neptune offers three summary methods:
- create_regressor_summary()
- create_classifier_summary()
- create_kmeans_summary()
Since I was training an XGBoost model (which exposes a scikit-learn-compatible regressor API), I used create_regressor_summary and assigned it to a folder called "summary".
import xgboost as xgb
import neptune.integrations.sklearn as npt_utils

# Define XGB model
bst = xgb.XGBRegressor(n_estimators=100)

# Fit/train XGB model
# Assuming I did my train test split already - code not shown
bst.fit(X_train, y_train)

run["summary"] = npt_utils.create_regressor_summary(bst, X_train, X_test, y_train, y_test)
Predictions and metrics
In the above summary method, Neptune automatically logs the test predictions as an HTML file under test/preds. To do this manually (without calling the summary method), you can use the File type to convert a DataFrame into HTML and upload it to the run.
import pandas as pd
from neptune.types import File

predicted_test = bst.predict(X_test)

# Turn predictions into dataframe
test_predictions_df = pd.DataFrame(predicted_test, columns=['Predictions'])

# Upload as html
run["test/predictions"].upload(File.as_html(test_predictions_df))
Logging metrics such as MSE, RMSE, and R² is arguably one of the most important parts of experiment tracking; without them, it's impossible to know whether your model is improving. Neptune provides a get_scores() method which, depending on your model type (e.g., regressor or classifier), will log the relevant scores. For single-output regressors, this method will track:
- R²
- Explained variance
- Max error
- Mean absolute error (MAE)
Multi-output regressors will track:
- R²
And classifiers will track:
- Precision
- Recall
- F1 score
- Support
run["scores_train"] = npt_utils.get_scores(bst, X_train, y_train)
run["scores_test"] = npt_utils.get_scores(bst, X_test, y_test)
I typically also want to include MSE, RMSE, and MAPE. Here's how to log specific metrics one at a time:
# Calculate metrics as normal
from sklearn.metrics import (
    r2_score,
    mean_squared_error,
    mean_absolute_percentage_error,
)

r2 = r2_score(y_test.values, predicted_test)
mse = mean_squared_error(y_test.values, predicted_test)
rmse = mean_squared_error(y_test.values, predicted_test, squared=False)
mape = mean_absolute_percentage_error(y_test.values, predicted_test)
run["scores/test/mse"] = mse
run["scores/test/rmse"] = rmse
run["scores/test/r2"] = r2
run["scores/test/mape"] = mape
Visuals and images
Neptune provides a few methods for generating informative charts, such as:
- create_learning_curve_chart
- create_feature_importance_chart
- create_residuals_chart
- create_prediction_error_chart
- create_cooks_distance_chart
- create_classification_report_chart
- create_confusion_matrix_chart
- create_roc_auc_chart
- create_precision_recall_chart
For the purposes of my XGBoost model, the chart I really wanted to see was the feature importance chart. Though this does get created under "summary/diagnostics_charts" when you call create_regressor_summary(), you can create it individually as well.
# Create feature importance chart
run["visuals/feature_importances"] = npt_utils.create_feature_importance_chart(bst,X_train,y_train)
You are also able to log your own custom charts, such as a plotly figure that showcases the test set predictions plotted against the actual values.
import plotly.graph_objects as go

# Create a fig object with one trace for predictions and one for actuals
fig = go.Figure([
    go.Scatter(x=y_test.index, y=predicted_test, name='predicted'),
    go.Scatter(x=y_test.index, y=y_test.values, name='actual'),
])

# Log the figure
run["visuals/test_predictions"] = fig
Stop run + View results
Once you are finished with an individual run/experiment, be sure that you end it by running:
run.stop()
If you don't, any metrics you log for a new run will overwrite the results from your previous run. And if you try to initialize a new run, you'll get an error.
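One pattern I find helpful here (a minimal sketch, not a Neptune-specific feature) is wrapping the logging code in try/finally so the run gets stopped even if something in the middle throws an error:

try:
    # ... all of the training and logging code from above ...
    run["scores/test/rmse"] = rmse
finally:
    run.stop()  # always close the run, even if training or logging fails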
Once you have completed your run and stopped it, you can navigate to the Neptune UI and view the results under your run ID.
Besides looking at the metrics for an individual run, Neptune allows you to compare runs side by side. To do this, go to the eye symbol for the runs you want to compare and make sure it is set to visible (if it is not visible, there will be a strike through the eye).

Next, go to the very top of the page and select "compare runs".

From here, you should be able to compare runs based on images, charts, artifacts, and more. I personally recommend clicking the "side by side" tab, which allows you to compare all numerical values such as hyperparameters and metrics.
Overall + More documentation
Overall, experiment tracking is an important part of a data scientist's routine when it comes to building effective machine learning models. There are a variety of popular platforms, including MLflow, DVC, Neptune.ai, and ClearML. For the purposes of this article, I chose to showcase Neptune because I've used it at work and have found it easy to set up. It's a good starting point for those who are trying to learn experiment tracking.
The main downside to Neptune is that it isn't free after your first project. Most data scientists work on multiple projects, so as a free solution it isn't the best choice; for more cost-effective tools, look to open source options such as MLflow. The UI can also be a bit challenging to navigate at times since there are so many options, and it can feel overwhelming and take some time to get used to where everything is.
To dive deeper into Neptune and the features I covered today, check out the official documentation, which goes through setup, getting started, and examples in more detail.