The Ultimate Guide to Evaluating the Impact of Outlier Treatment in Time Series


Picture this: You are working with time-series data, searching for patterns and investigating the trends over time.

You have run an exploratory data analysis on your time series and chosen the best methods to detect outliers in your data.

After detecting them, you either ignored them, removed them or, most likely, transformed them.

Now comes the time to evaluate the impact of that treatment: how did the distribution of your data change? How well is your machine learning model predicting the target variable?

Besides, one could be curious about:

  • What metrics will you use to assess the performance of the model?
  • How will you visualize the changes in data distribution?
  • What factors might have influenced the predictions of your models?
  • Is there any bias in the data that might affect evaluation?

We will answer some of these questions using a dataset, so that you can reproduce the results.

This article is the fourth part of a series of outlier treatment articles. Make sure you check the other three so you can have a complete picture of how to identify and treat outliers in time-series data:

  1. Statistical methods and tools for outlier detection in time-series.
  2. Machine learning methods and tools for outlier detection in time-series.
  3. Outliers Found: Now What? A Guide to Treatment Options.

You might also want to check this one on important techniques to master time-series analysis:

5 Must-Know Techniques for Mastering Time-Series Analysis

In this final article of the series, we will explore how your choice of outlier treatment influences how well your models behave.


Hello there!

My name is Sara Nóbrega, and I am a Data Scientist specializing in AI Engineering. I hold a Master's degree in Physics and I later transitioned into the exciting world of Data Science.

I write about data science, artificial intelligence, and data science career advice. Make sure to follow me and subscribe to receive updates when the next article is published!

Sara's Data Science Free Resources


Data Note

I simulated time-series data that represent energy production over 2 months using 10-minute interval measurements.

I generated this data in an attempt to simulate real-world patterns: energy production is higher during daylight hours and naturally lower in night hours.

About 10% of the data points were labeled as outliers to simulate spikes in production due to unusual events or errors in measurement.

This dataset serves as an illustrative example as we go through a number of techniques, with code snippets, for checking the effect of outlier treatment.

Here is how you can generate this data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

# Setting parameters for dataset generation
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 3, 1)
time_interval = timedelta(minutes=10)

# Generate the datetime index with 10-minute intervals
datetime_index = pd.date_range(start=start_date, end=end_date, freq='10T')

# Generate base energy production values simulating daily cycles (higher in the day, lower at night)
np.random.seed(42)
base_energy = []
for dt in datetime_index:
    hour = dt.hour
    # Simulate higher production in daytime hours and lower at night
    if 6 <= hour <= 18:
        # Daytime production (higher)
        energy = np.random.normal(loc=300, scale=30)  # Example daytime mean production
    else:
        # Nighttime production (lower)
        energy = np.random.normal(loc=50, scale=15)  # Example nighttime mean production
    base_energy.append(energy)

# Convert base_energy to a pandas series for easier manipulation
energy_production = pd.Series(base_energy)

# Introduce outliers by replacing 10% of the data with extreme values
num_outliers = int(0.1 * len(energy_production))
outlier_indices = np.random.choice(len(energy_production), num_outliers, replace=False)
energy_production.iloc[outlier_indices] *= np.random.choice([1.5, 2, 2.5, 3], size=num_outliers)  # Scale up for outliers

# Creating the final DataFrame
energy_data = pd.DataFrame({'Datetime': datetime_index, 'Energy_Production': energy_production})

energy_data.head(5)
Datetime Energy_Production
0 2023-01-01 00:00:00 57.450712
1 2023-01-01 00:10:00 47.926035
2 2023-01-01 00:20:00 89.572992
3 2023-01-01 00:30:00 72.845448
4 2023-01-01 00:40:00 46.487699

Let's look at the data:

# Plotting the time-series data with outliers
plt.figure(figsize=(14, 6))
plt.plot(energy_data['Datetime'], energy_data['Energy_Production'], label='Energy Production', color='blue', alpha=0.7)
plt.xlabel('Time')
plt.ylabel('Energy Production (kW)')
plt.title('Simulated Energy Production with Outliers (10-Minute Intervals Over 2 Months)')
plt.legend()
plt.tight_layout()
plt.show()
Figure 1: Time-Series Plot of Simulated Energy Production with Outliers Over Two Months at 10-Minute Intervals | Image by Author.

How to Assess the Effect of the Outlier Treatment: A Practical Guide

Outliers are a pain when working with data. On one hand, they can distort your analysis or model. On the other hand, removing or modifying them can sometimes go wrong. So how do you know you're doing the right thing when dealing with outliers?

Why is it important to evaluate the effects of outlier treatments?

This is why you need to take a moment to evaluate the effects:

1. Model Accuracy and Interpretability

Outliers can distort your model's predictions. But they can also reveal important insights. For example, a sudden increase in customer spending might seem very unusual, but it may indicate a high-value customer. Therefore, it is important to ensure that treating outliers does not lead to misinterpretation.

2. Model Robustness

If you don't evaluate the impact of the treatment, your model might end up too rigid: it may work well on your training data but behave badly when applied to new or unseen data.

Removing too many outliers can make your model brittle, while ignoring them can lead to overfitting.

3. Preserving Valuable Information

Remember that not all outliers are mistakes or anomalies.

In many cases, they reflect important patterns in your data. Hence, it is worth consulting domain experts to understand the reasons behind these discrepancies, because important information may be lost if you simply delete them.

This is why post-treatment evaluations are necessary to verify that you are making the best decisions about what data points to keep or modify!

How to Evaluate the Impact of Outlier Treatment

Having established why this post-evaluation is an essential process, let's dive into how you can check if the treatment of an outlier was helpful or not.

Below are several techniques, each with a code snippet demonstrating how to apply it.

1. Statistics Comparison Before and After Treatment

A simple point of comparison is the basic statistics of the data before and after treating outliers.

Imagine that you treated your outliers by capping these extreme values.

You start by comparing the summary statistics to determine if the treatment significantly changed the shape of the data distribution.

Here's how you can do it in Python:


# Calculate outliers using the IQR method
Q1 = energy_data['Energy_Production'].quantile(0.25)
Q3 = energy_data['Energy_Production'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Count outliers based on IQR method
outliers_count_iqr = ((energy_data['Energy_Production'] < lower_bound) | 
                      (energy_data['Energy_Production'] > upper_bound)).sum()

# Calculate the percentage of outliers
percentage_outliers_iqr = (outliers_count_iqr / len(energy_data)) * 100

# Summary statistics before outlier treatment
before_treatment_stats = energy_data['Energy_Production'].describe()

# Apply outlier treatment (capping at 1st and 99th percentiles)
lower_cap = energy_data['Energy_Production'].quantile(0.01)
upper_cap = energy_data['Energy_Production'].quantile(0.99)
energy_data_capped_1_99 = energy_data['Energy_Production'].clip(lower=lower_cap, upper=upper_cap)

# Summary statistics after outlier treatment
after_treatment_stats_1_99 = energy_data_capped_1_99.describe()

# Display results
print("Outliers Count (IQR Method):", outliers_count_iqr)
print("Percentage of Outliers (IQR Method):", percentage_outliers_iqr)
print("nSummary Statistics Before Outlier Treatment:n", before_treatment_stats)
print("nSummary Statistics After Outlier Treatment (1st and 99th Percentiles):n", after_treatment_stats_1_99)
Outliers Count (IQR Method): 237
Percentage of Outliers (IQR Method): 2.789219724608685

Summary Statistics Before Outlier Treatment:
count    8497.000000
mean      209.829015
std       174.471194
min        -7.549833
25%        53.744142
50%       258.516008
75%       306.697167
max      1098.962075
Name: Energy_Production, dtype: float64

Summary Statistics After Outlier Treatment (1st and 99th Percentiles):
 count    8497.000000
mean      209.163717
std       171.357249
min        18.879831
25%        53.744142
50%       258.516008
75%       306.697167
max       880.731165
Name: Energy_Production, dtype: float64

The count of observations remains intact because the treatment applied here, known as capping or Winsorization, does not eliminate any data points; it simply pulls the most extreme observations back within the chosen percentile range.

Summary Statistics Before Treatment:

  • Mean: 209.83
  • Standard Deviation: 174.47
  • Minimum: -7.55 (this is indicative of extreme low outliers)
  • Maximum: 1098.96 (this is indicative of extreme high outliers)

Summary Statistics After Treatment (1st and 99th Percentiles):

  • Mean: Slightly adjusted to 209.16
  • Standard Deviation: Reduced to 171.36
  • Minimum: Increased to 18.88
  • Maximum: Lowered to 880.73

We can see that capping at the 1st and 99th percentiles decreases both the range and the standard deviation.

This technique offers a good balance between stability and realism: the resulting dataset is less vulnerable to extreme values but still representative enough for further analysis or modeling.

2. Visual Evaluation

When treating outliers in time-series data, use visualizations that respect the temporal structure.

For example, histograms alone cannot capture trends over time, but other plot formats are often more appropriate:

  1. Line Plots: Plot the values before and after treatment on the same line plot. This shows how the treatment affects the time-series trend and whether it smooths out the outliers without distorting the overall time-dependent structure of the series.
  2. Rolling Window or Moving Average Plots: Plot a moving average before and after treating outliers to show the treatment's effect on short-term trends. This often reveals whether the treatment preserved or disrupted the original pattern (see the sketch after the line-plot example below).
  3. Zoomed-In Sections: Anomalies are easier to understand in context. By zooming into specific parts of the time series, you can see how the outlier handling changes behavior in those segments.

Here is an example of a line plot before and after treatment:

import matplotlib.pyplot as plt

# Plotting energy production before and after outlier treatment (1st and 99th percentiles) on the same plot
plt.figure(figsize=(14, 6))
plt.plot(energy_data['Datetime'], energy_data['Energy_Production'], label='Before Outlier Treatment', color='blue', alpha=0.7)
plt.plot(energy_data['Datetime'], energy_data_capped_1_99, label='After Outlier Treatment (1st & 99th Percentiles)', color='cyan', linestyle='--', alpha=0.7)
plt.title('Energy Production Before and After Outlier Treatment (Capped at 1st and 99th Percentiles)')
plt.xlabel('Time')
plt.ylabel('Energy Production (kW)')
plt.legend()
plt.tight_layout()
plt.savefig('ff.png', format='png', dpi=300)
# Display the plot
plt.show()
Figure 2: Comparison of Simulated Energy Production Before and After Outlier Treatment Using 1st and 99th Percentile Capping | Image by Author.

This type of plot can be of great help if you are going to present results to managers or stakeholders that do not have deep technical knowledge.
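
For the rolling-window and zoomed-in views from the list above, here is a minimal sketch. It reuses energy_data and energy_data_capped_1_99 from earlier; the 6-hour window and the particular week shown are illustrative choices, not part of the original analysis.

import pandas as pd
import matplotlib.pyplot as plt

# 6-hour rolling mean = 36 ten-minute intervals (illustrative window size)
rolling_before = energy_data['Energy_Production'].rolling(window=36).mean()
rolling_after = energy_data_capped_1_99.rolling(window=36).mean()

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10))

# Top panel: rolling means over the full period - did the treatment preserve the overall pattern?
ax1.plot(energy_data['Datetime'], rolling_before, label='Before Outlier Treatment', color='blue', alpha=0.7)
ax1.plot(energy_data['Datetime'], rolling_after, label='After Outlier Treatment (1st & 99th Percentiles)', color='cyan', linestyle='--', alpha=0.7)
ax1.set_title('6-Hour Rolling Mean Before and After Outlier Treatment')
ax1.set_ylabel('Energy Production (kW)')
ax1.legend()

# Bottom panel: zoom into one week to inspect local behaviour around the treated points
zoom = energy_data['Datetime'].between(pd.Timestamp('2023-01-15'), pd.Timestamp('2023-01-22'))
ax2.plot(energy_data.loc[zoom, 'Datetime'], energy_data.loc[zoom, 'Energy_Production'], label='Before Outlier Treatment', color='blue', alpha=0.7)
ax2.plot(energy_data.loc[zoom, 'Datetime'], energy_data_capped_1_99[zoom], label='After Outlier Treatment (1st & 99th Percentiles)', color='cyan', linestyle='--', alpha=0.7)
ax2.set_title('Zoomed-In Week (2023-01-15 to 2023-01-22)')
ax2.set_xlabel('Time')
ax2.set_ylabel('Energy Production (kW)')
ax2.legend()

plt.tight_layout()
plt.show()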

3. Distribution Comparison (Kolmogorov-Smirnov Test)

The Kolmogorov-Smirnov test (KS test) is a non-parametric test that compares the distributions of two samples.

It can be used to check whether the distribution of your target variable – or of your features – changed considerably after treatment. That's useful for verifying whether the outlier treatment altered the overall shape of the data.

from scipy.stats import ks_2samp

# Performing the KS test to compare the original and capped distributions
ks_stat, p_value = ks_2samp(energy_data['Energy_Production'], energy_data_capped_1_99)

# Displaying the KS statistic and p-value
print("Kolmogorov-Smirnov Test Results")
print(f"KS Statistic: {ks_stat}")
print(f"P-Value: {p_value}")
Kolmogorov-Smirnov Test Results
KS Statistic: 0.010003530657879251
P-Value: 0.7888857198183333

What to look for:

  • A high p-value (> 0.05) indicates the distributions are similar, i.e., the treatment did not significantly alter the distribution of the data.
  • A low p-value (< 0.05) means the treatment has statistically altered the distribution. That may be exactly what you wanted (the outliers were tamed), or it may signal that undesired changes were introduced to the overall structure of the data. Either way, further analysis is needed to make sure the changes are aligned with the goals of your pre-processing and modeling.

In our example:

  • KS Statistic: A value of 0.0100 indicates a negligible difference between the original and capped distributions, i.e., only a small shift in the cumulative distribution after treating the outliers.
  • P-Value: The p-value of 0.7889 is well above the rule-of-thumb threshold of 0.05, so we cannot reject the null hypothesis. Statistically speaking, the energy production data before and after the outlier treatment do not have significantly different distributions.

These results therefore indicate that capping the data at the 1st and 99th percentiles removed the extreme values without changing the nature of the overall distribution.

This is an ideal outcome: reducing extreme values without distorting the original distribution is usually desirable, so that further analyses or models can generalize well from the cleaned data.
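
It can also help to see what the KS statistic actually measures: it is the largest vertical gap between the two empirical cumulative distribution functions (ECDFs). Here is a minimal sketch that overlays the two ECDFs; the small helper function is just for illustration.

import numpy as np
import matplotlib.pyplot as plt

def ecdf(series):
    """Return sorted values and their cumulative probabilities."""
    x = np.sort(series.values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x_before, y_before = ecdf(energy_data['Energy_Production'])
x_after, y_after = ecdf(energy_data_capped_1_99)

plt.figure(figsize=(10, 5))
plt.plot(x_before, y_before, label='Before Outlier Treatment', color='blue')
plt.plot(x_after, y_after, label='After Outlier Treatment (1st & 99th Percentiles)', color='cyan', linestyle='--')
plt.xlabel('Energy Production (kW)')
plt.ylabel('Cumulative Probability')
plt.title('Empirical CDFs Before and After Outlier Treatment')
plt.legend()
plt.tight_layout()
plt.show()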

4. Model Performance Metrics

The bottom line is that assessing the impact of outlier treatment boils down to one question:

Does your model perform any differently with and without those changes?

  • For regression models, consider metrics such as Mean Squared Error (MSE) or R-squared.
  • For classification models, accuracy, precision, recall, and F1-score are the standard metrics.

Below is how you can compare the performance of a regression model before and after outlier treatment.


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Step 1: Data Preparation - Adding time-based features
energy_data['hour'] = energy_data['Datetime'].dt.hour
energy_data['day_of_week'] = energy_data['Datetime'].dt.dayofweek

# Additional cyclic features for hour and day of the week
energy_data['hour_sin'] = np.sin(2 * np.pi * energy_data['hour'] / 24)
energy_data['hour_cos'] = np.cos(2 * np.pi * energy_data['hour'] / 24)
energy_data['day_sin'] = np.sin(2 * np.pi * energy_data['day_of_week'] / 7)
energy_data['day_cos'] = np.cos(2 * np.pi * energy_data['day_of_week'] / 7)

# Lagged feature (previous 10-minute reading) and a rolling mean over 3 intervals (30 minutes, including the current one)
energy_data['prev_hour_production'] = energy_data['Energy_Production'].shift(1)
energy_data['3hr_moving_avg'] = energy_data['Energy_Production'].rolling(window=3).mean()

# Drop rows with NaN values due to lagging and rolling mean
energy_data.dropna(inplace=True)

# Define features and target for original data
X = energy_data[['hour', 'day_of_week', 'hour_sin', 'hour_cos', 'day_sin', 'day_cos', 'prev_hour_production', '3hr_moving_avg']]
y = energy_data['Energy_Production']

# Train-test split for original data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Train and evaluate the regression model on original data
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate performance metrics for original data
mse_original = mean_squared_error(y_test, y_pred)
r2_original = r2_score(y_test, y_pred)

# Step 3: Apply outlier treatment (1st and 99th percentiles) on Energy_Production
lower_cap = energy_data['Energy_Production'].quantile(0.01)
upper_cap = energy_data['Energy_Production'].quantile(0.99)
energy_data_capped = energy_data['Energy_Production'].clip(lower=lower_cap, upper=upper_cap)

# Define features and target for capped data
y_capped = energy_data_capped[energy_data.index]  # Align capped data with features

# Train-test split for capped data
X_train_capped, X_test_capped, y_train_capped, y_test_capped = train_test_split(X, y_capped, test_size=0.2, random_state=42)

# Train and evaluate the regression model on capped data
model.fit(X_train_capped, y_train_capped)
y_pred_capped = model.predict(X_test_capped)

# Calculate performance metrics for capped data
mse_capped = mean_squared_error(y_test_capped, y_pred_capped)
r2_capped = r2_score(y_test_capped, y_pred_capped)

# Display results with clearer labels and three decimal places
print("Results for Model Performance Comparison:")
print("nOriginal Data Performance:")
print(f"Mean Squared Error (MSE): {mse_original:.3f}")
print(f"R-squared (R²): {r2_original:.3f}")

print("nOutlier-Treated Data Performance (1st and 99th Percentiles):")
print(f"Mean Squared Error (MSE): {mse_capped:.3f}")
print(f"R-squared (R²): {r2_capped:.3f}")
Results for Model Performance Comparison:

Original Data Performance:
Mean Squared Error (MSE): 4533.002
R-squared (R²): 0.840

Outlier-Treated Data Performance (1st and 99th Percentiles):
Mean Squared Error (MSE): 4325.025
R-squared (R²): 0.845

We train a linear regression model on two versions of the dataset: one with the raw data and another in which the energy production values were capped at the 1st and 99th percentiles to reduce extreme values.

For both datasets, we added time-based features such as the hour and day of the week, cyclical transformations of those, plus a lagged value and a short rolling mean, which can capture patterns and periodicity in energy production over time.

Results:

  • Original Data: The MSE was 4533.002 with an R-squared of 0.840, indicating a good fit but also suggesting some influence from extreme values.
  • Outlier-Treated Data: MSE decreased to 4325.025, while R² improved marginally to 0.845. This reflects a small but positive benefit from capping extreme values and points to slightly more stable model performance.

These minor improvements in MSE and R² suggest that the outlier treatment made the model less sensitive to extreme values, so it generalizes at least a bit better.

This can be helpful when extreme values in a dataset distort the predictions, making it possible to build a more reliable energy production forecasting model.

5. Cross-Validation

Cross-validation helps ensure that your model generalizes well to data it has never seen. Keep in mind that, for time-series data, it is not done the same way as for a regular dataset. I explore this and other concerns in this article:

5 Must-Know Techniques for Mastering Time-Series Analysis

Cross-validation splits your dataset into several subsets; it trains some and tests on others.

It should be done both before and after the treatment of outliers so that you can be sure the treatment really improved the robustness of the model.

Below is a code snippet for cross-validation:

import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

# Preparing the feature matrix and target variable for the original data
X = energy_data[['hour', 'day_of_week', 'hour_sin', 'hour_cos', 'day_sin', 'day_cos', 'prev_hour_production', '3hr_moving_avg']]
y = energy_data['Energy_Production']

# Define the cross-validation strategy for time series
tscv = TimeSeriesSplit(n_splits=5)

# Define the MSE scoring metric
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

# Perform cross-validation before outlier treatment and log individual scores
cv_scores_before = cross_val_score(LinearRegression(), X, y, cv=tscv, scoring=mse_scorer)
print("Before Outlier Treatment - Cross-Validation MSE:", -cv_scores_before.mean())

# Store individual scores for analysis
cv_results_df = pd.DataFrame({
    'Fold': range(1, len(cv_scores_before) + 1),
    'MSE_Before_Outlier_Treatment': -cv_scores_before
})

# Apply outlier treatment (capping at 1st and 99th percentiles)
lower_cap = y.quantile(0.01)
upper_cap = y.quantile(0.99)
y_capped = y.clip(lower=lower_cap, upper=upper_cap)

# Perform cross-validation after outlier treatment and log individual scores
cv_scores_after = cross_val_score(LinearRegression(), X, y_capped, cv=tscv, scoring=mse_scorer)
print("After Outlier Treatment - Cross-Validation MSE:", -cv_scores_after.mean())

# Add individual scores to the DataFrame
cv_results_df['MSE_After_Outlier_Treatment'] = -cv_scores_after

# Display or save the results
print("nCross-Validation Results by Fold:")
print(cv_results_df)
Before Outlier Treatment - Cross-Validation MSE: 5870.508803176994
After Outlier Treatment - Cross-Validation MSE: 5400.8711159136

Cross-Validation Results by Fold:
   Fold  MSE_Before_Outlier_Treatment  MSE_After_Outlier_Treatment
0     1                   6221.065872                  5772.671486
1     2                   6044.473486                  5375.268558
2     3                   5914.049891                  5581.564532
3     4                   5837.194581                  5218.241578
4     5                   5335.760186                  5056.609425

If cross-validation shows that the model's performance improved after treating the outliers, you are on the right track.

If not, you might want to rethink how you're handling those outliers.

In our example:

First, we ran cross-validation on the original data using Mean Squared Error (MSE) as the evaluation metric, recording the performance of each fold for later analysis.

To reduce the impact of extreme values, we then capped "Energy_Production" at the 1st and 99th percentiles to create the capped target variable, "y_capped".

Cross-validation on this capped data lets us see how model performance changed in each fold after the outlier treatment.

The MSE scores before and after outlier treatment were recorded in a DataFrame so that we could do a fold-by-fold comparison.

The results show that the average cross-validation MSE decreased from 5870.51 before outlier treatment to 5400.87 after. Looking at each fold, MSE is consistently lower, reflecting a more stable model.

In this example, the outlier treatment bounded the most extreme values and made the model's performance more consistent from one time split to another, adding to its robustness and its ability to generalize in this time-series forecasting setting.

6. Residual Analysis

Residuals are the difference between observed and predicted values.

They help you assess how well your model fits after the outlier treatment. If certain outliers have a detrimental effect on your model, you will see larger residuals for those data points.

Ideally, residuals should be smaller, with a more even spread, after treatment.

You can plot residual plots to compare the spread before and after the treatment:

import matplotlib.pyplot as plt

# Residuals from the models trained in section 4 (original and capped data)
residuals_before = y_test - y_pred
residuals_after = y_test_capped - y_pred_capped

# Plot residuals before and after outlier treatment
plt.figure(figsize=(14, 6))

# Plot residuals before outlier treatment
plt.subplot(1, 2, 1)
plt.scatter(y_pred, residuals_before, color='blue', alpha=0.5)
plt.axhline(y=0, color='black', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.ylim(-400, 400)
plt.title('Residuals Before Outlier Treatment')

# Plot residuals after outlier treatment
plt.subplot(1, 2, 2)
plt.scatter(y_pred_capped, residuals_after, color='cyan', alpha=0.5)
plt.axhline(y=0, color='black', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.ylim(-400, 400)
plt.title('Residuals After Outlier Treatment')

plt.tight_layout()
plt.show()
Figure 3: Residual Analysis of Energy Production Predictions: Before and After Outlier Treatment with 1st and 99th Percentile Capping | Image by Author.

What to look for:

  • Residuals should be closer to zero and more normally distributed after the treatment.
  • The residual spread should not be as extreme, which means the fit is better.

In our example:

In the right-hand plot, the residuals are overall smaller and more symmetrically distributed around zero, at least for the main clusters of predicted values. The extreme residuals have been reduced, which indicates that after capping the outliers, the model fits the data more consistently. There is also less dispersion, showing a reduced impact of extreme values.

Conclusion: the outlier treatment reduced large residuals and improved the model fit, so the predictions are now more stable and well-balanced.

This demonstrates that capping the extreme values produced a more robust model by reducing the impact of outliers.
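
To back up the scatter plots, you can also overlay the residual distributions themselves and check whether they became tighter and more symmetric around zero. Here is a minimal sketch that reuses the residuals_before and residuals_after series computed above; the bin count is an illustrative choice.

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.hist(residuals_before, bins=50, alpha=0.5, label='Before Outlier Treatment', color='blue')
plt.hist(residuals_after, bins=50, alpha=0.5, label='After Outlier Treatment', color='cyan')
plt.axvline(0, color='black', linestyle='--')
plt.xlabel('Residual (Observed - Predicted)')
plt.ylabel('Frequency')
plt.title('Residual Distributions Before and After Outlier Treatment')
plt.legend()
plt.tight_layout()
plt.show()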

7. Sensitivity Analysis

Sensitivity analysis means changing different factors of your modeling pipeline and observing how the results respond.

For outliers, this means making small changes in how you handle them, such as varying the capping thresholds or switching methods, and then seeing which treatment strategy gives the most stable performance.

Here's an example of the sensitivity analysis using different upper and lower quantiles for capping:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Define the quantiles to test
quantile_values = [(0.01, 0.99), (0.05, 0.95), (0.10, 0.90)]

# Prepare a list to store the results
results_list = []

# Define features and target variable
X = energy_data[['hour', 'day_of_week', 'hour_sin', 'hour_cos', 'day_sin', 'day_cos', 'prev_hour_production', '3hr_moving_avg']]
y = energy_data['Energy_Production']

# Loop through each quantile pair for sensitivity analysis
for lower_q, upper_q in quantile_values:
    # Apply quantile capping to the target variable
    lower_cap = y.quantile(lower_q)
    upper_cap = y.quantile(upper_q)
    y_capped = y.clip(lower=lower_cap, upper=upper_cap)

    # Train-test split with capped data
    X_train, X_test, y_train_capped, y_test_capped = train_test_split(X, y_capped, test_size=0.2, random_state=42)

    # Train the model and calculate predictions
    model = LinearRegression()
    model.fit(X_train, y_train_capped)
    y_pred_capped = model.predict(X_test)

    # Calculate performance metrics
    mse = mean_squared_error(y_test_capped, y_pred_capped)
    r2 = r2_score(y_test_capped, y_pred_capped)

    # Append results to the list
    results_list.append({
        'Lower Quantile': lower_q,
        'Upper Quantile': upper_q,
        'MSE': mse,
        'R²': r2
    })

# Convert the list of results to a DataFrame
sensitivity_results = pd.DataFrame(results_list)

# Display the results of the sensitivity analysis
print("Sensitivity Analysis Results:")
print(sensitivity_results)
Sensitivity Analysis Results:
   Lower Quantile  Upper Quantile          MSE        R²
0            0.01            0.99  4325.025305  0.844548
1            0.05            0.95  1824.866878  0.898331
2            0.10            0.90  1760.571111  0.886728

What to look for:

  • The model results should be consistent over different outlier treatments.
  • If small changes in the quantiles produce large changes in performance, the model is too sensitive to the treatment.

In our example: capping at the 5th and 95th percentiles achieves the best R², striking a good balance between reducing extreme outliers and maintaining the underlying structure.

This setting produces a lower MSE and the highest R², showing that it strikes a better balance for this dataset. The more aggressive thresholds at the 10th and 90th percentiles reduce MSE slightly further but also reduce R², reflecting diminishing returns.

Overall, capping at the 5th and 95th percentiles gives the most stable and reliable performance: it reduces the impact of the extreme values while retaining enough of the natural variation in the data.
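
If you need to communicate these trade-offs, a small chart of the sensitivity results is often clearer than the raw table. Here is a minimal sketch using the sensitivity_results DataFrame from above; the bar-plus-line layout is just one illustrative choice.

import matplotlib.pyplot as plt

# Label each capping scheme, e.g. "1%-99%"
labels = [f"{lq:.0%}-{uq:.0%}" for lq, uq in
          zip(sensitivity_results['Lower Quantile'], sensitivity_results['Upper Quantile'])]
x = range(len(labels))

fig, ax1 = plt.subplots(figsize=(8, 4))

# Bars for MSE on the left axis
ax1.bar(x, sensitivity_results['MSE'], color='steelblue', alpha=0.7)
ax1.set_xticks(list(x))
ax1.set_xticklabels(labels)
ax1.set_xlabel('Capping Quantiles')
ax1.set_ylabel('MSE')

# Line for R² on a secondary axis so both metrics share one chart
ax2 = ax1.twinx()
ax2.plot(list(x), sensitivity_results['R²'], color='darkorange', marker='o')
ax2.set_ylabel('R²')

ax1.set_title('Sensitivity of Model Performance to Capping Thresholds')
plt.tight_layout()
plt.show()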

8. Feature Importance Analysis

When working with tree-based models, outliers can disproportionately affect feature importance.

To handle this, compare feature importance before and after the outlier treatment to verify whether the most important features remain stable.

Here's how you can do this (this example uses placeholder data, not our dataset):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# X_mock and y_mock are placeholders for your own features and target.
# For illustration only: generate a small synthetic dataset so the snippet runs end to end.
rng = np.random.default_rng(42)
X_mock = pd.DataFrame(rng.normal(size=(1000, 4)), columns=['feat_1', 'feat_2', 'feat_3', 'feat_4'])
y_mock = pd.Series(3 * X_mock['feat_1'] + X_mock['feat_2'] + rng.normal(scale=0.5, size=1000), name='target')

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_mock, y_mock, test_size=0.2, random_state=42)

# Train Random Forest model on original data
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)

# Calculate feature importance before outlier treatment
feature_importance_before = pd.Series(rf_model.feature_importances_, index=X_mock.columns)

# Apply outlier treatment (capping at 5th and 95th percentiles)
lower_cap = y_mock.quantile(0.05)
upper_cap = y_mock.quantile(0.95)
y_capped = y_mock.clip(lower=lower_cap, upper=upper_cap)

# Train-test split on capped target
X_train_capped, X_test_capped, y_train_capped, y_test_capped = train_test_split(X_mock, y_capped, test_size=0.2, random_state=42)

# Train Random Forest model on capped data
rf_model.fit(X_train_capped, y_train_capped)

# Calculate feature importance after outlier treatment
feature_importance_after = pd.Series(rf_model.feature_importances_, index=X_mock.columns)

# Combine and display feature importance for comparison
feature_importance_comparison = pd.DataFrame({
    'Feature Importance Before Outlier Treatment': feature_importance_before,
    'Feature Importance After Outlier Treatment': feature_importance_after,
    'Absolute Change': (feature_importance_before - feature_importance_after).abs()
})

print("Feature Importance Comparison (Random Forest):")
print(feature_importance_comparison.sort_values(by='Absolute Change', ascending=False))

What to look for:

  • Large changes in feature importance might signal that certain features were being overly influenced by outliers.
  • Ideally, feature importance would remain similar before and after treatment, which would imply that the outliers were not driving your model.
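
To make this comparison easier to present, you can also plot the two importance vectors side by side. Here is a minimal sketch using the feature_importance_comparison DataFrame from the snippet above.

import matplotlib.pyplot as plt

# Side-by-side bars make shifts in importance easier to spot than the raw table
feature_importance_comparison[
    ['Feature Importance Before Outlier Treatment', 'Feature Importance After Outlier Treatment']
].plot(kind='bar', figsize=(10, 5), color=['blue', 'cyan'], alpha=0.7)

plt.ylabel('Feature Importance')
plt.title('Random Forest Feature Importance Before and After Outlier Treatment')
plt.tight_layout()
plt.show()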

Wrapping It All Up

In this article, we have explored some of the most popular methods for evaluating the impact of outlier treatment in time-series data.

We now know this evaluation matters because it tells us how the treatment affects model performance and data distribution.

Starting with a comparison of basic statistics, we saw how fundamental estimates such as the mean and standard deviation give a quick picture of the distributional shift caused by outlier capping.

Visual inspection and distribution comparison with the Kolmogorov-Smirnov test then help quantify whether the treatment shifted the distribution significantly.

We saw how comparing key metrics like MSE and R² pre- and post-treatment, along with time-series cross-validation, helps us make the best judgment about the treatment.

Other major steps include residual analysis, where outlier treatment should reduce residuals and improve model fit; sensitivity analysis, to check the consistency of results across different treatment thresholds; and, last but not least, feature importance analysis, to verify that core predictors remain stable despite the adjustments to outliers.


If you found value in this post, I'd appreciate your support with a clap. You're also welcome to follow me on Medium for similar articles!

Book a call with me, ask me a question or send me your resume here:

Sara's Data Science Free Resources


