Simulated Data, Real Learnings: Part 1

Simulation is a powerful tool in the data science toolbox. This is the first part of a multi-part series that discusses various ways simulation can be useful in data science and machine learning. In this article, we will cover how simulation can be used to test and evaluate machine learning approaches.
Specifically, we'll go over the three uses below:
- Testing machine learning approaches
- Comparing the performance of different machine learning models
- Evaluating model behavior in various circumstances
Before diving into this specific application of data simulation, let's define simulation.
WHAT IS DATA SIMULATION?
The definition of data simulation is pretty simple – it is the creation of fictitious data that mimics the properties of real-world data.
When do we want to simulate data?
- when we want to have the 'answer' to questions that are not observable in the real world. With real-world data, we can only infer the relationship between X and y; with simulated data, we create the relationship between X and y. With this 'answer' in hand, we can test whether our machine learning and analytical approaches discover the relationship we simulated
- when we don't have real data or we have very limited data
- when we want to simulate things that have never happened before
Simulated data is often created using some amount of randomness. We typically draw that randomness from probability distributions based on observed data or domain knowledge. For example, if we want to simulate the productivity of orange trees, we could randomly draw from a distribution of orange tree productivity. We could create the probability distribution through observation (if we have a dataset of many orange trees' productivity), or we could draw from a statistical distribution that represents orange tree productivity – e.g. orange tree productivity is normally distributed with a mean of 150 lbs and a standard deviation of 24 lbs (I totally made this up, don't fact check me!).
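As a minimal sketch, drawing simulated orange tree productivities from that made-up distribution could look like this:
import numpy as np

# draw 1,000 simulated orange tree productivities from the
# made-up distribution above: mean 150 lbs, std dev 24 lbs
simulated_yields = np.random.normal(150, 24, 1000)
print(simulated_yields.mean(), simulated_yields.std())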
It is important to note that for our data simulation to be useful, we always have to be comfortable with some assumptions. The specific assumptions will vary case by case, but the theme will be that our simulation is representative of the conditions we will likely see in the 'real world.' In the example of the orange trees, if we use a historical dataset to create our simulation, we need to assume that the dataset we used matches the expectations for our orange trees. We would need to check that they are the same species of tree, grown in similar conditions, etc.
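One hedged sketch of how such a representativeness check might look: compare a historical sample against a reference sample from our own trees with a two-sample Kolmogorov-Smirnov test. The array names and numbers here are hypothetical stand-ins, not real orange data.
import numpy as np
from scipy import stats

# hypothetical yield samples (lbs per tree); in practice these would be
# the historical dataset and a measured sample from the trees we care about
historical_yields = np.random.normal(150, 24, 300)
our_tree_yields = np.random.normal(150, 24, 60)

# a small p-value would suggest the historical data does not match
# the yield distribution of our trees
stat, p_value = stats.ks_2samp(historical_yields, our_tree_yields)
print(f'KS statistic = {stat:.3f}, p-value = {p_value:.3f}')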
With an understanding of what data simulation is, we'll dive into three simulation use cases.
TESTING MACHINE LEARNING APPROACHES
Typically, when we use machine learning, we are asking a question for which we do not know the answer. For example, what is the relationship between our input data X and our target variable y? Our machine learning technique seeks to answer this question. But what if we want to check whether our machine learning technique is finding the correct answer? We can't do that with real data, because we don't have the answer. One way to check the validity of a machine learning process is by 'testing' the technique with simulated data.
We can check a machine learning technique with simulated data by creating the relationship that we want the technique to find. Now, we don't rely on the machine learning technique to tell us the relationship between X and y – we know the relationship, because we created it! We are focusing on what the machine learning model finds to verify that it finds the relationship we know is correct.
As a learning exercise, let's do this for linear regression. Below is the 'answer' that we are going to simulate. We will do so by making three random variables and some noise. We'll then take a linear combination of the random variables and the noise to create a simulated target variable (sim_y). Once we have the simulated data, we'll pass it through a linear regression to see if the linear model is able to capture the relationships we created.

sim_y = 0.5*x1 - 0.1*x2 + 3*x3 + noise
Below is the code to run the simple exercise:
Python">import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# set up number of simulated samples to make
n = 500
# create noise and random variables
x1 = np.random.binomial(1, 0.4, n)
x2 = np.random.uniform(-5, 5, n)
x3 = np.random.normal(0, 2.5, n)
noise = np.random.normal(0, 0.25, n)
df = pd.DataFrame({'x1' : x1,
'x2' : x2,
'x3' : x3,
'noise' : noise})
# simulate target variable
sim_y = 0.5*x1 + -0.1*x2 + 3*x3 + noise
# train linear regression on simulated data
lr = LinearRegression()
lr.fit(df[['x1', 'x2', 'x3']], sim_y)
print(lr.coef_)
Running this prints the three estimated coefficients. The exact values vary from run to run, but they should land close to the 0.5, -0.1, and 3 we used to build sim_y. In other words, linear regression gives us a pretty close estimate of the coefficients we modelled. Instead of just doing one simulation, let's run many and look at the distributions of the coefficients we generate.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt

# number of samples per simulation and number of simulations
n = 500
num_simulations = 5000

# create a list of lists to hold the three coefficients
coef_list = [[] for _ in range(3)]

# run the simulation multiple times
for i in range(num_simulations):
    # create random variables
    x1 = np.random.binomial(1, 0.4, n)
    x2 = np.random.uniform(-5, 5, n)
    x3 = np.random.normal(0, 2.5, n)
    noise = np.random.normal(0, 0.25, n)
    df = pd.DataFrame({'x1' : x1,
                       'x2' : x2,
                       'x3' : x3,
                       'noise' : noise})
    # simulate target variable
    sim_y = 0.5*x1 + -0.1*x2 + 3*x3 + noise
    # train model for this iteration
    lr = LinearRegression()
    lr.fit(df[['x1', 'x2', 'x3']], sim_y)
    coef_list[0].append(lr.coef_[0])
    coef_list[1].append(lr.coef_[1])
    coef_list[2].append(lr.coef_[2])

# plot histograms of the three coefficient distributions
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
axs[0].hist(coef_list[0], bins=30, color='blue', alpha=0.7)
axs[0].set_title('X1')
axs[1].hist(coef_list[1], bins=30, color='green', alpha=0.7)
axs[1].set_title('X2')
axs[2].hist(coef_list[2], bins=30, color='red', alpha=0.7)
axs[2].set_title('X3')
plt.tight_layout()
plt.show()

The distributions of our coefficients have means that correspond to the values we used to simulate our data. So, from our simulation, we can conclude that linear regression is able to accurately capture the underlying relationships in our simulated data!
I chose linear regression for this example because of its simplicity and universality. Of course, there is no need to test that linear regression can pick up on linear trends; this has been well validated for decades. This technique can be very effective when testing out relatively new or unvalidated ML approaches. It can also be effective when developing more custom solutions for problems. E.g., if you are ensembling multiple ML models together, you can validate that the ensemble captures the relationships you want it to.
In summary, we can use simulation to essentially give our ML models a test that we have the answer key to. This can give us valuable insight into the model's ability to capture the kinds of relationships we simulate.
Side note: I did a very similar exercise for logistic regression in an article I wrote about how to interpret logistic regression coefficients.
COMPARING DIFFERENT MACHINE LEARNING MODEL PERFORMANCES
Simulation gives us a controlled environment to compare machine learning models. In most cases, we want to know how different modeling approaches size up against each other on a specific dataset, and in that case simulation is not necessary. But when we want a generalized understanding of how models stack up against each other, or we don't have a training dataset, simulation comes to the rescue!
Let's walk through how simulation can help us compare ML approaches.
As an example, we'll simulate some data to compare linear regression and XGBoost. With real-world data, it has been my experience that XGBoost almost always beats linear regression. However, linear regression should beat XGBoost if the training data are truly linear. We are going to test this using simulation!
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot as plt

# number of rows to simulate
n = 500

def simulate_linear_data(n):
    '''
    Create a linear-based simulation with 3 variables

    input
        n (int) : number of rows to simulate
    output
        sim_df (dataframe) : dataframe with simulated data
    '''
    x1 = np.random.binomial(1, 0.4, n)
    x2 = np.random.uniform(-5, 5, n)
    x3 = np.random.normal(0, 2.5, n)
    noise = np.random.normal(0, 0.25, n)
    sim_y = 0.5*x1 + -0.1*x2 + 3*x3 + noise
    sim_df = pd.DataFrame({'x1' : x1,
                           'x2' : x2,
                           'x3' : x3,
                           'noise' : noise,
                           'sim_y' : sim_y})
    return sim_df

# create simulated train and test data
train_df = simulate_linear_data(n)
test_df = simulate_linear_data(n)

# run xgb model
xgb_reg = XGBRegressor()
xgb_reg.fit(train_df[['x1', 'x2', 'x3']], train_df['sim_y'])
xgb_y_pred = xgb_reg.predict(test_df[['x1', 'x2', 'x3']])
xgb_rmse = np.sqrt(mean_squared_error(test_df['sim_y'], xgb_y_pred))
print(f'XGB RMSE = {xgb_rmse}')

# run linear regression model
lr = LinearRegression()
lr.fit(train_df[['x1', 'x2', 'x3']], train_df['sim_y'])
lr_y_pred = lr.predict(test_df[['x1', 'x2', 'x3']])
lr_rmse = np.sqrt(mean_squared_error(test_df['sim_y'], lr_y_pred))
print(f'LR RMSE = {lr_rmse}')

Just as we thought, linear regression outperforms XGBoost on the linear data. These results show how the approaches compare with XGBoost's default parameters. Let's see if we can tune XGBoost to do a better job of capturing linear relationships.
Here is some code to run a simple grid search over our XGBoost model's parameters on the linear data. Let's see how this impacts the performance of the model.
# try tuning the XGB to improve performance with a grid search
max_depth_list = range(1, 9)
eta_list = [0.001, 0.01, 0.1, 0.3, 0.5, 0.8]
n_estimators_list = [1, 5, 10, 25, 50, 100, 500, 1000, 1500, 2000]

# collect one row of results per parameter combination
results = []
for max_depth in max_depth_list:
    for eta in eta_list:
        for n_estimators in n_estimators_list:
            temp_xgb = XGBRegressor(max_depth = max_depth,
                                    eta = eta,
                                    n_estimators = n_estimators)
            temp_xgb.fit(train_df[['x1', 'x2', 'x3']], train_df['sim_y'])
            xgb_y_pred = temp_xgb.predict(test_df[['x1', 'x2', 'x3']])
            xgb_rmse = np.sqrt(mean_squared_error(test_df['sim_y'],
                                                  xgb_y_pred))
            results.append({'max_depth' : max_depth,
                            'eta' : eta,
                            'n_estimators' : n_estimators,
                            'rmse' : xgb_rmse})

xgb_rmse_df = pd.DataFrame(results).sort_values(by = 'rmse')
print(xgb_rmse_df.head(10))

While the parameter tuning improves XGBoost's performance, it still can't beat the linear regression. Thanks to the simulation we ran, we now see a situation in which linear regression beats XGBoost!
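For completeness, here's a hedged sketch of the opposite setup: if we simulate a nonlinear target, we'd expect XGBoost's flexibility to win out. The interaction and squared terms below are my own illustrative choices, not part of the analysis above.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

def simulate_nonlinear_data(n):
    # same input variables as before, but the target now includes
    # an interaction and a squared term that a plain linear
    # regression cannot represent
    x1 = np.random.binomial(1, 0.4, n)
    x2 = np.random.uniform(-5, 5, n)
    x3 = np.random.normal(0, 2.5, n)
    noise = np.random.normal(0, 0.25, n)
    sim_y = 0.5*x1 + 0.5*x2*x3 + 0.3*x3**2 + noise
    return pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3, 'sim_y': sim_y})

train_df = simulate_nonlinear_data(500)
test_df = simulate_nonlinear_data(500)

# fit both models and compare out-of-sample RMSE
for name, model in [('XGB', XGBRegressor()), ('LR', LinearRegression())]:
    model.fit(train_df[['x1', 'x2', 'x3']], train_df['sim_y'])
    pred = model.predict(test_df[['x1', 'x2', 'x3']])
    print(name, np.sqrt(mean_squared_error(test_df['sim_y'], pred)))
Under this assumption, the roles should flip, with XGBoost posting the lower RMSE.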
EVALUATING MODEL BEHAVIOR IN VARIOUS CIRCUMSTANCES
We'll continue the linear regression example for this portion of the article as well. While we don't need simulation to test well-validated approaches (like linear regression) under ideal circumstances, we can use it to test whether an approach holds up under less-than-ideal conditions. Doing this gives us an idea of how robust our ML approach is to possible future data challenges.
Continuing our linear regression simulation work, we'll go through four examples of less-than-ideal circumstances we can test for (note, this is not a comprehensive list of the things you can check!):
- Limited sample size
- Outlier/extreme values
- Data relationship drift
- Increase in noise
Limited Sample Size
Let's say that we are planning to develop a model but expect a limited sample size. We can test how well our model picks up on trends with fewer data points by simulating smaller datasets. We'll simplify this portion of the example by only looking at the coefficient for X1.
The code below iteratively creates simulations of small sample sizes, starting at 5 observations and going up to 45 in steps of 10. It simulates 5,000 datasets for each sample-size level.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt

# list of different numbers of observations
ns = [5, 15, 25, 35, 45]
# number of simulations per sample size
num_simulations = 5000

coef_list_x1 = []
coef_list_std = []

# iterate through numbers of observations
for n in ns:
    temp_coef_list = []
    # create multiple simulations
    for i in range(num_simulations):
        # create random variables
        x1 = np.random.binomial(1, 0.4, n)
        x2 = np.random.uniform(-5, 5, n)
        x3 = np.random.normal(0, 2.5, n)
        noise = np.random.normal(0, 0.25, n)
        df = pd.DataFrame({'x1' : x1,
                           'x2' : x2,
                           'x3' : x3,
                           'noise' : noise})
        # simulate target variable
        sim_y = 0.5*x1 + -0.1*x2 + 3*x3 + noise
        # train model
        lr = LinearRegression()
        lr.fit(df[['x1', 'x2', 'x3']], sim_y)
        temp_coef_list.append(lr.coef_[0])
    # save summary stats for this sample size
    temp_coef_mean = np.mean(temp_coef_list)
    temp_std = np.std(temp_coef_list)
    coef_list_x1.append(temp_coef_mean)
    coef_list_std.append(temp_std)

# plot data
plt.plot(ns, coef_list_x1, marker = 'o', label = 'Avg X1 Coef')
plt.plot(ns, coef_list_std, marker = 'o', label = 'STD of X1 Coef')
plt.title('X1 Coef Distribution Metrics for 5k Simulation Runs')
plt.legend()
plt.show()

We can see that after about 15 observations, linear regression captures the average coefficient well, and after about 35 observations, we have a fairly low standard deviation of the coefficient for X1. This would be valuable knowledge if we were looking to create a linear model with similar data and limited observations.
We could go further and run more simulations to deduce the relationship between noise and the required sample size. We could answer questions like 'if we expect our noise to have a given standard deviation, how many observations do we need to pin down X1's coefficient with acceptable precision?' A sketch of that experiment follows.
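Here's a rough sketch of that follow-up experiment, assuming the same simulation setup as above; the specific noise levels and sample sizes are arbitrary choices for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# sweep noise levels and sample sizes together and record
# how spread out the X1 coefficient estimates are
noise_levels = [0.25, 1, 5]
sample_sizes = [15, 45, 135]
num_simulations = 1000

results = []
for noise_std in noise_levels:
    for n in sample_sizes:
        coefs = []
        for _ in range(num_simulations):
            x1 = np.random.binomial(1, 0.4, n)
            x2 = np.random.uniform(-5, 5, n)
            x3 = np.random.normal(0, 2.5, n)
            noise = np.random.normal(0, noise_std, n)
            sim_y = 0.5*x1 + -0.1*x2 + 3*x3 + noise
            lr = LinearRegression()
            lr.fit(np.column_stack([x1, x2, x3]), sim_y)
            coefs.append(lr.coef_[0])
        results.append({'noise_std' : noise_std,
                        'n' : n,
                        'x1_coef_mean' : np.mean(coefs),
                        'x1_coef_std' : np.std(coefs)})

print(pd.DataFrame(results))
The printed table would show, for each noise level, roughly how large n must be before the standard deviation of the X1 estimates shrinks to a level we're comfortable with.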
Extreme Values
Often, we are interested in how our model makes predictions on extreme values. Simulation can help us understand this.
In the example below, we will simulate extreme values for X3. We will do this by creating a test dataset generated the same way as the training dataset, and an 'extreme' dataset where the mean of X3's distribution is shifted from 0 up to 3, more than a full standard deviation above the training mean. This will give us an idea of how robust our model is to extreme input values (note that the relationship of X3 to the target variable is unchanged). This time, instead of looking at the distributions of the estimated coefficients, we will look at the RMSE distributions of the test and extreme datasets.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error

# number of training samples and number of simulations
n = 500
num_simulations = 5000

test_rmse_list = []
extreme_rmse_list = []

# simulate extreme values
for i in range(num_simulations):
    # train data
    x1 = np.random.binomial(1, 0.4, n)
    x2 = np.random.uniform(-5, 5, n)
    x3 = np.random.normal(0, 2.5, n)
    noise = np.random.normal(0, 0.25, n)
    df = pd.DataFrame({'x1' : x1,
                       'x2' : x2,
                       'x3' : x3,
                       'noise' : noise})
    sim_y = 0.5*x1 + -0.1*x2 + 3*x3 + noise

    # test data - same distributions as training
    test_size = int(n/5)
    x1_test = np.random.binomial(1, 0.4, test_size)
    x2_test = np.random.uniform(-5, 5, test_size)
    x3_test = np.random.normal(0, 2.5, test_size)
    noise_test = np.random.normal(0, 0.25, test_size)
    sim_y_test = 0.5*x1_test + -0.1*x2_test + 3*x3_test + noise_test
    df_test = pd.DataFrame({'x1' : x1_test,
                            'x2' : x2_test,
                            'x3' : x3_test,
                            'noise' : noise_test})

    # extreme data - X3's mean is shifted from 0 to 3
    x1_extreme = np.random.binomial(1, 0.4, test_size)
    x2_extreme = np.random.uniform(-5, 5, test_size)
    x3_extreme = np.random.normal(3, 2.5, test_size)
    noise_extreme = np.random.normal(0, 0.25, test_size)
    sim_y_extreme = (0.5*x1_extreme + -0.1*x2_extreme +
                     3*x3_extreme + noise_extreme)
    df_extreme = pd.DataFrame({'x1' : x1_extreme,
                               'x2' : x2_extreme,
                               'x3' : x3_extreme,
                               'noise' : noise_extreme})

    # fit on training data, score on test and extreme data
    lr = LinearRegression()
    lr.fit(df[['x1', 'x2', 'x3']], sim_y)
    extreme_pred = lr.predict(df_extreme[['x1', 'x2', 'x3']])
    extreme_rmse = np.sqrt(mean_squared_error(sim_y_extreme, extreme_pred))
    test_pred = lr.predict(df_test[['x1', 'x2', 'x3']])
    test_rmse = np.sqrt(mean_squared_error(sim_y_test, test_pred))
    test_rmse_list.append(test_rmse)
    extreme_rmse_list.append(extreme_rmse)

plt.hist(test_rmse_list, alpha = 0.5, label = 'test rmse', density=True)
plt.hist(extreme_rmse_list, alpha = 0.5,
         label = 'extreme value rmse', density=True)
plt.legend()
plt.show()

As we can see, our model holds up very well to extreme input values. The two RMSE distributions are very similar. This is expected for linear regression, because it extrapolates linear relationships very well. Other ML approaches would likely struggle more, and we could get an idea of how much worse they would perform by running a simulation like this.
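As a hedged sketch of that idea, we can rerun the extreme-value comparison with XGBoost. Because tree ensembles predict roughly constant values outside their training range, we'd expect the extreme-value RMSE to degrade far more than it did for linear regression. Shifting X3's mean all the way to 10 is my own exaggerated choice, made to push well outside the training range.
import numpy as np
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

n, test_size = 500, 100

def simulate_xy(size, x3_mean):
    # same linear relationships as above; only X3's mean moves
    x1 = np.random.binomial(1, 0.4, size)
    x2 = np.random.uniform(-5, 5, size)
    x3 = np.random.normal(x3_mean, 2.5, size)
    noise = np.random.normal(0, 0.25, size)
    y = 0.5*x1 + -0.1*x2 + 3*x3 + noise
    return np.column_stack([x1, x2, x3]), y

# train on data centered at X3 mean 0
X_train, y_train = simulate_xy(n, 0)
xgb = XGBRegressor()
xgb.fit(X_train, y_train)

# score on in-distribution data vs data far outside the training range
for label, x3_mean in [('test', 0), ('extreme', 10)]:
    X_eval, y_eval = simulate_xy(test_size, x3_mean)
    rmse = np.sqrt(mean_squared_error(y_eval, xgb.predict(X_eval)))
    print(f'{label} RMSE = {rmse:.2f}')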
Data Relationship Drift
Here, I'm defining data relationship drift as a change in the underlying relationships that define a system. As an example, think about a customer's willingness to pay for an MP3 player. In the late 90's and early 2000's, customers were willing to pay a lot of money for MP3 players, but as phones started playing music, the relationship between consumers and MP3 players started to drift. Sensitivity to this type of change is what the simulation in this section is meant to capture.
In the simulation, we will focus on the variable X3 again. We will modify the relationship by applying four multiplicative factors to X3's coefficient in an out-of-sample dataset. We will then compare the RMSE distributions of the out-of-sample dataset with the relationship drift and the out-of-sample dataset without it.
The code to do this is below:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error

# number of training samples and simulations per shift factor
n = 500
num_simulations = 5000

# multiplicative factors applied to X3's coefficient
x3_shift_list = [0.90, 0.95, 1.05, 1.10]
test_rmse_dict = {}
drift_rmse_dict = {}

# simulate relationship drift
for j in x3_shift_list:
    test_rmse_list = []
    drift_rmse_list = []
    for i in range(num_simulations):
        # train data
        x1 = np.random.binomial(1, 0.4, n)
        x2 = np.random.uniform(-5, 5, n)
        x3 = np.random.normal(0, 2.5, n)
        noise = np.random.normal(0, 0.25, n)
        df = pd.DataFrame({'x1' : x1,
                           'x2' : x2,
                           'x3' : x3,
                           'noise' : noise})
        sim_y = 0.5*x1 + -0.1*x2 + 3*x3 + noise

        # test data - same relationships as training
        test_size = int(n/5)
        x1_test = np.random.binomial(1, 0.4, test_size)
        x2_test = np.random.uniform(-5, 5, test_size)
        x3_test = np.random.normal(0, 2.5, test_size)
        noise_test = np.random.normal(0, 0.25, test_size)
        sim_y_test = 0.5*x1_test + -0.1*x2_test + 3*x3_test + noise_test
        df_test = pd.DataFrame({'x1' : x1_test,
                                'x2' : x2_test,
                                'x3' : x3_test,
                                'noise' : noise_test})

        # drifted data - X3's coefficient is multiplied by j
        x1_shifted = np.random.binomial(1, 0.4, test_size)
        x2_shifted = np.random.uniform(-5, 5, test_size)
        x3_shifted = np.random.normal(0, 2.5, test_size)
        noise_shifted = np.random.normal(0, 0.25, test_size)
        sim_y_shifted = (0.5*x1_shifted + -0.1*x2_shifted +
                         3*j*x3_shifted + noise_shifted)
        df_shifted = pd.DataFrame({'x1' : x1_shifted,
                                   'x2' : x2_shifted,
                                   'x3' : x3_shifted,
                                   'noise' : noise_shifted})

        # fit on training data, score on test and drifted data
        lr = LinearRegression()
        lr.fit(df[['x1', 'x2', 'x3']], sim_y)
        shifted_pred = lr.predict(df_shifted[['x1', 'x2', 'x3']])
        shifted_rmse = np.sqrt(mean_squared_error(sim_y_shifted, shifted_pred))
        test_pred = lr.predict(df_test[['x1', 'x2', 'x3']])
        test_rmse = np.sqrt(mean_squared_error(sim_y_test, test_pred))
        test_rmse_list.append(test_rmse)
        drift_rmse_list.append(shifted_rmse)
    test_rmse_dict[j] = test_rmse_list
    drift_rmse_dict[j] = drift_rmse_list
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

# use a shared x-axis range across all four panels
all_rmses = []
for j in x3_shift_list:
    all_rmses += test_rmse_dict[j] + drift_rmse_dict[j]
data_range = (min(all_rmses), max(all_rmses))

# one panel per shift factor
for ax, j in zip(axes.flatten(), x3_shift_list):
    ax.hist(test_rmse_dict[j], bins=30, color='blue',
            alpha=0.5, range = data_range, label = 'Test RMSE')
    ax.hist(drift_rmse_dict[j], bins=30, color='green',
            alpha=0.5, range = data_range,
            label = 'Shifted Relationship RMSE')
    ax.legend()
    ax.set_title(f'Shifted relationship {j:.2f} multiplier for X3')

plt.tight_layout()
plt.show()

We can see from the simulation that our model is sensitive to relationship drift: our errors have a higher mean and variance! Through simulations like this, we can understand what size of shift will hurt our predictions and by roughly how much. Based on those learnings, we can make decisions like how often to monitor our model and how often to refresh it.
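As one hedged follow-up, a quick summary of the average RMSE inflation per shift factor can make those decisions concrete. This snippet assumes the test_rmse_dict and drift_rmse_dict built in the block above are still in scope.
import numpy as np

# assumes test_rmse_dict and drift_rmse_dict from the block above
for factor in [0.90, 0.95, 1.05, 1.10]:
    base = np.mean(test_rmse_dict[factor])
    drifted = np.mean(drift_rmse_dict[factor])
    print(f'factor {factor}: test RMSE {base:.3f}, drifted RMSE {drifted:.3f}')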
More Noise
The last example we will explore is stress testing our model with more noise. This answers the question 'how would our model perform in a noisier environment?' Note that we are talking about random noise here, not noise introduced by a new variable – of course, we could simulate that as well, but we'll save it for another time!
In the code below, I'm simulating multiple noise levels and looking at the resulting distributions of X1's estimated coefficient. We'll just look at X1 this time, again for simplicity. Our goal is to see how increased noise impacts our model's ability to capture the true relationship between X1 and the target variable.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt

# noise standard deviations to test
noise_levels = [0.25, 1, 5, 10]
n = 500
num_simulations = 5000

coef_dict = {}
for noise_level in noise_levels:
    temp_coef_list = []
    for i in range(num_simulations):
        x1 = np.random.binomial(1, 0.4, n)
        x2 = np.random.uniform(-5, 5, n)
        x3 = np.random.normal(0, 2.5, n)
        noise = np.random.normal(0, noise_level, n)
        df = pd.DataFrame({'x1' : x1,
                           'x2' : x2,
                           'x3' : x3,
                           'noise' : noise})
        sim_y = 0.5*x1 + -0.1*x2 + 3*x3 + noise
        # fit model and record X1's coefficient
        lr = LinearRegression()
        lr.fit(df[['x1', 'x2', 'x3']], sim_y)
        temp_coef_list.append(lr.coef_[0])
    coef_dict[noise_level] = temp_coef_list

# plot the four coefficient distributions on a shared range
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))
data_min = min(min(coef_dict[k]) for k in noise_levels)
data_max = max(max(coef_dict[k]) for k in noise_levels)
data_range = (data_min, data_max)

colors = ['blue', 'green', 'red', 'orange']
for ax, noise_level, color in zip(axes.flatten(), noise_levels, colors):
    ax.hist(coef_dict[noise_level], bins=30, color=color,
            alpha=0.5, range = data_range)
    ax.set_title(f'Error with STD = {noise_level}')

plt.tight_layout()
plt.show()

We get some great learnings from this simulation. The first takeaway is that our model is still unbiased, meaning the averages of the distributions stay close to the 0.5 we simulated. This is good news! We do, however, have bad news: as you can see, the variance of the distributions increases with the noise – this definitely makes sense! With higher noise, 0.5 is still the most likely value for our model to find, but the probability of landing further from 0.5 grows as the noise grows. We can use learnings from this type of simulation to get an idea of how much confidence we should have in our model's point estimates.
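One hedged way to quantify that confidence is an empirical 95% interval of the X1 estimates at each noise level. This snippet assumes the coef_dict built in the block above is still in scope.
import numpy as np

# assumes coef_dict from the block above; wider intervals mean
# less confidence in any single point estimate
for noise_level, coefs in coef_dict.items():
    low, high = np.percentile(coefs, [2.5, 97.5])
    print(f'noise std {noise_level}: 95% of X1 estimates in [{low:.3f}, {high:.3f}]')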
WRAP UP
Simulation can be a powerful and effective tool for learning about the machine learning approaches we take to solve specific problems. We can learn whether our models pick up on the types of trends we are looking to capture, how our models compare to other models, and how our models handle various less-than-ideal modeling circumstances. While we looked at a few specific examples to cover the areas mentioned above, simulation is very flexible and can be modified to give the specific learnings you are looking for!