Deal with Missing Data Like a Pro: Multivariate and Iterative Imputation Algorithms
Real-world data is often messy and requires careful preprocessing before it can be used in any machine learning (ML) model. We almost always face null values in our datasets, values that could have been highly valuable for our analysis or modelling if they had been observed. We refer to this as missingness in the data.
There can be various reasons behind the missingness, such as a malfunctioning device, a non-mandatory field in an ERP system, or a survey question that does not apply to some participants. Depending on the reason, the nature of the missingness also varies. How to understand this nature is explained in detail in my previous article. In this article, the focus is mostly on how to handle this missingness properly, by deletion or imputation, without introducing bias or losing critical insights.
The Red Wine Quality data from the UCI Machine Learning Repository is used in this article [1]. It is an open-source dataset that can be downloaded through this link.
It is essential to understand the nature of the missingness (MCAR, MAR, MNAR) to decide on the correct handling methodology. Therefore, if you think you need more information on that, I suggest you first read my previous article.
1. Deletion Methods
There are two deletion methods, both of which should be used carefully in order not to lose any important information:
Listwise deletion: The rows with any missing data are completely deleted for simplification.
Pairwise deletion: Missingness is evaluated separately for each particular analysis or model. All the available data is used instead of deleting the rows entirely.
Deletion methods can be applied either when the number of missing values is very small, so deleting them will not affect the analysis, or when the number of missing values is very high (> 50%), so it is not possible to make an accurate prediction of the missing values based on the available data.
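As a minimal illustration on a small, made-up DataFrame (not the wine data), listwise deletion removes whole rows, while pairwise statistics such as pandas' default correlation use all available pairs of observations:

import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [10.0, np.nan, 30.0, 40.0],
    "C": [5.0, 6.0, 7.0, 8.0],
})

# Listwise deletion: any row containing a NaN is removed entirely
listwise = demo.dropna()

# Pairwise deletion: each pairwise statistic uses all rows available
# for that particular pair of columns (pandas does this by default)
pairwise_corr = demo.corr()

print(listwise)
print(pairwise_corr)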
2. Imputation Methods
Instead of deleting or excluding the missing data, we can estimate the most likely value of the missing point and impute it. We can classify imputation methods into two groups:
Univariate imputation replaces the null values by considering only the values of the same, single variable that has missing data. These methods can be used when the data is Missing Completely at Random (MCAR). Remember, in the MCAR case there is nothing systematic or related to other variables about the missingness.
Multivariate imputation, on the other hand, considers other variables in the dataset to predict the most probable value to impute. It does not solely focus on the variable with the missing data.

Univariate imputation methods are more self-explanatory and easier to implement. Many good articles are already available that you can consult for more information about these techniques. Moreover, we can expect multivariate imputation methods to offer better results, for example in downstream feature selection, compared to basic techniques [2]. This is why we will focus on multivariate imputation algorithms in this article.
You can see short descriptions of the univariate imputation techniques below (a minimal pandas sketch follows the list):
Mean/Median Imputation: replacing missing values with the mean/median of the observed values. The mean is preferred when the data distribution is (close to) normal; the median is preferred for skewed data.
Mode Imputation: replacing missing values with the most frequent value. It is preferred for categorical data.
Forward/Backward Fill: replacing missing values with the previous or next observed value. It is preferred for time series data.
Random Sample Imputation: replacing missing values with random values drawn from the observed values of the same variable.
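As a quick illustration of the techniques above, here is a minimal sketch on a toy pandas Series (not the wine data):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 100.0])

mean_imputed = s.fillna(s.mean())      # mean imputation
median_imputed = s.fillna(s.median())  # median imputation (robust to the outlier 100)
mode_imputed = s.fillna(s.mode()[0])   # mode imputation (typically for categorical data)
ffill_imputed = s.ffill()              # forward fill (typically for time series)

# Random sample imputation: draw replacements from the observed values
observed = s.dropna()
random_imputed = s.copy()
random_imputed[s.isna()] = np.random.choice(observed, size=s.isna().sum())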
Now, let's start discussing multivariate imputation. In the upcoming sections, we will implement multivariate imputation in Python using LightGBM, kNN, and Autoencoders, relying on scikit-learn together with the LightGBM and TensorFlow/Keras libraries. Regardless of the model we implement, the steps followed will be similar. Therefore, you can follow the same logic and tailor the provided code to apply your preferred prediction model for estimating missing values.
First, let's import all the required packages that will be used in this article:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from lightgbm import LGBMRegressor
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l1_l2
- Extract the data:
# Path to the dataset (local file path or URL)
path = "winequalityred.csv"
# Read the dataset (the UCI file is semicolon-separated)
data = pd.read_csv(path, sep=';', header=0)
print(data.head())

This dataset does not initially have any null values. I will remove random values to prepare a dataset with NaNs to work on:
# Set a random seed for reproducibility
np.random.seed(42)
# Function to set NaNs in random cells
def introduce_random_nans(df, fraction=0.2):
    # Determine the number of cells to replace with NaN
    total_cells = df.size
    num_nans = int(total_cells * fraction)  # Fraction of cells to nullify
    # Randomly select row and column indices
    row_indices = np.random.randint(0, df.shape[0], num_nans)
    col_indices = np.random.randint(0, df.shape[1], num_nans)
    # Introduce NaNs
    for r, c in zip(row_indices, col_indices):
        df.iat[r, c] = np.nan
    return df
# Introduce missing values in 20% of the data
df = introduce_random_nans(data, fraction=0.2)
# Check the resulting DataFrame
print(df.head())

Let's choose the column 'sulphates' for the imputation of missing values. We also need some preprocessing steps. I will write the code as if the data had both numeric and categorical columns, so you can reuse it for other datasets as well.
2.1. Multivariate Imputation using Light Gradient Boosting Machine (LightGBM)
Briefly, LightGBM is an optimized implementation of gradient boosting designed for high performance. It is widely used for modelling tabular data because of its speed and low memory usage. Thus, it is an efficient tool for the imputation of numeric variables.
LGBMClassifier is the classification model for predicting categorical values, while LGBMRegressor is the regression model for estimating numeric values. Although we call it regression, LGBMRegressor does not apply any traditional regression model such as linear, polynomial, or logistic regression; it builds decision trees and combines them within a gradient boosting framework. For more information, you can look at the documentation.
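Before wrapping it into an imputation class, a minimal sketch of the basic fit/predict interface on made-up toy data (not the wine data) could look like this:

import numpy as np
from lightgbm import LGBMRegressor

# Toy regression data, just to illustrate the fit/predict interface
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 2 + X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=200)

model = LGBMRegressor()  # default boosted-tree ensemble
model.fit(X, y)
preds = model.predict(X[:5])
print(preds)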
Let's look at how we can implement it in Python for our dataset, following the steps described in Section 2.
- Build the ImputerLightGBM class:
class ImputerLightGBM:
    def __init__(self, path, target_column, null_fraction=0.2, null_threshold=0.5):
        """
        Class initialization for loading, preprocessing, and imputing data.
        """
        self.path = path
        self.target_column = target_column
        self.null_fraction = null_fraction
        self.null_threshold = null_threshold
        self.data = None
        self.nan_ix = None
        self.scaler = StandardScaler()
        self.model = LGBMRegressor()

    def load_data(self):
        """
        Loads the data from the given file path.
        """
        self.data = pd.read_csv(self.path, sep=';', header=0)
        print("Initial data loaded:")
        print(self.data.head())

    def introduce_random_nans(self):
        """
        Randomly sets a fraction (null_fraction) of the cells to NaN,
        mirroring the standalone helper defined earlier.
        """
        num_nans = int(self.data.size * self.null_fraction)
        row_indices = np.random.randint(0, self.data.shape[0], num_nans)
        col_indices = np.random.randint(0, self.data.shape[1], num_nans)
        for r, c in zip(row_indices, col_indices):
            self.data.iat[r, c] = np.nan

    def drop_high_null_columns(self):
        """
        Drop columns with too many NaNs, based on the null_threshold.
        """
        cols_to_drop = [col for col in self.data.columns if
                        self.data[col].isnull().sum() > self.null_threshold * len(self.data)]
        self.data.drop(columns=cols_to_drop, inplace=True)
        print(f"Dropped columns: {cols_to_drop}")

    def preprocess_column(self):
        """
        Preprocesses the target column by creating an 'is_nan' indicator column for missing values.
        """
        nan_ix = np.where(self.data[self.target_column].isna())[0]
        data = self.data.copy()
        data['is_nan'] = 0
        data.loc[nan_ix, 'is_nan'] = 1
        # Handle categorical columns
        for cat_col in data.select_dtypes(include='object'):
            data[cat_col], _ = pd.factorize(data[cat_col])
        self.data = data
        self.nan_ix = nan_ix
        print(f"Missing index for {self.target_column}: {nan_ix}")

    def split_and_scale(self, val_size=0.2):
        """
        Splits data into train/validation sets and scales it.
        """
        features = self.data.drop([self.target_column, 'is_nan'], axis=1)
        target = self.data[self.target_column]
        # Separate the test rows (nan_ix)
        X_train_and_val = features.drop(index=self.nan_ix)
        X_test = features.loc[self.nan_ix]
        y_train_and_val = target.drop(index=self.nan_ix)
        # Split the train and validation set
        X_train, X_val, y_train, y_val = train_test_split(
            X_train_and_val, y_train_and_val, test_size=val_size, random_state=42
        )
        # Scale features
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_val_scaled = self.scaler.transform(X_val)
        X_test_scaled = self.scaler.transform(X_test)
        print("Data split and scaling complete.")
        return X_train_scaled, y_train, X_val_scaled, y_val, X_test_scaled

    def fit_and_validate(self, X_train_scaled, y_train, X_val_scaled, y_val):
        """
        Train the model and compute validation RMSE.
        """
        self.model.fit(X_train_scaled, y_train)
        y_val_pred = self.model.predict(X_val_scaled)
        rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
        print(f"Validation RMSE: {rmse}")
        return rmse

    def impute_missing_values(self, X_train_scaled, y_train, X_test_scaled):
        """
        Refit the model on the training data and impute the values predicted for X_test_scaled.
        """
        self.model.fit(X_train_scaled, y_train)
        y_impute = self.model.predict(X_test_scaled)
        self.data.loc[self.nan_ix, self.target_column] = y_impute
        print("Missing values imputed.")
        return self.data

    def run_pipeline(self):
        """
        Runs the entire pipeline: loading data, introducing NaNs, preprocessing,
        splitting, training, validating, and imputing missing values.
        """
        self.load_data()
        self.introduce_random_nans()
        self.drop_high_null_columns()
        self.preprocess_column()
        # Split and scale data
        X_train_scaled, y_train, X_val_scaled, y_val, X_test_scaled = self.split_and_scale()
        # Train and validate
        self.fit_and_validate(X_train_scaled, y_train, X_val_scaled, y_val)
        # Impute missing values
        self.impute_missing_values(X_train_scaled, y_train, X_test_scaled)
        print("\nFinal imputed target column values:")
        print(self.data[self.target_column])
path = "winequalityred.csv"
target_column = 'sulphates'
imputer = ImputerLightGBM(path, target_column)
imputer.run_pipeline()

The validation RMSE (~0.14) may not be ideal, but the model can still be used for imputation.

Let's also try a kNN model to see whether we get better results.
2.2. Multivariate Imputation using K-Nearest Neighbors (kNN)
kNN is a supervised learning algorithm that looks at the k closest data points to an instance and predicts the desired output from them. It can use the mean/median of these k data points if the variable is numeric, or the mode if the variable is categorical.
To implement the kNN algorithm for the imputation of the same dataset, we can basically follow the same steps as with the LGBMRegressor model. This time, scaling via standardization or normalization is essential, since kNN is a distance-based method.
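To illustrate why scaling matters for a distance-based model, here is a minimal sketch on made-up toy data (not the wine data), where one feature has a much larger scale than the other:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X = np.column_stack([
    rng.normal(scale=1.0, size=200),     # feature on a small scale
    rng.normal(scale=1000.0, size=200),  # feature on a much larger scale
])
y = X[:, 0] + 0.001 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Without scaling, the large-scale feature dominates the distance metric,
# so we standardize before fitting the kNN model
X_scaled = StandardScaler().fit_transform(X)
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_scaled, y)
print(knn.predict(X_scaled[:5]))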
Since the kNN algorithm does not accept null values in the training set, we add an additional line to the preprocess_column function that drops rows with missing predictor values:
- Build an ImputerKnn class:
class ImputerKnn:
    def __init__(self, path, target_column='sulphates'):
        """
        Initialize the pipeline with dataset path and target column name.
        Args:
            path: Path to dataset.
            target_column: Target column name for modeling/imputation.
        """
        self.path = path
        self.target_column = target_column
        self.data = None
        self.nan_indices = None
        self.scaler = StandardScaler()
        self.model = None

    def load_data(self):
        """Load dataset from file."""
        self.data = pd.read_csv(self.path, sep=';', header=0)
        print("Loaded data:\n", self.data.head())

    def introduce_random_nans(self, fraction=0.2):
        """Randomly set a fraction of the cells to NaN (same logic as the standalone helper)."""
        num_nans = int(self.data.size * fraction)
        row_indices = np.random.randint(0, self.data.shape[0], num_nans)
        col_indices = np.random.randint(0, self.data.shape[1], num_nans)
        for r, c in zip(row_indices, col_indices):
            self.data.iat[r, c] = np.nan

    def drop_high_null_columns(self, threshold=0.5):
        """Drop columns with missing values above a certain threshold."""
        cols_to_drop = [col for col in self.data.columns
                        if self.data[col].isnull().sum() > threshold * len(self.data)]
        self.data.drop(columns=cols_to_drop, inplace=True)
        print(f"Dropped columns: {cols_to_drop}")
        return self

    def preprocess_column(self):
        """Handle NaN preprocessing and add an indicator for missing values."""
        data = self.data.copy()
        # Drop rows with NaN values in the predictor columns (kNN cannot handle them)
        data = data[data.drop(columns=[self.target_column]).notnull().all(axis=1)]
        nan_ix = data.index[data[self.target_column].isna()].tolist()
        data['is_nan'] = 0
        data.loc[nan_ix, 'is_nan'] = 1
        for cat_col in data.select_dtypes(include='object'):
            data[cat_col], _ = pd.factorize(data[cat_col])
        self.data = data
        self.nan_indices = nan_ix
        print("Preprocessing complete.")
        return self

    def split_and_scale(self, val_size=0.2):
        """Split the data into training, validation, and test sets and scale features."""
        features = self.data.drop([self.target_column, 'is_nan'], axis=1)
        target = self.data[self.target_column]
        # Separate missing rows for testing
        X_train_and_val = features.drop(index=self.nan_indices)
        X_test = features.loc[self.nan_indices]
        y_train_and_val = target.drop(index=self.nan_indices)
        # Split into train/validation sets
        X_train, X_val, y_train, y_val = train_test_split(
            X_train_and_val, y_train_and_val, test_size=val_size, random_state=42
        )
        # Scale features
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_val_scaled = self.scaler.transform(X_val)
        X_test_scaled = self.scaler.transform(X_test)
        print("Split and scaling complete.")
        return X_train_scaled, y_train, X_val_scaled, y_val, X_test_scaled

    def fit_knn(self, X_train_scaled, y_train, X_val_scaled, y_val, n_neighbors=5):
        """Train KNN model and validate performance."""
        self.model = KNeighborsRegressor(n_neighbors=n_neighbors)
        self.model.fit(X_train_scaled, y_train)
        y_val_pred = self.model.predict(X_val_scaled)
        rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
        print(f"Validation RMSE: {rmse}")
        return rmse

    def impute_missing(self, X_test_scaled):
        """Impute missing values using the trained KNN model."""
        imputed_values = self.model.predict(X_test_scaled)
        self.data.loc[self.nan_indices, self.target_column] = imputed_values
        print("Missing values imputed in test set.")
        return self.data

    def run_pipeline(self):
        """
        Orchestrates the entire pipeline in sequence.
        """
        self.load_data()
        self.introduce_random_nans()
        self.drop_high_null_columns()
        self.preprocess_column()
        X_train_scaled, y_train, X_val_scaled, y_val, X_test_scaled = self.split_and_scale()
        self.fit_knn(X_train_scaled, y_train, X_val_scaled, y_val)
        final_data = self.impute_missing(X_test_scaled)
        print("Pipeline executed successfully.")
        print("Final imputed data:\n", final_data['sulphates'])
        return final_data
- Run the final imputation:
# Initialize the pipeline
path_to_data = "winequalityred.csv"
pipeline = ImputerKnn(path=path_to_data)
# Run the entire pipeline in one call
final_results = pipeline.run_pipeline()

The validation RMSE is ~0.13, which is slightly better than the LightGBM regressor.

2.3. Multivariate Imputation using Autoencoders
Autoencoders are unsupervised neural networks designed to learn a compressed representation (latent space) of the data. By reconstructing the input data from this latent space, Autoencoders can effectively model the intricate dependencies between features. Leveraging these learned representations, we can impute missing values by reconstructing them based on the patterns observed in the complete data.
The provided code outlines a step-by-step workflow for imputing missing values using Autoencoders, from data preparation and model training to applying the trained Autoencoder for imputation. We do not need the same preprocessing steps here, since the Autoencoder takes the full (one-hot encoded) feature matrix together with a missingness mask as its input.
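Note that the class below relies on two small helper functions, make_reconstruction_loss and masked_mae, which are not part of Keras or scikit-learn. A minimal sketch of them, assuming the usual masked-loss formulation (the second half of the network input encodes the missingness mask, and the loss and MAE are computed only on the selected cells), could look like this:

import numpy as np
import tensorflow as tf

def make_reconstruction_loss(n_features):
    # Build a loss that compares the prediction and the input only on observed cells.
    # y_true is the concatenation [data, missing_mask] fed to train_on_batch below.
    def reconstruction_loss(input_and_mask, y_pred):
        X_values = input_and_mask[:, :n_features]
        missing_mask = input_and_mask[:, n_features:]
        observed_mask = 1.0 - missing_mask
        return tf.keras.losses.mean_squared_error(
            X_values * observed_mask, y_pred * observed_mask
        )
    return reconstruction_loss

def masked_mae(X_true, X_pred, mask):
    # Mean absolute error computed only on the cells selected by the mask
    return np.mean(np.abs(X_true[mask] - X_pred[mask]))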
- Build an ImputerAutoencoder class:
class ImputerAutoencoder:
    def __init__(self, data,
                 recurrent_weight=0.5,
                 optimizer="adam",
                 dropout_probability=0.5,
                 hidden_activation="relu",
                 output_activation="sigmoid",
                 init="glorot_normal",
                 l1_penalty=0,
                 l2_penalty=0):
        self.data = data.copy()
        self.recurrent_weight = recurrent_weight
        self.optimizer = optimizer
        self.dropout_probability = dropout_probability
        self.hidden_activation = hidden_activation
        self.output_activation = output_activation
        self.init = init
        self.l1_penalty = l1_penalty
        self.l2_penalty = l2_penalty

    def _get_hidden_layer_sizes(self):
        n_dims = self.data.shape[1]
        return [
            min(2000, 8 * n_dims),
            min(500, 2 * n_dims),
            int(np.ceil(0.5 * n_dims)),
        ]

    def _create_model(self):
        hidden_layer_sizes = self._get_hidden_layer_sizes()
        first_layer_size = hidden_layer_sizes[0]
        n_dims = self.data.shape[1]
        model = Sequential()
        # First hidden layer takes the data concatenated with its missingness mask
        model.add(Dense(
            first_layer_size,
            input_dim=2 * n_dims,
            activation=self.hidden_activation,
            kernel_regularizer=l1_l2(self.l1_penalty, self.l2_penalty),
            kernel_initializer=self.init
        ))
        model.add(Dropout(self.dropout_probability))
        # Remaining hidden layers
        for layer_size in hidden_layer_sizes[1:]:
            model.add(Dense(
                layer_size,
                activation=self.hidden_activation,
                kernel_regularizer=l1_l2(self.l1_penalty, self.l2_penalty),
                kernel_initializer=self.init
            ))
            model.add(Dropout(self.dropout_probability))
        # Output layer reconstructs all n_dims features
        model.add(Dense(
            n_dims,
            activation=self.output_activation,
            kernel_regularizer=l1_l2(self.l1_penalty, self.l2_penalty),
            kernel_initializer=self.init
        ))
        loss_function = make_reconstruction_loss(n_dims)
        model.compile(optimizer=self.optimizer, loss=loss_function)
        return model

    def fill(self, missing_mask):
        self.data[missing_mask] = -1

    def _create_missing_mask(self):
        if self.data.dtype != "f" and self.data.dtype != "d":
            self.data = self.data.astype(float)
        return np.isnan(self.data)

    def _train_epoch(self, model, missing_mask, batch_size):
        input_with_mask = np.hstack([self.data, missing_mask])
        n_samples = len(input_with_mask)
        n_batches = int(np.ceil(n_samples / batch_size))
        indices = np.arange(n_samples)
        np.random.shuffle(indices)
        X_shuffled = input_with_mask[indices]
        for batch_idx in range(n_batches):
            batch_start = batch_idx * batch_size
            batch_end = (batch_idx + 1) * batch_size
            batch_data = X_shuffled[batch_start:batch_end, :]
            model.train_on_batch(batch_data, batch_data)
        return model.predict(input_with_mask)

    def train(self, batch_size=256, train_epochs=100):
        missing_mask = self._create_missing_mask()
        self.fill(missing_mask)
        self.model = self._create_model()
        observed_mask = ~missing_mask
        for epoch in range(train_epochs):
            X_pred = self._train_epoch(self.model, missing_mask, batch_size)
            observed_mae = masked_mae(X_true=self.data,
                                      X_pred=X_pred,
                                      mask=observed_mask)
            if epoch % 10 == 0:
                print("observed mae:", observed_mae)
            # Blend the current imputation with the new prediction for the missing cells
            old_weight = (1.0 - self.recurrent_weight)
            self.data[missing_mask] *= old_weight
            pred_missing = X_pred[missing_mask]
            self.data[missing_mask] += self.recurrent_weight * pred_missing
        return self.data.copy()
- Run it to predict and impute missing values:
missing_encoded = pd.get_dummies(df)
# Propagate NaNs from the original columns into their one-hot encoded columns
for col in df.columns:
    missing_cols = missing_encoded.columns.str.startswith(str(col) + "_")
    missing_encoded.loc[df[col].isnull(), missing_cols] = np.nan

imputer = ImputerAutoencoder(missing_encoded.values)
complete_encoded = imputer.train(train_epochs=50, batch_size=10)
df['sulphates'] = complete_encoded[:, missing_encoded.columns.get_loc('sulphates')]
print(df['sulphates'])

The observed MAE printed during training (~0.007 after 50 epochs) is lower than the validation errors of the LightGBM and kNN models, although note that it is computed on the observed cells rather than on a held-out validation set.

3. Multiple Imputation by Chained Equations (MICE)
MICE is an iterative process that estimates the missing values of each variable one by one. At each step, it treats that variable as dependent on the other variables and uses them as predictors.
Assume we have a dataset like this, where x marks a cell with an observed value and the empty cells are the missing values:

In the first step, we predict the null values in column A, using a selected prediction algorithm (e.g., linear regression, logistic regression, k-nearest neighbors) with the values in columns B, C, D, E as predictors.

We repeat this step for the missing values of each variable one by one and get our first estimations. We use the values in columns A, C, D, E to predict the null values in B; and so on.

Once we have the initial predictions, we start the second iteration. We delete the predictions we made for column A and predict them again, this time using the new dataset in which the other variables' columns have been filled.

We get the second predictions for the missing values in column A, which are represented with z in the table:

We repeat this step for the other variables (B, C, D, E) one by one and complete the second iteration.

We can repeat the iterations as many times as we want, but the most commonly used number of iterations is 10. With each pass, the predicted values get closer to the real ones and the error decreases. Instead of choosing the number of iterations beforehand, we can also set a tolerance before running the algorithm and stop iterating once it is reached; in that case, the process continues until the imputations stabilize.
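To make the procedure above concrete, here is a highly simplified, conceptual sketch of the chained-equations loop using plain linear regression as the per-column model (numeric columns only; this is only an illustration, not a replacement for a proper MICE implementation):

from sklearn.linear_model import LinearRegression

def simple_mice(df, n_iterations=10):
    # df is a pandas DataFrame with numeric columns and NaNs
    data = df.copy()
    missing = df.isna()
    # Step 0: initialize every missing cell with its column mean
    data = data.fillna(data.mean())
    for _ in range(n_iterations):
        for col in df.columns:
            if not missing[col].any():
                continue
            # Re-predict this column's missing cells from all other (currently filled) columns
            other_cols = data.columns != col
            observed_rows = ~missing[col]
            model = LinearRegression()
            model.fit(data.loc[observed_rows, other_cols], data.loc[observed_rows, col])
            data.loc[missing[col], col] = model.predict(data.loc[missing[col], other_cols])
    return data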
You can see other terminologies used for MICE, such as "Fully Conditional Specification" or "Sequential Regression Multiple Imputation" [3] in the literature as well.
As a final note, it is important to remember that applying MICE when data are not MAR could result in biased estimates [3]. Therefore, it should be applied under the MAR assumption.
Let's also discuss how we can implement MICE using scikit-learn's IterativeImputer.
Note: scikit-learn's IterativeImputer is still experimental; its behaviour and API might change in the future.
The scikit-learn example on IterativeImputer variants demonstrates four estimators that can be plugged in [4] (a minimal sketch of swapping them follows the list):
- BayesianRidge
- RandomForestRegressor
- make_pipeline(Nystroem, Ridge)
- KNeighborsRegressor
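A minimal sketch of how any of these estimators can be plugged into IterativeImputer (the parameters shown are illustrative, not tuned):

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor

estimators = {
    "bayesian_ridge": BayesianRidge(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "nystroem_ridge": make_pipeline(Nystroem(random_state=42), Ridge()),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

# Any of these estimators can be swapped into IterativeImputer
imputer = IterativeImputer(estimator=estimators["bayesian_ridge"],
                           max_iter=10, random_state=42)
# df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)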
You can check the scikit-learn documentation for more information. In the following code, you can see how to apply IterativeImputer with BayesianRidge, using the same code structure as before.
- Build an ImputerMICE class:
class ImputerMICE:
    def __init__(self, path, target_column='sulphates'):
        self.path = path
        self.target_column = target_column
        self.data = None
        self.scaler = StandardScaler()
        self.nan_indices = None

    def load_data(self):
        self.data = pd.read_csv(self.path, sep=';', header=0)
        print("Loaded data:\n", self.data.head())
        return self

    def introduce_random_nans(self, fraction=0.2):
        # Randomly set a fraction of the cells to NaN (same logic as the standalone helper)
        num_nans = int(self.data.size * fraction)
        row_indices = np.random.randint(0, self.data.shape[0], num_nans)
        col_indices = np.random.randint(0, self.data.shape[1], num_nans)
        for r, c in zip(row_indices, col_indices):
            self.data.iat[r, c] = np.nan
        return self

    def drop_high_null_columns(self, threshold=0.5):
        # Drop columns with more than the threshold proportion of missing values.
        cols_to_drop = [col for col in self.data.columns
                        if self.data[col].isnull().sum() > threshold * len(self.data)]
        self.data.drop(columns=cols_to_drop, inplace=True)
        print(f"Dropped columns: {cols_to_drop}")
        return self

    def preprocess_column(self):
        # Handle NaN preprocessing and prepare the data for iterative imputation.
        data = self.data.copy()
        # Drop rows with NaN values in the predictor columns
        data = data[data.drop(columns=[self.target_column]).notnull().all(axis=1)]
        self.nan_indices = data.index[data[self.target_column].isna()].tolist()
        # Add an indicator for rows that originally had NaNs in the target
        data['is_nan'] = 0
        data.loc[self.nan_indices, 'is_nan'] = 1
        for cat_col in data.select_dtypes(include='object'):
            data[cat_col], _ = pd.factorize(data[cat_col])
        self.data = data
        print("Preprocessing complete.")
        return self

    def split_and_scale(self, val_size=0.2):
        # Split the dataset into training/validation/test splits and scale the features.
        features = self.data.drop([self.target_column, 'is_nan'], axis=1)
        target = self.data[self.target_column]
        # Separate rows with missing target values for testing purposes
        X_train_and_val = features.drop(index=self.nan_indices)
        X_test = features.loc[self.nan_indices]
        y_train_and_val = target.drop(index=self.nan_indices)
        # Split the data into training and validation sets
        X_train, X_val, y_train, y_val = train_test_split(
            X_train_and_val, y_train_and_val, test_size=val_size, random_state=42
        )
        # Scale the features
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_val_scaled = self.scaler.transform(X_val)
        X_test_scaled = self.scaler.transform(X_test)
        print("Split and scaling complete.")
        return X_train_scaled, y_train, X_val_scaled, y_val, X_test_scaled

    def apply_iterative_imputation(self, X_train_scaled, y_train, X_test_scaled,
                                   estimator=None, max_iter=10, random_state=42):
        # Applies iterative imputation. The target column is appended to the feature matrix,
        # with NaN in the test rows, so that IterativeImputer fills it in.
        if estimator is None:
            estimator = BayesianRidge()
        imputer = IterativeImputer(estimator=estimator, max_iter=max_iter, random_state=random_state)
        # Training rows carry the observed target values
        train_matrix = np.column_stack([X_train_scaled, y_train])
        # Test rows carry NaN in the target position
        test_matrix = np.column_stack([X_test_scaled, np.full(len(X_test_scaled), np.nan)])
        # Fit on all rows so the imputer learns to model the target from the other features
        full_matrix = np.vstack([train_matrix, test_matrix])
        full_imputed = imputer.fit_transform(full_matrix)
        # The imputed target values for the test rows sit in the last column of the last rows
        return full_imputed[len(train_matrix):, -1]

    def run_iterative_imputation_workflow(self, estimator=None, max_iter=10, random_state=42):
        # Executes the iterative imputation workflow for the target column.
        X_train_scaled, y_train, X_val_scaled, y_val, X_test_scaled = self.split_and_scale()
        # Apply iterative imputation
        imputed_values = self.apply_iterative_imputation(
            X_train_scaled, y_train, X_test_scaled,
            estimator=estimator, max_iter=max_iter, random_state=random_state
        )
        # Update missing data in the preprocessed DataFrame
        self.data.loc[self.nan_indices, self.target_column] = imputed_values
        print("Iterative imputation workflow complete.")
        return self.data
- Run it to predict and impute missing values:
mice_instance = ImputerMICE(path='winequalityred.csv', target_column='sulphates')
mice_instance.load_data().introduce_random_nans().drop_high_null_columns().preprocess_column()
updated_data = mice_instance.run_iterative_imputation_workflow()

Conclusion
It is important to keep in mind that the original data cannot be recovered, no matter which imputation technique is used. However, it is possible to use techniques that generate imputed datasets that are as close as possible to reality [5].
Throughout this article, we have explored various multivariate imputation methodologies and how to implement iterative imputation. By tailoring these techniques, we can minimize bias and preserve data integrity, which improves our final model's performance or leads to a more accurate analysis.
In the next article, we will discuss how to determine which method works best for our dataset by comparing the results of different imputation techniques. Stay tuned!
All images, unless otherwise noted, are by the author.
References
[1] Red Wine Quality Data, Wine Quality – UCI Machine Learning Repository, June 10, 2009 (CC BY 4.0).
[2] Mera-Gaona et al., Evaluating the impact of multivariate imputation by MICE in feature selection, https://doi.org/10.1371/journal.pone.0254720, 2021.
[3] Azur, Melissa J et al. "Multiple imputation by chained equations: what is it and how does it work?." International journal of methods in psychiatric research vol. 20,1 (2011): 40–9. doi:10.1002/mpr.329
[4] Imputing missing values with variants of IterativeImputer. www.scikit-learn.org
[5] Kelta, Zoumana. Top Techniques to Handle Missing Values Every Data Scientist Should Know. www.datacamp.com, 2023.