Missing Value Imputation, Explained: A Visual Guide with Code Examples for Beginners
DATA PREPROCESSING

⛳️ More [Data Preprocessing](https://medium.com/@samybaladram/list/data-preprocessing-17a2c49b44e4), explained: ▶ [Missing Value Imputation](https://towardsdatascience.com/missing-value-imputation-explained-a-visual-guide-with-code-examples-for-beginners-93e0726284eb) · [Categorical Encoding](https://towardsdatascience.com/encoding-categorical-data-explained-a-visual-guide-with-code-example-for-beginners-b169ac4193ae) · [Data Scaling](https://towardsdatascience.com/scaling-numerical-data-explained-a-visual-guide-with-code-examples-for-beginners-11676cdb45cb) · [Discretization](https://towardsdatascience.com/discretization-explained-a-visual-guide-with-code-examples-for-beginners-f056af9102fa?gi=c1bf25229f86) · [Oversampling & Undersampling](https://towardsdatascience.com/oversampling-and-undersampling-explained-a-visual-guide-with-mini-2d-dataset-1155577d3091) · [Data Leakage in Preprocessing](https://towardsdatascience.com/data-leakage-in-preprocessing-explained-a-visual-guide-with-code-examples-33cbf07507b7)
Let's talk about something that every data scientist, analyst, or curious number-cruncher has to deal with sooner or later: missing values. Now, I know what you're thinking – "Oh great, another missing value guide." But hear me out. I'm going to show you how to tackle this problem using not one, not two, but six different imputation methods, all on a single dataset (with helpful visuals as well!). By the end of this, you'll see why domain knowledge is worth its weight in gold (something even our AI friends might struggle to replicate).

What Are Missing Values and Why Do They Occur?
Before we get into our dataset and imputation methods, let's take a moment to understand what missing values are and why they're such a common headache in data science.
What Are Missing Values?
Missing values, often represented as NaN (Not a Number) in pandas or NULL in databases, are essentially holes in your dataset. They're the empty cells in your spreadsheet, the blanks in your survey responses, the data points that got away. In the world of data, not all absences are created equal, and understanding the nature of your missing values is crucial for deciding how to handle them.

Why Do Missing Values Occur?
Missing values can sneak into your data for a variety of reasons. Here are some common culprits:
- Data Entry Errors: Sometimes, it's just human error. Someone might forget to input a value or accidentally delete one.
- Sensor Malfunctions: In IoT or scientific experiments, a faulty sensor might fail to record data at certain times.
- Survey Non-Response: In surveys, respondents might skip questions they're uncomfortable answering or don't understand.
- Merged Datasets: When combining data from multiple sources, some entries might not have corresponding values in all datasets.
- Data Corruption: During data transfer or storage, some values might get corrupted and become unreadable.
- Intentional Omissions: Some data might be intentionally left out due to privacy concerns or irrelevance.
- Sampling Issues: The data collection method might systematically miss certain types of data.
- Time-Sensitive Data: In time series data, values might be missing for periods when data wasn't collected (e.g., weekends, holidays).
Types of Missing Data
Understanding the type of missing data you're dealing with can help you choose the most appropriate imputation method. Statisticians generally categorize missing data into three types:
- Missing Completely at Random (MCAR): The missingness is totally random and doesn't depend on any other variable. For example, if a lab sample was accidentally dropped.
- Missing at Random (MAR): The probability of missing data depends on other observed variables but not on the missing data itself. For example, men might be less likely to answer questions about emotions in a survey.
- Missing Not at Random (MNAR): The missingness depends on the value of the missing data itself. For example, people with high incomes might be less likely to report their income in a survey.
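To make this distinction concrete, here's a small simulated sketch of a MAR check (the survey data below is entirely made up for illustration): if the rate of missingness in one column differs across groups defined by another observed column, the data is probably not MCAR.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-survey where income is missing more often for one group (a MAR pattern)
rng = np.random.default_rng(0)
survey = pd.DataFrame({
    "gender": rng.choice(["M", "F"], size=200),
    "income": rng.normal(50_000, 10_000, size=200),
})

# Simulate MAR: make income missing more often for one observed group
mask = (survey["gender"] == "M") & (rng.random(200) < 0.4)
survey.loc[mask, "income"] = np.nan

# A rough diagnostic: does the missing rate differ across observed groups?
missing_rate = survey["income"].isna().groupby(survey["gender"]).mean()
print(missing_rate)
```

A check like this can only rule out MCAR, not confirm it: MNAR, by definition, depends on values you never observed, so no purely data-driven test can distinguish it from MAR.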

Why Care About Missing Values?
Missing values can significantly impact your analysis:
- They can introduce bias if not handled properly.
- Many machine learning algorithms can't handle missing values out of the box.
- They can lead to loss of important information if instances with missing values are simply discarded.
- Improperly handled missing values can lead to incorrect conclusions or predictions.
That's why it's crucial to have a solid strategy for dealing with missing values. And that's exactly what we're going to explore in this article!
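As a quick illustration of how easily missing values distort results: NumPy and pandas don't even agree on what the mean of the same numbers is once a NaN is present (the temperatures below are arbitrary).

```python
import numpy as np
import pandas as pd

temps = [25.1, 26.4, np.nan, 24.1]

# NumPy propagates the missing value: the whole mean becomes NaN
print(np.mean(temps))

# pandas silently skips it (skipna=True by default): mean of the 3 observed values
print(pd.Series(temps).mean())
```

Neither answer is "wrong", but they embody different assumptions about the missing value, which is exactly why an explicit imputation strategy beats relying on library defaults.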
The Dataset
First things first, let's introduce our dataset. We'll be working with a golf course dataset that tracks various factors affecting the crowdedness of the course. This dataset has a bit of everything – numerical data, categorical data, and yes, plenty of missing values.

```python
import pandas as pd
import numpy as np

# Create the dataset as a dictionary
data = {
    'Date': ['08-01', '08-02', '08-03', '08-04', '08-05', '08-06', '08-07', '08-08', '08-09', '08-10',
             '08-11', '08-12', '08-13', '08-14', '08-15', '08-16', '08-17', '08-18', '08-19', '08-20'],
    'Weekday': [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5],
    'Holiday': [0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'Temp': [25.1, 26.4, np.nan, 24.1, 24.7, 26.5, 27.6, 28.2, 27.1, 26.7, np.nan, 24.3, 23.1, 22.4, np.nan, 26.5, 28.6, np.nan, 27.0, 26.9],
    'Humidity': [99.0, np.nan, 96.0, 68.0, 98.0, 98.0, 78.0, np.nan, 70.0, 75.0, np.nan, 77.0, 77.0, 89.0, 80.0, 88.0, 76.0, np.nan, 73.0, 73.0],
    'Wind': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, np.nan, 1.0, 0.0],
    'Outlook': ['rainy', 'sunny', 'rainy', 'overcast', 'rainy', np.nan, 'rainy', 'rainy', 'overcast', 'sunny', np.nan, 'overcast', 'sunny', 'rainy', 'sunny', 'rainy', np.nan, 'rainy', 'overcast', 'sunny'],
    'Crowdedness': [0.14, np.nan, 0.21, 0.68, 0.20, 0.32, 0.72, 0.61, np.nan, 0.54, np.nan, 0.67, 0.66, 0.38, 0.46, np.nan, 0.52, np.nan, 0.62, 0.81]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display basic information about the dataset
df.info()

# Display the first few rows of the dataset
print(df.head())

# Display the count of missing values in each column
print(df.isnull().sum())
```
Output:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         20 non-null     object 
 1   Weekday      20 non-null     int64  
 2   Holiday      19 non-null     float64
 3   Temp         16 non-null     float64
 4   Humidity     17 non-null     float64
 5   Wind         19 non-null     float64
 6   Outlook      17 non-null     object 
 7   Crowdedness  15 non-null     float64
dtypes: float64(5), int64(1), object(2)
memory usage: 1.3+ KB
    Date  Weekday  Holiday  Temp  Humidity  Wind   Outlook  Crowdedness
0  08-01        0      0.0  25.1      99.0   0.0     rainy         0.14
1  08-02        1      0.0  26.4       NaN   0.0     sunny          NaN
2  08-03        2      0.0   NaN      96.0   0.0     rainy         0.21
3  08-04        3      0.0  24.1      68.0   0.0  overcast         0.68
4  08-05        4      NaN  24.7      98.0   0.0     rainy         0.20
Date           0
Weekday        0
Holiday        1
Temp           4
Humidity       3
Wind           1
Outlook        3
Crowdedness    5
dtype: int64
```
As we can see, our dataset contains 20 rows and 8 columns:
- Date: The date of the observation
- Weekday: Day of the week (0–6, where 0 is Monday)
- Holiday: Boolean indicating if it's a holiday (0 or 1)
- Temp: Temperature in Celsius
- Humidity: Humidity percentage
- Wind: Wind condition (0 or 1, possibly indicating calm or windy)
- Outlook: Weather outlook (sunny, overcast, or rainy)
- Crowdedness: Percentage of course occupancy
And look at that! We've got missing values in every column except Date and Weekday. Perfect for our imputation party.
Now that we have our dataset loaded, let's tackle these missing values with six different imputation methods. We'll use a different strategy for each type of data.
Method 1: Listwise Deletion
Listwise deletion, also known as complete case analysis, involves removing entire rows that contain any missing values. This method is simple, and it leaves the distribution of the remaining data unbiased as long as values are missing completely at random (MCAR). But it can lead to a significant loss of information if many rows contain missing values, and if the missingness is not completely random, the rows that survive are no longer a representative sample.
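As a minimal sketch of listwise deletion (using just the first five rows of the golf dataset above), pandas makes it a one-liner with `dropna`:

```python
import numpy as np
import pandas as pd

# First five rows of the golf-course dataset shown above
df = pd.DataFrame({
    'Date': ['08-01', '08-02', '08-03', '08-04', '08-05'],
    'Weekday': [0, 1, 2, 3, 4],
    'Holiday': [0.0, 0.0, 0.0, 0.0, np.nan],
    'Temp': [25.1, 26.4, np.nan, 24.1, 24.7],
    'Humidity': [99.0, np.nan, 96.0, 68.0, 98.0],
    'Wind': [0.0, 0.0, 0.0, 0.0, 0.0],
    'Outlook': ['rainy', 'sunny', 'rainy', 'overcast', 'rainy'],
    'Crowdedness': [0.14, np.nan, 0.21, 0.68, 0.20],
})

# Listwise deletion: drop every row that has at least one missing value
complete_cases = df.dropna()
print(complete_cases)  # only rows 0 and 3 survive
```

Notice the cost: 3 of these 5 rows disappear because of a single missing cell each, which is exactly the information loss the paragraph above warns about.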