Missing Value Imputation, Explained: A Visual Guide with Code Examples for Beginners
DATA PREPROCESSING

⛳️ More [Data Preprocessing](https://medium.com/@samybaladram/list/data-preprocessing-17a2c49b44e4), explained: ▶ [Missing Value Imputation](https://towardsdatascience.com/missing-value-imputation-explained-a-visual-guide-with-code-examples-for-beginners-93e0726284eb) · [Categorical Encoding](https://towardsdatascience.com/encoding-categorical-data-explained-a-visual-guide-with-code-example-for-beginners-b169ac4193ae) · [Data Scaling](https://towardsdatascience.com/scaling-numerical-data-explained-a-visual-guide-with-code-examples-for-beginners-11676cdb45cb) · [Discretization](https://towardsdatascience.com/discretization-explained-a-visual-guide-with-code-examples-for-beginners-f056af9102fa?gi=c1bf25229f86) · [Oversampling & Undersampling](https://towardsdatascience.com/oversampling-and-undersampling-explained-a-visual-guide-with-mini-2d-dataset-1155577d3091) · [Data Leakage in Preprocessing](https://towardsdatascience.com/data-leakage-in-preprocessing-explained-a-visual-guide-with-code-examples-33cbf07507b7)
Let's talk about something that every data scientist, analyst, or curious number-cruncher has to deal with sooner or later: missing values. Now, I know what you're thinking – "Oh great, another missing value guide." But hear me out. I'm going to show you how to tackle this problem using not one, not two, but six different imputation methods, all on a single dataset (with helpful visuals as well!). By the end of this, you'll see why domain knowledge is worth its weight in gold (something even our AI friends might struggle to replicate).

What Are Missing Values and Why Do They Occur?
Before we get into our dataset and imputation methods, let's take a moment to understand what missing values are and why they're such a common headache in data science.
What Are Missing Values?
Missing values, often represented as NaN (Not a Number) in pandas or NULL in databases, are essentially holes in your dataset. They're the empty cells in your spreadsheet, the blanks in your survey responses, the data points that got away. In the world of data, not all absences are created equal, and understanding the nature of your missing values is crucial for deciding how to handle them.

Why Do Missing Values Occur?
Missing values can sneak into your data for a variety of reasons. Here are some common culprits:
- Data Entry Errors: Sometimes, it's just human error. Someone might forget to input a value or accidentally delete one.
- Sensor Malfunctions: In IoT or scientific experiments, a faulty sensor might fail to record data at certain times.
- Survey Non-Response: In surveys, respondents might skip questions they're uncomfortable answering or don't understand.
- Merged Datasets: When combining data from multiple sources, some entries might not have corresponding values in all datasets.
- Data Corruption: During data transfer or storage, some values might get corrupted and become unreadable.
- Intentional Omissions: Some data might be intentionally left out due to privacy concerns or irrelevance.
- Sampling Issues: The data collection method might systematically miss certain types of data.
- Time-Sensitive Data: In time series data, values might be missing for periods when data wasn't collected (e.g., weekends, holidays).
Types of Missing Data
Understanding the type of missing data you're dealing with can help you choose the most appropriate imputation method. Statisticians generally categorize missing data into three types:
- Missing Completely at Random (MCAR): The missingness is totally random and doesn't depend on any other variable. For example, if a lab sample was accidentally dropped.
- Missing at Random (MAR): The probability of missing data depends on other observed variables but not on the missing data itself. For example, men might be less likely to answer questions about emotions in a survey.
- Missing Not at Random (MNAR): The missingness depends on the value of the missing data itself. For example, people with high incomes might be less likely to report their income in a survey.
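To make this distinction concrete, here's a small simulated sketch of a MAR check (the survey data below is entirely made up for illustration): if the rate of missingness in one column differs across groups defined by another observed column, the data is probably not MCAR.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-survey where income is missing more often for one group (a MAR pattern)
rng = np.random.default_rng(0)
survey = pd.DataFrame({
    "gender": rng.choice(["M", "F"], size=200),
    "income": rng.normal(50_000, 10_000, size=200),
})

# Simulate MAR: make income missing more often for one observed group
mask = (survey["gender"] == "M") & (rng.random(200) < 0.4)
survey.loc[mask, "income"] = np.nan

# A rough diagnostic: does the missing rate differ across observed groups?
missing_rate = survey["income"].isna().groupby(survey["gender"]).mean()
print(missing_rate)
```

A check like this can only rule out MCAR, not confirm it: MNAR, by definition, depends on values you never observed, so no purely data-driven test can distinguish it from MAR.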

Why Care About Missing Values?
Missing values can significantly impact your analysis:
- They can introduce bias if not handled properly.
- Many machine learning algorithms can't handle missing values out of the box.
- They can lead to loss of important information if instances with missing values are simply discarded.
- Improperly handled missing values can lead to incorrect conclusions or predictions.
That's why it's crucial to have a solid strategy for dealing with missing values. And that's exactly what we're going to explore in this article!
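As a quick illustration of how easily missing values distort results: NumPy and pandas don't even agree on what the mean of the same numbers is once a NaN is present (the temperatures below are arbitrary).

```python
import numpy as np
import pandas as pd

temps = [25.1, 26.4, np.nan, 24.1]

# NumPy propagates the missing value: the whole mean becomes NaN
print(np.mean(temps))

# pandas silently skips it (skipna=True by default): mean of the 3 observed values
print(pd.Series(temps).mean())
```

Neither answer is "wrong", but they embody different assumptions about the missing value, which is exactly why an explicit imputation strategy beats relying on library defaults.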
The Dataset
First things first, let's introduce our dataset. We'll be working with a golf course dataset that tracks various factors affecting the crowdedness of the course. This dataset has a bit of everything – numerical data, categorical data, and yes, plenty of missing values.

```python
import pandas as pd
import numpy as np

# Create the dataset as a dictionary
data = {
    'Date': ['08-01', '08-02', '08-03', '08-04', '08-05', '08-06', '08-07', '08-08', '08-09', '08-10',
             '08-11', '08-12', '08-13', '08-14', '08-15', '08-16', '08-17', '08-18', '08-19', '08-20'],
    'Weekday': [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5],
    'Holiday': [0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'Temp': [25.1, 26.4, np.nan, 24.1, 24.7, 26.5, 27.6, 28.2, 27.1, 26.7, np.nan, 24.3, 23.1, 22.4, np.nan, 26.5, 28.6, np.nan, 27.0, 26.9],
    'Humidity': [99.0, np.nan, 96.0, 68.0, 98.0, 98.0, 78.0, np.nan, 70.0, 75.0, np.nan, 77.0, 77.0, 89.0, 80.0, 88.0, 76.0, np.nan, 73.0, 73.0],
    'Wind': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, np.nan, 1.0, 0.0],
    'Outlook': ['rainy', 'sunny', 'rainy', 'overcast', 'rainy', np.nan, 'rainy', 'rainy', 'overcast', 'sunny', np.nan, 'overcast', 'sunny', 'rainy', 'sunny', 'rainy', np.nan, 'rainy', 'overcast', 'sunny'],
    'Crowdedness': [0.14, np.nan, 0.21, 0.68, 0.20, 0.32, 0.72, 0.61, np.nan, 0.54, np.nan, 0.67, 0.66, 0.38, 0.46, np.nan, 0.52, np.nan, 0.62, 0.81]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display basic information about the dataset
df.info()

# Display the first few rows of the dataset
print(df.head())

# Display the count of missing values in each column
print(df.isnull().sum())
```
Output:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         20 non-null     object 
 1   Weekday      20 non-null     int64  
 2   Holiday      19 non-null     float64
 3   Temp         16 non-null     float64
 4   Humidity     17 non-null     float64
 5   Wind         19 non-null     float64
 6   Outlook      17 non-null     object 
 7   Crowdedness  15 non-null     float64
dtypes: float64(5), int64(1), object(2)
memory usage: 1.3+ KB
    Date  Weekday  Holiday  Temp  Humidity  Wind   Outlook  Crowdedness
0  08-01        0      0.0  25.1      99.0   0.0     rainy         0.14
1  08-02        1      0.0  26.4       NaN   0.0     sunny          NaN
2  08-03        2      0.0   NaN      96.0   0.0     rainy         0.21
3  08-04        3      0.0  24.1      68.0   0.0  overcast         0.68
4  08-05        4      NaN  24.7      98.0   0.0     rainy         0.20
Date           0
Weekday        0
Holiday        1
Temp           4
Humidity       3
Wind           1
Outlook        3
Crowdedness    5
dtype: int64
```
As we can see, our dataset contains 20 rows and 8 columns:
- Date: The date of the observation
- Weekday: Day of the week (0–6, where 0 is Monday)
- Holiday: Boolean indicating if it's a holiday (0 or 1)
- Temp: Temperature in Celsius
- Humidity: Humidity percentage
- Wind: Wind condition (0 or 1, possibly indicating calm or windy)
- Outlook: Weather outlook (sunny, overcast, or rainy)
- Crowdedness: Percentage of course occupancy
And look at that! We've got missing values in every column except Date and Weekday. Perfect for our imputation party.
Now that we have our dataset loaded, let's tackle these missing values with six different imputation methods. We'll use a different strategy for each type of data.
Method 1: Listwise Deletion
Listwise deletion, also known as complete case analysis, involves removing entire rows that contain any missing values. This method is simple, and it leaves the distribution of the remaining data unbiased as long as values are missing completely at random (MCAR). But it can lead to a significant loss of information if many rows contain missing values, and if the missingness is not completely random, the rows that survive are no longer a representative sample.
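As a minimal sketch of listwise deletion (using just the first five rows of the golf dataset above), pandas makes it a one-liner with `dropna`:

```python
import numpy as np
import pandas as pd

# First five rows of the golf-course dataset shown above
df = pd.DataFrame({
    'Date': ['08-01', '08-02', '08-03', '08-04', '08-05'],
    'Weekday': [0, 1, 2, 3, 4],
    'Holiday': [0.0, 0.0, 0.0, 0.0, np.nan],
    'Temp': [25.1, 26.4, np.nan, 24.1, 24.7],
    'Humidity': [99.0, np.nan, 96.0, 68.0, 98.0],
    'Wind': [0.0, 0.0, 0.0, 0.0, 0.0],
    'Outlook': ['rainy', 'sunny', 'rainy', 'overcast', 'rainy'],
    'Crowdedness': [0.14, np.nan, 0.21, 0.68, 0.20],
})

# Listwise deletion: drop every row that has at least one missing value
complete_cases = df.dropna()
print(complete_cases)  # only rows 0 and 3 survive
```

Notice the cost: 3 of these 5 rows disappear because of a single missing cell each, which is exactly the information loss the paragraph above warns about.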