Encoding Categorical Data, Explained: A Visual Guide with Code Example for Beginners

Author: Murphy | Views: 25638 | Time: 2025-03-23 11:33:44

DATA PREPROCESSING


Ah, Categorical Data – the colorful characters in our datasets that machines just can't seem to understand. This is where "red" becomes 1, "blue" 2, and data scientists turn into language translators (or more like matchmakers?).

Now, I know what you're thinking: "Encoding? Isn't that just assigning numbers to categories?" Oh, if only it were that simple! We're about to explore six different encoding methods, all on (again) a single, tiny dataset (with visuals, of course!). From simple labels to mind-bending cyclic transformations, you'll see why choosing the right encoding can be as important as picking the perfect algorithm.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

What Is Categorical Data and Why Does It Need Encoding?

Before we jump into our dataset and encoding methods, let's take a moment to understand what categorical data is and why it needs special treatment in the world of machine learning.

What Is Categorical Data?

Categorical data is like the descriptive labels we use in everyday life. It represents characteristics or qualities that can be grouped into categories.

Why Does Categorical Data Need Encoding?

Here's the catch: most machine learning algorithms are like picky eaters – they only digest numbers. They can't directly understand that "sunny" is different from "rainy". That's where encoding comes in. It's like translating these categories into a language that machines can understand and work with.
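To make the "picky eater" problem concrete, here is a minimal sketch in plain Python. The helper name `to_number` is hypothetical and exists only for this illustration; it mimics what numeric code does when handed a raw label.

```python
def to_number(value):
    """Try to read a raw value as a float; return None when that is impossible."""
    try:
        return float(value)
    except ValueError:
        return None

raw_outlook = ["sunny", "rainy", "overcast"]
converted = [to_number(v) for v in raw_outlook]
print(converted)  # [None, None, None]
```

None of the labels can be read as a number, so anything that relies on arithmetic has nothing to work with until we encode them.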

Types of Categorical Data

Not all categories are created equal. We generally have two types:

Nominal: These are categories with no inherent order. Ex: "Outlook" (sunny, overcast, rainy) is nominal. There's no natural ranking between these weather conditions.
Ordinal: These categories have a meaningful order. Ex: "Temperature" (Low, High, Extreme) is ordinal. There's a clear progression from coldest to hottest.
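The nominal/ordinal distinction can be sketched with pandas' `Categorical` type. This is one possible way to express it, assuming three illustrative temperature levels (Low, High, Extreme, the values used in the golf dataset); the sample values themselves are made up for the example.

```python
import pandas as pd

# Ordinal: the level order is supplied explicitly (assumed order: Low < High < Extreme).
temperature = pd.Categorical(
    ["High", "Low", "Extreme", "Low"],
    categories=["Low", "High", "Extreme"],
    ordered=True,
)

# Nominal: no order is declared, so pandas treats the categories as unordered.
outlook = pd.Categorical(["sunny", "rainy", "overcast", "sunny"])

print(temperature.ordered)         # True
print(outlook.ordered)             # False
print(temperature.max())           # Extreme
print(temperature.codes.tolist())  # [1, 0, 2, 0]
```

Only the ordered variable supports comparisons such as `max()`, and its integer codes follow the order we declared rather than an arbitrary one.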

Why Care About Proper Encoding?

It preserves important information in your data.
It can significantly impact your model's performance.
Incorrect encoding can introduce unintended biases or relationships.

Imagine if we encoded "sunny" as 1 and "rainy" as 2. The model might think rainy days are "greater than" sunny days, which isn't what we want!
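Here is a small sketch of how arbitrary integers sneak in relationships that don't exist. The mapping below is hypothetical (it adds "overcast" so there are three values to compare); the arithmetic is the kind a linear model or a distance-based algorithm performs implicitly.

```python
# Arbitrary integer codes for a nominal feature (hypothetical mapping).
codes = {"sunny": 1, "overcast": 2, "rainy": 3}

# "overcast" ends up looking like the average of "sunny" and "rainy".
midpoint = (codes["sunny"] + codes["rainy"]) / 2
print(midpoint == codes["overcast"])  # True

# "rainy" ends up twice as far from "sunny" as "overcast" is.
gap_sunny_rainy = abs(codes["sunny"] - codes["rainy"])
gap_sunny_overcast = abs(codes["sunny"] - codes["overcast"])
print(gap_sunny_rainy, gap_sunny_overcast)  # 2 1
```

Neither of these "facts" says anything about the weather. They are side effects of the numbers we happened to pick.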

Now that we understand what categorical data is and why it needs encoding, let's take a look at our dataset and see how we can tackle its categorical variables using six different encoding methods.

The Dataset

Let's use a simple golf dataset, made up mostly of categorical columns, to illustrate our encoding methods. It records various weather conditions and the resulting crowdedness at a golf course.

import pandas as pd
import numpy as np

data = {
    'Date': ['03-25', '03-26', '03-27', '03-28', '03-29', '03-30', '03-31', '04-01', '04-02', '04-03', '04-04', '04-05'],
    'Weekday': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
    'Month': ['Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Apr', 'Apr', 'Apr', 'Apr', 'Apr'],
    'Temperature': ['High', 'Low', 'High', 'Extreme', 'Low', 'High', 'High', 'Low', 'High', 'Extreme', 'High', 'Low'],
    'Humidity': ['Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Humid', 'Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Dry'],
    'Wind': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes'],
    'Outlook': ['sunny', 'rainy', 'overcast', 'sunny', 'rainy', 'overcast', 'sunny', 'rainy', 'sunny', 'overcast', 'sunny', 'rainy'],
    'Crowdedness': [85, 30, 65, 45, 25, 90, 95, 35, 70, 50, 80, 45]
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

As we can see, we have a lot of categorical variables. Our task is to encode these variables so that a machine learning model can use them to predict, say, the Crowdedness of the golf course.
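A quick way to check which columns need encoding is to list everything that isn't numeric. The sketch below uses a three-row subset of the golf dataset (repeated here so the snippet runs on its own); the variable names `sample` and `categorical_cols` are just for illustration.

```python
import pandas as pd

# A three-row subset of the golf dataset, repeated so the snippet is self-contained.
sample = pd.DataFrame({
    "Weekday": ["Mon", "Tue", "Wed"],
    "Temperature": ["High", "Low", "High"],
    "Outlook": ["sunny", "rainy", "overcast"],
    "Crowdedness": [85, 30, 65],
})

# Treat every non-numeric column as categorical.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)  # ['Weekday', 'Temperature', 'Outlook']

# Show the distinct values each categorical column takes.
for col in categorical_cols:
    print(col, sorted(sample[col].unique()))
```

Running the same two steps on the full `df` would also flag Date, Month, Humidity, and Wind, leaving Crowdedness as the only numeric column.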

Let's get into it.

Method 1: Label Encoding

Label Encoding assigns a unique integer to each category in a categorical variable.
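As a minimal sketch, assuming we only want to see the idea in action, label encoding can be done with pandas' category codes. The sample values are taken from the Outlook column; this is one possible approach, not necessarily the one used later in the article.

```python
import pandas as pd

outlook = pd.Series(["sunny", "rainy", "overcast", "sunny", "rainy"], name="Outlook")

# Convert to a categorical dtype; the categories are sorted alphabetically by default.
as_category = outlook.astype("category")
outlook_encoded = as_category.cat.codes

# Keep the mapping so each integer can be traced back to its label.
mapping = {label: code for code, label in enumerate(as_category.cat.categories)}
print(mapping)                   # {'overcast': 0, 'rainy': 1, 'sunny': 2}
print(outlook_encoded.tolist())  # [2, 1, 0, 2, 1]
```

scikit-learn's `LabelEncoder` assigns integers in the same alphabetical fashion; pandas is used here only to avoid an extra dependency. Either way, the integers are arbitrary identifiers, so the ordering caveat from earlier still applies to nominal columns.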

Common Use
