Encoding Categorical Data, Explained: A Visual Guide with Code Example for Beginners
DATA PREPROCESSING

Ah, Categorical Data – the colorful characters in our datasets that machines just can't seem to understand. This is where "red" becomes 1, "blue" 2, and data scientists turn into language translators (or more like matchmakers?).
Now, I know what you're thinking: "Encoding? Isn't that just assigning numbers to categories?" Oh, if only it were that simple! We're about to explore six different encoding methods, all on (again) a single, tiny dataset (with visuals, of course!). From simple labels to mind-bending cyclic transformations, you'll see why choosing the right encoding can be as important as picking the perfect algorithm.

What Is Categorical Data and Why Does It Need Encoding?
Before we jump into our dataset and encoding methods, let's take a moment to understand what categorical data is and why it needs special treatment in the world of machine learning.
What Is Categorical Data?
Categorical data is like the descriptive labels we use in everyday life. It represents characteristics or qualities that can be grouped into categories.
Why Does Categorical Data Need Encoding?
Here's the catch: most machine learning algorithms are like picky eaters – they only digest numbers. They can't directly understand that "sunny" is different from "rainy". That's where encoding comes in. It's like translating these categories into a language that machines can understand and work with.
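To make the "picky eater" idea concrete, here is a minimal sketch. The `predict` function is a hypothetical toy model, not something from a real library; it only does the kind of multiply-and-add arithmetic that many algorithms perform internally.

```python
def predict(x, weight=0.5, bias=1.0):
    """Toy linear model (hypothetical): multiply the input by a weight, add a bias."""
    return weight * x + bias


# A number goes through the arithmetic without trouble
print(predict(3))

# A raw category does not: Python cannot multiply a string by a float
try:
    predict("sunny")
    failed = False
except TypeError as err:
    failed = True
    print("Cannot use a raw category:", err)
```

The exact error differs from library to library, but the root cause is the same: the math needs numbers.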
Types of Categorical Data
Not all categories are created equal. We generally have two types:
- Nominal: These are categories with no inherent order. Ex: "Outlook" (sunny, overcast, rainy) is nominal. There's no natural ranking between these weather conditions.
- Ordinal: These categories have a meaningful order. Ex: "Temperature" (Very Low, Low, High, Very High) is ordinal. There's a clear progression from coldest to hottest.
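One way to express this difference in code is with pandas' `Categorical` type. This is only a sketch using the example values from the two bullets above: the ordinal version states its order explicitly, while the nominal one refuses to be ranked.

```python
import pandas as pd

# Nominal: categories with no order among them
outlook = pd.Categorical(["sunny", "rainy", "overcast"], ordered=False)

# Ordinal: the order is stated explicitly, from coldest to hottest
temperature = pd.Categorical(
    ["High", "Very Low", "Very High", "Low"],
    categories=["Very Low", "Low", "High", "Very High"],
    ordered=True,
)

# Integer positions follow the stated order, not the alphabet
print(temperature.codes)

# Comparisons are meaningful for ordinal data
print(temperature > "Low")

# Asking for the "smallest" nominal category is rejected
try:
    outlook.min()
    nominal_has_min = True
except TypeError:
    nominal_has_min = False
print("Nominal data can be ranked:", nominal_has_min)
```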

Why Care About Proper Encoding?
- It preserves important information in your data.
- It can significantly impact your model's performance.
- Incorrect encoding can introduce unintended biases or relationships.
Imagine if we encoded "sunny" as 1 and "rainy" as 2. The model might think rainy days are "greater than" sunny days, which isn't what we want!
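Here is a small sketch of how that goes wrong. The codes below are an arbitrary assignment chosen for illustration (with "overcast" added as 3); the point is that ordinary arithmetic on them produces a statement about the weather that makes no sense.

```python
from statistics import mean

# Arbitrary integer codes for a nominal variable (illustrative assignment)
codes = {"sunny": 1, "rainy": 2, "overcast": 3}
decode = {code: name for name, code in codes.items()}

# Average the codes of one sunny day and one overcast day
days = ["sunny", "overcast"]
average_code = mean(codes[day] for day in days)

# The average lands exactly on the code for "rainy"
print(average_code, "->", decode[average_code])
```

Nothing about a sunny day and an overcast day averages out to rain; the result is purely an artifact of the numbers we happened to pick.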
Now that we understand what categorical data is and why it needs encoding, let's take a look at our dataset and see how we can tackle its categorical variables using six different encoding methods.
The Dataset
Let's use a simple golf dataset, made up mostly of categorical columns, to illustrate our encoding methods. It records the weather conditions on each day and the resulting crowdedness at a golf course.

import pandas as pd
import numpy as np
data = {
    'Date': ['03-25', '03-26', '03-27', '03-28', '03-29', '03-30', '03-31', '04-01', '04-02', '04-03', '04-04', '04-05'],
    'Weekday': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
    'Month': ['Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Apr', 'Apr', 'Apr', 'Apr', 'Apr'],
    'Temperature': ['High', 'Low', 'High', 'Extreme', 'Low', 'High', 'High', 'Low', 'High', 'Extreme', 'High', 'Low'],
    'Humidity': ['Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Humid', 'Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Dry'],
    'Wind': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes'],
    'Outlook': ['sunny', 'rainy', 'overcast', 'sunny', 'rainy', 'overcast', 'sunny', 'rainy', 'sunny', 'overcast', 'sunny', 'rainy'],
    'Crowdedness': [85, 30, 65, 45, 25, 90, 95, 35, 70, 50, 80, 45]
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
As we can see, we have a lot of categorical variables. Our task is to encode these variables so that a machine learning model can use them to predict, say, the Crowdedness of the golf course.
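If you'd rather let pandas point out the categorical columns, a quick check like the one below can help. It is a sketch on a hypothetical `sample` frame holding a few rows and columns copied from the dataset (so the snippet runs on its own), and it assumes that any non-numeric column should be treated as categorical.

```python
import pandas as pd

# A few rows and columns copied from the golf dataset, so the snippet is self-contained
sample = pd.DataFrame({
    "Weekday": ["Mon", "Tue", "Wed", "Thu"],
    "Temperature": ["High", "Low", "High", "Extreme"],
    "Outlook": ["sunny", "rainy", "overcast", "sunny"],
    "Crowdedness": [85, 30, 65, 45],
})

# Assumption: every non-numeric column is a categorical variable
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)

# List the distinct categories found in each of those columns
for col in categorical_cols:
    print(col, sorted(sample[col].unique()))
```

The same two steps should work on the full `df`; only the column list gets longer.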
Let's get into it.
Method 1: Label Encoding
Label Encoding assigns a unique integer to each category in a categorical variable.
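As a rough illustration of the idea in plain Python (a sketch, not a library implementation): collect the distinct categories, number them, and swap each value for its number. Numbering the categories alphabetically is an assumption made here for reproducibility; any consistent assignment would count as label encoding.

```python
# Outlook values for the first few days of the golf dataset
outlook = ["sunny", "rainy", "overcast", "sunny", "rainy"]

# Assumption: categories are numbered in alphabetical order
categories = sorted(set(outlook))
label_map = {category: code for code, category in enumerate(categories)}

# Replace every value with its integer label
encoded = [label_map[value] for value in outlook]

print(label_map)  # {'overcast': 0, 'rainy': 1, 'sunny': 2}
print(encoded)    # [2, 1, 0, 2, 1]
```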
Common Use