Cleaning a Messy Car Dataset with Python Pandas

The web is a highly valuable data source. For instance, a substantial amount of the training data used to create large language models comes from the web.
However, it's usually not in the most suitable format. Web data is mainly unstructured (i.e. in the form of free text). Even if it has a predefined structure, web data requires lots of cleaning and preprocessing before it can be used for analytical purposes.
In this article, we'll take a messy dataset that includes the price and some other attributes of cars and clean it using the pandas library.
You can download the dataset from my datasets repository if you want to follow along and execute the code yourself. It's called "mock_car_dataset". Some of the operations we'll perform on this messy dataset are as follows:
- String manipulation
- Handling data types
- Filtering based on strings
- Replacing values
- Updating column values using other columns
- Formatting numerical data
- Analyzing the data to detect issues
I created the dataset with mock data. However, it's just like a car dataset you'd scrape from the web. I know it because I've done it before.
The dataset is in CSV format. Let's start with creating a pandas DataFrame from this file.
import pandas as pd
# create DataFrame
cars = pd.read_csv("mock_car_dataset.csv")
# check the shape
cars.shape
# output
(20, 6)
The dataset contains 20 rows and 6 columns, which means we have data of 6 attributes for 20 cars. Although it's a small dataset, the operations we'll do can easily be applied to much larger datasets (i.e. hundreds of thousands of rows).
Let's see what these attributes are. Calling cars.head() displays the first 5 rows of the DataFrame, with the columns price, year, make, model, cylinders, and odometer.
The next step is to check the data types. We can do so by using the dtypes
attribute:
cars.dtypes
# output
price object
year int64
make object
model object
cylinders object
odometer int64
dtype: object
Some of these columns should definitely be converted to a numeric data type, but we can't do it in their current format. For instance, it's not possible to directly convert the string "$11,250" to the integer 11250. Similarly, the "6 cylinders" value needs some cleanup before it can be converted to 6.
Pandas is a highly versatile tool for string manipulation. Operations that seem complex can often be done in a single line of code.
For instance, to convert the price values to integer data type, we first need to remove "$" and "," from price values and then apply the astype
function to change the data type. Assuming the first character of all price values is "$", we can perform this operation as follows:
cars["price"] = cars["price"].str[1:].str.replace(",", "").astype("int")
str[1:]
selects all characters starting from the second one, which removes the "$". Then we replace "," with an empty string and finally change the data type. But let's first make sure the first character of every price value really is "$".
# select only the first characters in the price column and check unique values
cars.price.str[0].unique()
# output
array(['$'], dtype=object)
We can now use the following line of code to fix the price column:
cars["price"] = cars["price"].str[1:].str.replace(",", "").astype("int")
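Note that this assumes every price starts with "$". If that's not guaranteed, which is common with scraped data, a more defensive variant strips every non-digit character with a regex. Here is a sketch on mock values, not the actual dataset:

```python
import pandas as pd

# mock price values: one of them is missing the "$" symbol
prices = pd.Series(["$11,250", "9800", "$7,500"])

# drop everything that's not a digit, then convert to int
cleaned = prices.str.replace(r"\D", "", regex=True).astype(int)
print(cleaned.tolist())  # [11250, 9800, 7500]
```

This version doesn't care where the "$" or "," characters appear, or whether they appear at all.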
The next column to handle is the cylinders column. Let's see the value distribution:
cars["cylinders"].value_counts()
# output
6 cylinders 9
8 cylinders 5
4 cylinders 4
other 2
Name: cylinders, dtype: int64
The value_counts
function shows the unique values in a column along with their number of occurrences. What we need is to remove " cylinders" from the values and replace "other" with a value of our choice. I prefer to use the most frequent value as the replacement, which is 6. The following line of code performs both replacements:
cars["cylinders"] = cars["cylinders"].str.replace(" cylinders", "").replace("other", 6)
# check the value distribution again
cars["cylinders"].value_counts()
# output
6 9
8 5
4 4
6 2
Name: cylinders, dtype: int64
In the first line of the code snippet above, you may notice that we first used str.replace
and then used replace:
- str.replace replaces a sequence of characters within each string value
- replace replaces entire values in a column
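One detail worth noting: value_counts above lists 6 twice because "other" was replaced with the integer 6 while the remaining values are still strings, so the column still needs the astype conversion. The sketch below works on mock values, not the actual dataset, and also derives the most frequent value with mode() instead of hard-coding it:

```python
import pandas as pd

# mock values mirroring the cylinders column
cylinders = pd.Series(["6 cylinders", "8 cylinders", "6 cylinders",
                       "4 cylinders", "other"], dtype=object)

stripped = cylinders.str.replace(" cylinders", "")

# derive the most frequent value instead of hard-coding it
most_common = stripped[stripped != "other"].mode()[0]

# replace "other", then unify the data type so value_counts
# no longer lists the same number twice
cleaned = stripped.replace("other", most_common).astype(int)
print(cleaned.value_counts())
```

After astype, all values share a single integer dtype, so 6 appears as one row in value_counts.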
Let's take a look at how the DataFrame looks now:

The make
column contains both uppercase and lowercase letters, which may cause identical values to be treated as different. Here is a quick test to confirm this:
cars[cars["make"].str.contains("onda")]["make"].value_counts()
# output
honda 1
Honda 1
Name: make, dtype: int64
"Honda" and "honda" are the same brand but they are not the same strings. We can quickly solve this problem by converting all the characters to lowercase or uppercase.
cars["make"] = cars["make"].str.lower()
# check the first 5 values in the make column
cars["make"][:5]
# output
0 nissan
1 chrysler
2 porsche
3 honda
4 lincoln
Name: make, dtype: object
Going more specific
When I check the unique values in the "make" column, I see values like 2007 and 2014, which do not represent any brand (at least to my knowledge). This must be caused by wrong entries in free-form fields, a common issue when scraping data from the web.
Let's check the year, make, and model values for these unexpected brands.
cars[cars["make"].isin(["2007", "2014"])]
The isin
method allows for filtering rows based on multiple values. The line above returns the rows whose make value is "2007" or "2014".
The issue seems to be the year value being entered twice, which pushed the brand (i.e. make) value over into the model column. There are different alternatives for fixing this issue.
One that comes to my mind is to split the model value and use the first part as the replacement for the value in the make column.
The following line of code splits the model column for rows in which the make value is 2007.
cars.loc[cars["make"].isin(["2007"]), "model"].str.split(" ", expand=True)

It looks a bit weird. The reason is that some of the values contain a space character at the beginning, which results in 3 items after splitting; for the ones without a leading space, the last item becomes None. We can solve this issue by using the strip
method, which removes leading and trailing whitespace.
cars.loc[cars["make"].isin(["2007"]), "model"].str.strip().str.split(" ", expand=True)

It is accurate now and looks much better. The next step is to apply this code to all the rows that have a year value in the make column.
To find the make values to replace (i.e. year in the make column), you can execute the following line of code:
pd.Series(cars["make"].unique()).sort_values()
It finds the unique values in the make column and sorts them, which puts the year values ("2007" and "2014") at the beginning of the output.
We can now create a list that contains 2007 and 2014 and then use it for filtering the rows in which the make values need to be replaced. The following code snippet performs the task we need:
to_replace = ["2007", "2014"]
cars.loc[cars["make"].isin(to_replace), "make"] = (
    cars.loc[cars["make"].isin(to_replace), "model"].str.strip().str.split(" ", expand=True)[0]
)
Let's confirm by checking if there are any year values in the make column.
pd.Series(cars["make"].unique()).sort_values()

All the year values in the make column are gone.
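As a self-contained sketch, the whole fix can be reproduced on a couple of mock rows (the values below are made up for illustration, not taken from the actual dataset):

```python
import pandas as pd

# mock rows reproducing the issue: the year leaked into the make column,
# and the brand was pushed into the model column
cars = pd.DataFrame({
    "year": [2007, 2014],
    "make": ["2007", "2014"],
    "model": [" honda civic", "nissan altima"],
})

mask = cars["make"].isin(["2007", "2014"])
# strip leading whitespace, split on spaces, keep the first token as the brand
cars.loc[mask, "make"] = cars.loc[mask, "model"].str.strip().str.split(" ", expand=True)[0]
print(cars["make"].tolist())  # ['honda', 'nissan']
```

The assignment aligns on the row index, so only the masked rows are updated.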
Final words
Last but not least, there are many duplicate rows in the DataFrame. We can eliminate them using the drop_duplicates
function. By default, it removes rows in which all the column values are the same. If you want to determine duplicates based on the values in a column or a set of columns, you can use the subset
parameter.
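A minimal sketch of both behaviors on a few mock rows:

```python
import pandas as pd

# mock rows: the first two are exact duplicates,
# the third matches them on make and model but not on price
df = pd.DataFrame({
    "make":  ["honda", "honda", "honda"],
    "model": ["civic", "civic", "civic"],
    "price": [11250, 11250, 11900],
})

# default: a row is a duplicate only if every column matches
full = df.drop_duplicates()

# subset: decide duplicates from the selected columns only
by_model = df.drop_duplicates(subset=["make", "model"])

print(len(full), len(by_model))  # 2 1
```

By default the first occurrence is kept; the keep parameter controls this.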
Data cleaning is an essential part of any data workflow. Whether you are performing exploratory data analysis or building a complex ML system, you need to make sure the data is clean. Otherwise, whatever you produce is likely to be unreliable.
Thank you for reading. Please let me know if you have any feedback.