2 Simple Steps To Reduce the Memory Usage of Your Pandas Dataframe

Data analysts and scientists work with large datasets on a day-to-day basis. The dataset size can quickly become an issue if it consumes all the available memory or when you need to manipulate the data in memory.
In this article, we will look at two simple tricks you can apply when loading a large dataset from a CSV file into a Pandas dataframe so that it fits into your available RAM:
Disclaimer: These steps can help reduce the amount of required memory of a dataframe, but they can't guarantee that your dataset will be small enough to fit in your RAM after applying them.
Step 1: Explore Which Columns Are Relevant for You
You will seldom need all columns of a dataset in memory. That is why the first step is to review which columns are relevant for you.
Since the whole dataset does not fit in your RAM at this point, the simple trick is to load only the first few rows (e.g., nrows = 1000) of the dataset.
Python">import pandas as pd
df = pd.read_csv(file_path,
nrows = 1000)
The number of rows nrows should be:
- small enough to fit the dataframe in your RAM and
- large enough to decide for each column if it is relevant for you or not.
Once you have evaluated which columns you need for your purposes, you can then proceed to load only those into the dataframe:
df = pd.read_csv(file_path,
                 usecols = ["column_1", ..., "column_n"])
Step 2: Reduce Data Types (Downcasting)
Since Pandas loads columns into the widest data type (e.g., integers as int64) by default, your initial dataframe might be larger than necessary. Thus, the second step is to evaluate whether you can reduce the data type to a narrower one.
For example, if you can reduce one column from an int64 (8 bytes) to an int8 (1 byte) in a dataframe with 1 million rows, you'll save 7 MB ((8 - 1) bytes * 1,000,000 rows). Now imagine if you can do this for multiple columns.
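To check this effect yourself, here is a minimal sketch using a made-up dataframe with a single hypothetical column column_1; it compares the memory usage before and after downcasting:

import numpy as np
import pandas as pd

# Made-up example: 1 million small integers, typically stored as int64 by default
df = pd.DataFrame({'column_1': np.random.randint(0, 100, size = 1_000_000)})

before = df['column_1'].memory_usage(deep = True)   # roughly 8 MB as int64
df['column_1'] = df['column_1'].astype(np.int8)
after = df['column_1'].memory_usage(deep = True)    # roughly 1 MB as int8

print(f"Saved about {(before - after) / 1e6:.1f} MB")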
For this step, it is essential to load the entire column to explore its minimum and maximum values. If the dataset with the reduced number of columns from step 1 still does not fit into your memory, you can load subsets of columns for this step.
Before we dive in, here are some handy Pandas commands to evaluate the columns' data types and their memory usage:
Data types – You can use the df.info() method or the df.dtypes attribute to find each column's data type.
Memory usage – To find out how many bytes one column or the whole dataframe uses, you can use the following commands:
- df.memory_usage(deep = True): How many bytes does each column use?
- df.memory_usage(deep = True).sum(): How many bytes does the whole dataframe use?
- df.info(memory_usage = "deep"): Summary of the dataframe including its total memory usage.
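Put together, a quick inspection of a loaded dataframe df might look like this:

# Inspect the data types of all columns
print(df.dtypes)

# Memory usage in bytes, per column and in total (including the index)
print(df.memory_usage(deep = True))
print(df.memory_usage(deep = True).sum())

# Summary that also reports the total memory usage
df.info(memory_usage = "deep")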
Now, how do you choose the correct data type?
First, answer the question "What generic data type do you have?" – A boolean, an integer, a floating point number, or a string?
Boolean
If you have a boolean variable, it would be a shame to use an 8 byte int64 to store zeros and ones.
Make sure that you are using one of the following two:
- bool for False or True
- uint8 for 0 or 1
Both only take up 1 byte.
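A minimal sketch, assuming a hypothetical 0/1 column called flag_column:

import numpy as np

# Downcast the 0/1 column from int64 (8 bytes) to uint8 (1 byte) ...
df['flag_column'] = df['flag_column'].astype(np.uint8)

# ... or store it as bool if you prefer True/False semantics
df['flag_column'] = df['flag_column'].astype(bool)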
Integers
For integers, you need to look at the entire value range of the variable. Specifically, you need to determine the minimum and maximum values.
# Get the statistical properties of the full dataframe
df.describe()
# Or specifically get minimum and maximum values
df['column_1'].min()
df['column_1'].max()
If you are more of a visual type, you can alternatively plot the boxplot of that column:
import seaborn as sns
sns.boxplot(data = df['column_1']);
If the minimum value is greater than or equal to 0, you can use an unsigned integer. Then select the narrowest data type according to the maximum value.
- uint8: 1 byte unsigned integer (0 to 255)
- uint16: 2 byte unsigned integer (0 to 65,535)
- uint32: 4 byte unsigned integer (0 to 4,294,967,295)
- uint64: 8 byte unsigned integer (0 to 18,446,744,073,709,551,615)
If the minimum value is smaller than 0, you will use a signed integer. Then select the narrowest data type according to the minimum and maximum value.
- int8: 1 byte signed integer (-128 to 127)
- int16: 2 byte signed integer (-32,768 to 32,767)
- int32: 4 byte signed integer (-2,147,483,648 to 2,147,483,647)
- int64: 8 byte signed integer (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
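If you do not want to pick the type by hand, you can also let Pandas choose the narrowest integer type that fits the observed value range. A minimal sketch using pd.to_numeric:

import pandas as pd

# Let Pandas pick the narrowest fitting integer type for the column
if df['column_1'].min() >= 0:
    df['column_1'] = pd.to_numeric(df['column_1'], downcast = 'unsigned')
else:
    df['column_1'] = pd.to_numeric(df['column_1'], downcast = 'signed')

print(df['column_1'].dtype)  # e.g., uint8 or int16, depending on the values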
Floating point numbers
Similar to integers, you need to know the minimum and maximum values to determine the required width of a float data type. Additionally, you need to know the required precision:
- float16 (half-precision): 2 byte floating-point number (approximately -65,504 to 65,504 with approximately 4 decimal digits)
- float32 (single-precision): 4 byte floating-point number (approximately -3.4 x 10^38 to 3.4 x 10^38 with approximately 7 decimal digits)
- float64 (double-precision): 8 byte floating-point number (approximately -1.8 x 10^308 to 1.8 x 10^308 with approximately 16 decimal digits)
Please note that the precision may be limited to fewer decimal digits due to the limited number of bits available to represent the number.
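A similar sketch for floats; note that pd.to_numeric only downcasts as far as float32, so float16 has to be requested explicitly with astype:

import numpy as np
import pandas as pd

# Downcast to the narrowest float type that pd.to_numeric supports (float32)
df['column_1'] = pd.to_numeric(df['column_1'], downcast = 'float')

# Go down to float16 explicitly if the value range and precision allow it
df['column_1'] = df['column_1'].astype(np.float16)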
Strings and Objects
Since strings have variable lengths, they are stored as object in a dataframe by default. You could store them as fixed-length strings with a maximum length instead:
# Convert object data type to string with a maximum length of 4 characters
df['column_1'] = df['column_1'].astype('|S4')
However, a string of length N uses N bytes.
Thus, you might want to review if any of the following options apply:
Categorical variables – If you have a categorical variable, you could label encode or one hot encode the column.
Identifiers – A common type of string variable is an identifier. If applicable, you could experiment with converting the string to an integer.
For example, you could take the last 16 characters of the identifier string and convert them to an int64 as shown below. Just make sure that each identifier stays unique in this process.
# Interpret the last 16 characters of the ID as a hexadecimal number
df['column_id'] = df['column_id'].apply(lambda x: int(x[-16:], 16)).astype('int64')
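For the categorical case mentioned above, a minimal label-encoding sketch via Pandas' category dtype (assuming a hypothetical low-cardinality column column_cat):

# Store the repeated string values as categories ...
df['column_cat'] = df['column_cat'].astype('category')

# ... or label encode them as small integers derived from the categories
df['column_cat'] = df['column_cat'].astype('category').cat.codes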
In the process of downcasting or narrowing the data types, you need to be aware of the loss of precision for floating point numbers and overflow for integers.
For example, let's see what happens when we cast a uint16 with a = 256 to a uint8 (value range of 0 to 255).

As you can see, the largest number that a uint8 can hold in 1 byte (or 8 bits) is 255. If you want to store the number 256, you need at least 9 bits. Since a uint8 will only keep the first 8 bits, the 9th bit is cut off.
Thus, if you downcast a uint16 with a = 256 to a uint8, a becomes 0.
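You can reproduce this overflow with a small NumPy sketch:

import numpy as np

a = np.array([256], dtype = np.uint16)
print(a.astype(np.uint8))  # prints [0]: the 9th bit of 256 is cut off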
Now that you know which data type each relevant column should have, you can downcast them. Since a CSV file does not contain any information about the data types of its columns, you have to specify the data type for each column during loading.
import numpy as np

df = pd.read_csv(file_path,
                 usecols = ["column_1", ..., "column_n"],
                 dtype = {"column_1" : np.int8,
                          ...,
                          "column_n" : np.float16})
You can also change the data type of a column after a dataframe is loaded:
df['column_1'] = df['column_1'].astype(np.int8)
Summary
This article explained how you could reduce the memory size of a Pandas dataframe when loading a large dataset from a CSV file. We learned that reducing the number of columns and then downcasting their data types can significantly reduce the amount of required memory. By utilizing these techniques, we can ensure that we use the available RAM efficiently.
After you apply these tricks, you can also save the dataframe in a file format that, unlike CSV, remembers the data types of its columns (e.g., Parquet or Feather).
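A minimal sketch using a hypothetical file name dataset.parquet (writing Parquet requires an engine such as pyarrow or fastparquet to be installed):

# Save the downcast dataframe; Parquet keeps the column data types
df.to_parquet('dataset.parquet')

# Later loads restore the narrow data types without specifying them again
df = pd.read_parquet('dataset.parquet')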
Additionally, to speed up processing on a large dataset, you could also have a look at the Pandas alternative Polars.