Pandas 2.0: A Game-Changer for Data Scientists?

Due to its extensive functionality and versatility, pandas has secured a place in every data scientist's heart. From data input/output to data cleaning and transformation, it's nearly impossible to think about data manipulation without `import pandas as pd`, right?
Now, bear with me: with such a buzz around LLMs over the past months, I have somehow let slide the fact that pandas has just undergone a major release! Yep, pandas 2.0 is out and came with guns blazing!
Although I wasn't aware of all the hype, the Data-Centric AI Community promptly came to the rescue:

Fun fact: Were you aware this release was in the making for an astonishing 3 years? Now that's what I call "commitment to the community"!
So what does pandas 2.0 bring to the table? Let's dive right into it!
1. Performance, Speed, and Memory-Efficiency
As we all know, pandas was built using numpy, which was not intentionally designed as a backend for dataframe libraries. For that reason, one of the major limitations of pandas was handling in-memory processing for larger datasets.
In this release, the big change comes from the introduction of the Apache Arrow backend for pandas data.
Essentially, Arrow is a standardized in-memory columnar data format with available libraries for several programming languages (C, C++, R, Python, among others). For Python there is PyArrow, which is based on the C++ implementation of Arrow, and therefore, fast!
So, long story short, PyArrow takes care of the memory constraints we had with the 1.X versions and allows us to conduct faster and more memory-efficient data operations, especially for larger datasets.
Here's a comparison between reading the data without and with the `pyarrow` backend, using the Hacker News dataset, which is around 650 MB (License CC BY-NC-SA 4.0):
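A minimal sketch of that comparison, assuming the dataset is saved locally as `hacker_news.csv` (the file name is a placeholder):

```python
import pandas as pd

# Baseline: default C engine with numpy-backed dtypes (pandas 1.x behavior)
df = pd.read_csv("hacker_news.csv")

# pandas 2.0: pyarrow's multithreaded CSV reader plus Arrow-backed dtypes
df_arrow = pd.read_csv(
    "hacker_news.csv",
    engine="pyarrow",          # parse the file with pyarrow
    dtype_backend="pyarrow",   # store columns as ArrowDtype instead of numpy
)
```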
As you can see, using the new backend makes reading the data nearly 35x faster. Other aspects worth pointing out:
- Without the `pyarrow` backend, each column/feature is stored as its own unique data type: numeric features are stored as `int64` or `float64`, while string values are stored as objects;
- With `pyarrow`, all features are using the Arrow dtypes: note the `[pyarrow]` annotation and the different types of data: `int64`, `float64`, `string`, `timestamp`, and `double`, as shown in the snippet below.
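Continuing from the `read_csv` sketch above, a quick way to see the difference is to print the dtypes of both dataframes (the column names in the comments are illustrative, since they depend on the dataset):

```python
# Inspect how each backend reports the column types
print(df.dtypes)        # e.g. id: int64, title: object, time: int64, ...
print(df_arrow.dtypes)  # e.g. id: int64[pyarrow], title: string[pyarrow], ...
```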
2. Arrow Data Types and Numpy Indices
Beyond reading data, which is the simplest case, you can expect additional improvements for a series of other operations, especially those involving string operations, since `pyarrow`'s implementation of the string datatype is quite efficient:
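As a rough illustration (the data here is made up, and timings will vary by machine, so treat this as a sketch rather than a rigorous benchmark):

```python
import timeit
import pandas as pd

# The same string operation on an object-backed vs. an Arrow-backed Series
s_object = pd.Series(["data-centric ai"] * 1_000_000, dtype="object")
s_arrow = pd.Series(["data-centric ai"] * 1_000_000, dtype="string[pyarrow]")

print(timeit.timeit(lambda: s_object.str.upper(), number=10))
print(timeit.timeit(lambda: s_arrow.str.upper(), number=10))
```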
In fact, Arrow has more (and better support for) data types than numpy, which are needed outside the scientific (numerical) scope: dates and times, duration, binary, decimals, lists, and maps. Skimming through the equivalence between pyarrow-backed and numpy data types might actually be a good exercise in case you want to learn how to leverage them.
It is also now possible to hold more numpy numeric types in indices. The traditional `int64`, `uint64`, and `float64` have opened up space for all numpy numeric dtypes as Index values, so we can, for instance, specify their 32-bit versions instead:
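For example, a minimal sketch of a 32-bit Index (values made up for illustration):

```python
import pandas as pd

# pandas 2.0 keeps the requested 32-bit dtype instead of upcasting to int64
idx = pd.Index([1, 2, 3], dtype="int32")
print(idx.dtype)  # int32

df = pd.DataFrame({"points": [10, 20, 30]}, index=idx)
print(df.index.dtype)  # int32
```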
This is a welcome change, since indices are one of the most used functionalities in pandas, allowing users to filter, join, and shuffle data, among other data operations. Essentially, the lighter the Index is, the more efficient those processes will be!
3. Easier Handling of Missing Values
Being built on top of numpy made it hard for pandas to handle missing values in a hassle-free, flexible way, since numpy does not support null values for some data types.
For instance, integers are automatically converted to floats, which is not ideal:
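A minimal sketch of this behavior, using a made-up `points` column:

```python
import pandas as pd

points = pd.Series([12, 7, 9, 3])
print(points.dtype)  # int64

# A single missing value silently upcasts the whole column to float64
points_with_na = pd.Series([12, 7, None, 3])
print(points_with_na.dtype)  # float64
```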
Note how `points` automatically changes from `int64` to `float64` after the introduction of a single `None` value.
There is nothing worse for a data flow than wrong typesets, especially within a data-centric AI paradigm.
Erroneous typesets directly impact data preparation decisions, cause incompatibilities between different chunks of data, and even when passing silently, they might compromise certain operations that output nonsensical results in return.
As an example, at the Data-Centric AI Community, we're currently working on a project around synthetic data for data privacy. One of the features, `NOC` (number of children), has missing values and is therefore automatically converted to `float` when the data is loaded. Then, when passing the data into a generative model as a `float`, we might get output values as decimals such as 2.5. Unless you're a mathematician with 2 kids, a newborn, and a weird sense of humor, having 2.5 children is not OK.
In pandas 2.0, we can leverage `dtype_backend='numpy_nullable'`, where missing values are accounted for without any dtype changes, so we can keep our original data types (`int64` in this case):
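A small sketch, assuming a CSV with a missing entry in the `points` column (the data below is made up):

```python
import io
import pandas as pd

csv_data = io.StringIO("name,points\nAna,12\nBruno,\nCarla,9\n")

# With the numpy-nullable backend, "points" stays an integer (Int64)
# and the missing entry becomes pd.NA instead of NaN
df = pd.read_csv(csv_data, dtype_backend="numpy_nullable")
print(df["points"].dtype)         # Int64
print(df["points"].isna().sum())  # 1
```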
It might seem like a subtle change, but under the hood it means that now pandas can natively use Arrow's implementation of dealing with missing values. This makes operations much more efficient, since pandas doesn't have to implement its own version for handling null values for each data type.
4. Copy-On-Write Optimization
Pandas 2.0 also adds a new lazy copy mechanism that defers copying DataFrames and Series objects until they are modified.
This means that certain methods will return views rather than copies when copy-on-write is enabled, which improves memory efficiency by minimizing unnecessary data duplication.
It also means you need to be extra careful when using chained assignments.
If the copy-on-write mode is enabled, chained assignments will not work because they point to a temporary object that is the result of an indexing operation (which under copy-on-write behaves as a copy).
When `copy_on_write` is disabled, operations like slicing may change the original `df` if the new dataframe is changed:
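A minimal sketch of the default (disabled) behavior, with made-up data:

```python
import pandas as pd

pd.set_option("mode.copy_on_write", False)  # the default in pandas 2.0

df = pd.DataFrame({"points": [1, 2, 3]})
subset = df["points"]  # a view into df, not an independent copy
subset.iloc[0] = 100   # also mutates the original dataframe

print(df["points"].tolist())  # [100, 2, 3]
```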
When `copy_on_write` is enabled, a copy is created at assignment, and therefore the original dataframe is never changed. Pandas 2.0 will raise a `ChainedAssignmentError` in these situations to avoid silent bugs:
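The same operations with copy-on-write turned on (again a sketch with made-up data):

```python
import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"points": [1, 2, 3]})
subset = df["points"]
subset.iloc[0] = 100   # a copy is made on write, so df stays untouched

print(df["points"].tolist())  # [1, 2, 3]

# Chained assignment writes to a temporary object and has no effect;
# pandas warns with a ChainedAssignmentError instead of failing silently
df["points"][0] = 100
```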
5. Optional Dependencies
When using `pip`, version 2.0 gives us the flexibility to install optional dependencies, which is a plus in terms of customization and optimization of resources.
We can tailor the installation to our specific requirements, without spending disk space on what we don't really need.
Plus, it saves a lot of "dependency headaches", reducing the likelihood of compatibility issues or conflicts with other packages we may have in our development environments:
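For instance, extras can be picked per use case; the exact list of extras is documented in the pandas installation guide, and the groups below are just examples:

```bash
# Core pandas only
pip install pandas==2.0.0

# Core pandas plus selected optional dependency groups
pip install "pandas[performance]==2.0.0"
pip install "pandas[postgresql, aws]==2.0.0"
```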
Taking it for a spin!
Yet, the question lingered: is the buzz really justified? I was curious to see whether pandas 2.0 provided significant improvements with respect to some packages I use on a daily basis: ydata-profiling, matplotlib, seaborn, scikit-learn.
From those, I decided to take ydata-profiling for a spin: it has just added support for pandas 2.0, which seemed like a must-have for the community! In the new release, users can rest assured that their pipelines won't break if they're using pandas 2.0, and that's a major plus! But what else?
Truth be told, ydata-profiling has been one of my top favorite tools for exploratory data analysis, and it's a nice and quick benchmark too: a single line of code on my side, but under the hood it is full of computations that, as a data scientist, I need to work out anyway: descriptive statistics, histogram plotting, correlation analysis, and so on.
So what better way than testing the impact of the `pyarrow` engine on all of those at once with minimal effort?
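Here's the kind of side-by-side I mean, sketched out with an illustrative file path:

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Same dataset, read with the default engine and with pyarrow
df_numpy = pd.read_csv("hacker_news.csv")
df_arrow = pd.read_csv("hacker_news.csv", engine="pyarrow")

# One line on my side; statistics, histograms, and correlations under the hood
ProfileReport(df_numpy, title="Default backend").to_file("report_numpy.html")
ProfileReport(df_arrow, title="pyarrow engine").to_file("report_arrow.html")
```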
Again, reading the data is definitely better with the `pyarrow` engine, although creating the data profile has not changed significantly in terms of speed.
Yet, the differences may lie in memory efficiency, for which we'd have to run a different analysis. Also, we could further investigate the type of analysis being conducted over the data: for some operations, the difference between versions 1.5.2 and 2.0 seems negligible.
But the main thing I noticed that might make a difference in this regard is that ydata-profiling is not yet leveraging the `pyarrow` data types. This update could have a great impact on both speed and memory, and it's something I look forward to in future developments!
The Verdict: Performance, Flexibility, Interoperability!
This new pandas 2.0 release brings a lot of flexibility and performance optimization with subtle, yet crucial, modifications "under the hood".
Maybe they are not "flashy" for newcomers to the field of data manipulation, but they sure as hell are like water in the desert for veteran data scientists who used to jump through hoops to overcome the limitations of the previous versions.
Wrapping it up, these are the main advantages introduced in the new release:
- Performance Optimization: with the introduction of the Apache Arrow backend, more numpy dtype indices, and the copy-on-write mode;
- Added flexibility and customization: allowing users to control optional dependencies and take advantage of the Apache Arrow data types (including nullability from the get-go!);
- Interoperability: perhaps a less "acclaimed" advantage of the new version, but one with huge impact. Since Arrow is language-independent, in-memory data can be transferred between programs built not only on Python, but also on R, Spark, and others using the Apache Arrow backend!
And there you have it, folks! I hope this wrap-up has quieted down some of your questions around pandas 2.0 and its applicability to our data manipulation tasks.
I'm still curious whether you have found major differences in your daily coding with the introduction of pandas 2.0 as well! If you're up to it, come and find me at the Data-Centric AI Community and let me know your thoughts! See you there?
About me
Ph.D., Machine Learning Researcher, Educator, Data Advocate, and overall "jack-of-all-trades". Here on Medium, I write about Data-Centric AI and Data Quality, educating the Data Science & Machine Learning communities on how to move from imperfect to intelligent data.
Data-Centric AI Community | GitHub | Google Scholar | LinkedIn