Need for Speed: Comparing Pandas 2.0 with Four Python Speed-Up Libs (with Code)

2023, so far, has been a year of phenomenal progress in the field of Machine Learning, with the development and worldwide distribution of advanced Large Language Models (LLMs). However, machine learning proficiency is not restricted to fine-tuning LLMs and prompt engineering. A machine learning expert knows what happens under the hood, can explain and optimize various system behaviors, and is ultimately responsible for the overall quality, business-need fit, and performance of the machine learning solution.
Speaking of knowing what happens under the hood and performance, this article will compare the performance of the newly released Pandas 2.0 against four other speed-up Python libraries and operations: Polars, RAPIDS.ai cuDF, Dask, and Numba. The code was executed on an ASUS ROG Strix Scar (2022) gaming laptop with an NVIDIA GeForce RTX 3080 Ti Laptop GPU and an Intel Core i9-12900H processor. Besides performance comparison, the article aims to educate the reader on how to implement various Python speed-up operations by offering code examples that illustrate the key points.
In our code examples, we will work with a synthetic .csv file containing 500K rows that simulate the merchandise prices of a superstore with two locations. The file has three columns: column 1, called Store1, contains the inventory prices of the first superstore location, and column 2, called Store2, contains the inventory prices of the second superstore location. The prices in Store2 are slightly higher (by up to 20%). Finally, column 3, called Discountability, contains only 0s and 1s and tells us whether an item can be discounted: 0 means the item cannot be discounted, and 1 means it can. The code for creating the Pandas 2.0 DataFrame and the synthetic .csv file is shown below.
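A minimal sketch of the data creation might look as follows. The file name `superstore.csv`, the price range, and the random seed are my assumptions; the original code may differ in these details:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed chosen for reproducibility (an assumption)

n_rows = 500_000
col1 = rng.uniform(1.0, 100.0, n_rows)   # Store1 inventory prices
var = rng.uniform(0.0, 0.2, n_rows)      # mark-up factor, up to 20%
col2 = col1 * (1.0 + var)                # Store2 prices: vectorized, no loop needed
col3 = rng.integers(0, 2, n_rows)        # Discountability flags: 0 or 1

df = pd.DataFrame({"Store1": col1, "Store2": col2, "Discountability": col3})
df.to_csv("superstore.csv", index=False)
```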
Key point: Note that col2 (prices of Store2) is created from the numpy arrays col1 and var using a vectorized operation without having to use loops. We can do that because NumPy supports vectorized operations.
A. Dataframe Creation Performance Comparison
The first operation for which we will compare the performance of different approaches is that of a DataFrame creation. Specifically, we will compare the performance of Pandas 2.0, Dask, and Polars. In the following two sections, we will discuss the code implementations, and then in section A.3, we will compare the performance.
A.1. With Pandas 2.0
One of the essential new features in Pandas 2.0 is the addition of the Apache Arrow (pyarrow) backing memory format. The main advantage of Apache Arrow is that it enables faster and more memory-efficient operations. In the code below, we show the construction of a Pandas DataFrame with and without the use of pyarrow.
Key point: Note that once we specify the backend type (dtype_backend) to be pyarrow, we need to specify the engine as pyarrow as well.
A.2 With Polars and Dask
Polars is a turbocharged Python DataFrame library that implements the Apache Arrow memory model and is written in Rust. Similarly to Apache Arrow, it is both fast and memory-efficient, which makes it particularly suitable for data-intensive computing. In order to create the Polars DataFrame, all we need to do is import the Polars library and invoke its read_csv() method.
Dask is an open-source framework that enables parallel and distributed computing in Python, and therefore it is particularly suited for data-intensive applications. In order to create a Dask DataFrame, we must first import the dask.dataframe library and then call its read_csv() method.
Key point: Be careful when installing the Dask library: Pandas is one of its dependencies and will be installed alongside it. At the time of this writing, installing Dask will pull in an older version of Pandas rather than Pandas 2.0. So, if you want to work with Dask without overwriting your Pandas 2.0 installation, create a virtual environment and install Dask there.
A.3 Dataframe Creation Execution Time Comparison
Timing of operations is implemented using the timeit library, as shown below.
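The timing pattern might be sketched as follows (a single-run measurement via a lambda; the exact statement timed in the article is the DataFrame creation):

```python
import timeit
import pandas as pd

# Small stand-in for the synthetic file (file name is an assumption)
with open("superstore.csv", "w") as f:
    f.write("Store1,Store2,Discountability\n10.0,11.0,1\n")

# Time one execution of the DataFrame-creation statement
elapsed = timeit.timeit(lambda: pd.read_csv("superstore.csv"), number=1)
print(f"DataFrame creation took {elapsed:.6f} s")
```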
Below are the execution times for DataFrame creation, from fastest to slowest. The fastest are Dask and Polars, followed closely by Pandas 2.0 with a pyarrow backend. The traditional Pandas DataFrame creation is about 4.37 times slower than the pyarrow-backed one. So, even from the first step of creating our Pandas DataFrame, pyarrow makes a difference.
- With Dask: 0.020656200009398162
- With Polars: 0.027874399966094643
- With pyarrow-enabled Pandas 2.0: 0.04491699999198317
- With Pandas 2.0 (without pyarrow): 0.196299700008239
In addition to measuring the execution time, I also performed some estimations of memory usage. The cuDF DataFrame has the smallest memory footprint (0.12 MB), the Polars DataFrame uses about 7.63 MB, and the Pandas DataFrame has a memory footprint of about 11.44 MB. Note that the creation time for the cuDF DataFrame is not provided because the .csv file resides in CPU memory, so we cannot harness the GPU capabilities of cuDF for this operation. We will do so later in the article.
B. Performance Comparison of DataFrame Operations
In order to compare the performance of the aforementioned libraries when they are tasked with DataFrame operations, let us assume the following: all eligible merchandise of Store1 will be discounted by 20%, all eligible merchandise of Store2 will be discounted by 30%, and the discounted prices will be saved in a new DataFrame. We use the word "eligible" because, as discussed above, items that have a Discountability of 0 cannot be discounted. So, we will perform an execution time comparison on applying a function to the rows of a DataFrame, which is a prevalent task.
Code implementations are shown in sections B.1-B.8, and the performance comparison is shown in section B.9.
B.1 Pandas with pyarrow
In our first experiment for DataFrame operations, we will harness the capabilities of Apache Arrow, given its recent interoperability with Pandas 2.0. As shown in the first line of the code below, we convert a Pandas DataFrame to a pyarrow Table, which is an efficient way to represent columnar data in memory. Each column is stored separately, which allows for efficient compression and data queries. Then, the Store1, Store2, and Discountability columns are passed to the function scale_columns(), which scales the columns by the appropriate discount (0.2 or 0.3) and the mask value (0 or 1) of the Discountability column. The scaled columns are returned by the function as a tuple. Finally, the Table result_table is converted to a Pandas DataFrame.
Key point: Multiplications inside the function scale_columns() are implemented using the pyarrow.compute function multiply(). Note that EVERY multiplication has to be implemented using multiply(). For example, if we replaced pc.multiply(0.2, mask_col) with the plain expression 0.2 * mask_col, we would get an error. Finally, we use the subtract() function of pyarrow.compute for subtraction.
B.2 With the Pandas method apply()
Our second experiment will use the Pandas DataFrame apply() method, which performs row-wise operations. The function scale_columns() performs the scaling in the code below. Note that the actual scaling is done inside the nested function discount_store().
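A sketch of this implementation (the small inline DataFrame is my stand-in for the full dataset):

```python
import pandas as pd

# Small stand-in for the 500K-row dataset (values are an assumption)
df = pd.DataFrame({"Store1": [100.0, 50.0], "Store2": [120.0, 60.0],
                   "Discountability": [1, 0]})

def scale_columns(frame):
    def discount_store(row):
        mask = row["Discountability"]
        # Returning a Series keyed by column name mirrors what apply()
        # itself produces, which keeps the computation cheap
        return pd.Series({"Store1": row["Store1"] * (1 - 0.2 * mask),
                          "Store2": row["Store2"] * (1 - 0.3 * mask)})
    return frame.apply(discount_store, axis=1)  # row-wise application

result_df = scale_columns(df)
```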
Key point: The nested function returns a Pandas Series built from a dictionary where the key is the column name and the value is the scaled store price. You may wonder why we are returning this type of object. The reason is that the Pandas apply() method returns a Series whose index is the column names. Thus, passing apply() an object with a structure similar to what it must return facilitates the computation. For educational purposes, in the GitHub directory of my code (link at the end of the article), I provide two implementations using apply(): the one shown here, and another where the nested function returns the scaled values as a tuple. That implementation is more computationally intensive because the returned type has a different structure from what apply() must return.
B.3. With Pandas itertuples()
itertuples() is a fast Pandas row iterator that produces namedtuples (column-name, value-of-corresponding-row). The code that implements the scaling of the store prices and computes the discounted prices using itertuples() is shown below.
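A sketch of the itertuples() approach (the small inline DataFrame is my stand-in for the full dataset):

```python
import pandas as pd

# Small stand-in for the 500K-row dataset (values are an assumption)
df = pd.DataFrame({"Store1": [100.0, 50.0], "Store2": [120.0, 60.0],
                   "Discountability": [1, 0]})

s1, s2 = [], []
for row in df.itertuples(index=False):
    # row is a namedtuple whose fields are named after the columns
    s1.append(row.Store1 * (1 - 0.2 * row.Discountability))
    s2.append(row.Store2 * (1 - 0.3 * row.Discountability))

result_df = pd.DataFrame({"Store1": s1, "Store2": s2})
```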
Key point: The Pandas itertuples() method is often faster than Pandas apply(), particularly for larger datasets, as in our code example. The reason is that apply() invokes a Python function for every row, while itertuples() simply returns a lightweight iterator of namedtuples, avoiding that per-row function-call overhead and the creation of new Series objects, which also makes it memory-efficient.
B.4 Pandas Vectorized Operations
And now, we have come to the most elegant and fastest way to compute discounted store prices: Vectorization! This is a way that allows the application of the desired operations to entire arrays at once instead of iterating using loops. The implementation is shown below.
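A sketch of the vectorized version, which applies the discounting arithmetic to entire columns at once (the small inline DataFrame is my stand-in for the full dataset):

```python
import pandas as pd

# Small stand-in for the 500K-row dataset (values are an assumption)
df = pd.DataFrame({"Store1": [100.0, 50.0], "Store2": [120.0, 60.0],
                   "Discountability": [1, 0]})

mask = df["Discountability"]
# Whole-column arithmetic: no explicit loops
result_df = pd.DataFrame({
    "Store1": df["Store1"] * (1 - 0.2 * mask),
    "Store2": df["Store2"] * (1 - 0.3 * mask),
})
```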
B.5 With Numba
Numba is a just-in-time (JIT) compiler for Python that, at runtime, translates Python code into optimized machine code. It is instrumental in optimizing operations involving loops and NumPy arrays.
The @numba.njit symbol shown below is a decorator that tells Numba that the function that follows is to be compiled into machine code.
B.6 With Dask
Similar to NumPy, Dask offers vectorized operations, with the additional advantage that these operations can be applied in a parallel and distributed manner.
Additional advantages of Dask include: (a) Lazy evaluation, i.e., Dask arrays and operations are built on a task graph and executed only when the result is requested, e.g., by calling compute(). (b) Out-of-core processing, i.e., it can process datasets that do not fit into memory. (c) Integration with Pandas.
B.7 With Polars
Polars offers vectorized operations that leverage the speed and memory efficiency of the Rust language. Similar to Dask, it offers lazy evaluation, out-of-core processing, and integration with Pandas. Implementation is shown below.
B.8 With RAPIDS.ai cuDF
Finally, we use cuDF to implement the price discounting function via vectorized operations. The cuDF library is built on top of CUDA, and so its vectorized operations leverage the computational strength of GPUs for faster processing. One handy feature of cuDF is the offering of CUDA kernels. These are optimized functions that harness the parallel nature of GPUs. The cuDF implementation is shown below.
B.9 Execution Time Comparison of Function Application
Similar to the timing of DataFrame creation, timeit was used to measure the execution time of the price discounting function and its saving in a new DataFrame. Below are the execution times, ranked from fastest to slowest.
- With RAPIDS.ai cuDF: 0.008894977000011295
- With Polars: 0.009557300014421344
- With pyarrow Table: 0.028865800006315112
- With Pandas 2.0 vectorized operations: 0.0536642000079155
- With Dask: 0.6413181999814697
- With Numba: 0.9497981000000095
- With Pandas 2.0 itertuples(): 1.0882243000087328
- With Pandas 2.0 apply(): 197.63155489997007
Not surprisingly, the GPU-enabled RAPIDS.ai cuDF achieves the fastest time, followed very closely by the lightning-fast Polars library. The execution times of the pyarrow Table and Pandas 2.0 vectorized operations follow closely, too; cuDF's execution time is only about 4.64 times faster on average than the latter two. Dask, Numba, and the Pandas method itertuples() have an inferior but reasonable performance (approximately 72, 107, and 122 times slower than cuDF, respectively). Finally, the Pandas apply() has an outstandingly inferior performance to all other methods, which is also not surprising given that this method works in a row-wise, relatively slow manner for large datasets.
Key points: (a) Generally speaking, apply() is an excellent way to implement functions on Pandas DataFrames for small to medium-sized datasets. But, when dealing with large datasets, as in our case, it is best to look at other ways to implement the functions. (b) If you decide to use apply() in Pandas 2.0, ensure that this is done on a DataFrame created the standard way and not with pyarrow in the back end. The reason is that a very significant datatype conversion overhead will slow your computations considerably.
C. Conclusion
In this article, I described several ways to speed up Python code applied to a large dataset, with a particular focus on the newly released Pandas 2.0 with a pyarrow backend. The different speed-up techniques were compared performance-wise for two tasks: (a) DataFrame creation and (b) application of a function to the rows of the DataFrame. The news is indeed excellent for Pandas 2.0 users: Pandas DataFrame vectorized operations and the pyarrow Table (extracted from a Pandas DataFrame) achieved performance comparable to RAPIDS.ai cuDF and the Polars library.
The entire code is in the github directory: link
Thank you for reading!