Polars + NVIDIA GPU Tutorial
Dealing with massive datasets in Python has always been a challenge. The language is not tailored for handling huge amounts of data the way native SQL systems or Spark are.
The most famous library for handling 2-D datasets in Python is, without question, pandas. Although easy to use and adopted by virtually every data scientist, pandas is written in Python and C, making it a bit cumbersome and slow when performing operations on large data. If you are a data scientist, you've dealt with the pain of waiting what feels like 200 years for a group by to finish.
One of the libraries that aims to solve this is polars, an extremely efficient Python package that is able to handle large datasets, mostly for the following reasons:
- It's written in Rust
- It leverages multi-threading automatically
- It defers most calculations by using lazy evaluation
And, as of today, you can also leverage NVIDIA hardware to maximize polars' capabilities with its new GPU engine.
In this blog post, we'll see how you can leverage polars+GPU and speed up your data pipelines enormously.
Setting Up the Environment
To follow along with this blog post, you can check the accompanying notebook here.
First, let's install polars with GPU capabilities by using:
!pip install polars[gpu] --extra-index-url=https://pypi.nvidia.com
We need to bring out the big guns to test our data pipelines. We'll be using a large dataset of Simulated Transactions – a large dataset representing different monetary transactions of customers.
As I'm assuming you are using Google Colab and we want to stay on the free tier, I'll use a sample of this dataset that is only 20% of the full size:
!wget https://storage.googleapis.com/rapidsai/polars-demo/transactions-t4-20.parquet -O transactions.parquet
But, if you have your own system and resources that can handle the size of the total data, feel free to download the original file:
!wget https://storage.googleapis.com/rapidsai/polars-demo/transactions.parquet -O transactions.parquet

Finally, let's load polars:
import polars as pl
from polars.testing import assert_frame_equal
Setup is done, time to play around with our dataset!
Simple Calculations
We need to load our dataset into our polars engine by leveraging the scan_parquet method:
transactions = pl.scan_parquet("transactions.parquet")
We can take a peek at our first 5 transactions using head and collect:
transactions.head(5).collect()

The simplest use case we can think of is a simple aggregation over a numerical column. Let's sum the total amount of the transactions in our table using polars syntax:
%%time
transactions.select(pl.col("AMOUNT").sum()).collect()

CPU time for this operation was 1.36s on the 52 million rows. We'll keep track of the execution time of our code throughout our experiments for benchmarking purposes.
Let's see how this operation works on the gpu_engine:
gpu_engine = pl.GPUEngine(
    device=0,  # index of the GPU to run on
    raise_on_fail=True,  # raise instead of silently falling back to the CPU engine
)
%%time
transactions.select(pl.col("AMOUNT").sum()).collect(engine=gpu_engine)

Nice improvement! The GPU run was ~4 times faster. Will this hold for more complex calculations?
Using Group By + Aggregations
One common operation we do with two-dimensional structures is a group by followed by some type of aggregation. These aggregations typically show up when quick answers are needed for specific business questions or during particular steps of the data wrangling process.
For example, which customers are the highest spenders in our dataset? Let's calculate that using the CPU:
%%time
high_spenders_cpu = (
    transactions
    .group_by("CUST_ID")
    .agg(pl.col("AMOUNT").sum())
    .sort(by="AMOUNT", descending=True)
    .head()
    .collect()
)
high_spenders_cpu
The CPU time for this operation was (drumroll):

4.27 seconds, not bad. But what if we do this operation on our GPU?
%%time
high_spenders_gpu = (
    transactions
    .group_by("CUST_ID")
    .agg(pl.col("AMOUNT").sum())
    .sort(by="AMOUNT", descending=True)
    .head()
    .collect(engine=gpu_engine)
)
high_spenders_gpu

345 ms for an aggregation on a 52 million row dataframe. That's actually a great result!
And what's nice is that we can also rely on the polars SQL interface to achieve similar results:
sql_query = """
SELECT CUST_ID, SUM(AMOUNT) AS sum_amt
FROM transactions
GROUP BY CUST_ID
ORDER BY sum_amt DESC
LIMIT 5
"""
%time pl.sql(sql_query).collect()
%time pl.sql(sql_query).collect(engine=gpu_engine)

In conclusion, by using the GPU capabilities of polars we were able to speed up our aggregation operations by roughly 5x (and up to 10x), and it was awesome.
Standard polars already shows extremely good results in other benchmarks, but pairing it with GPU acceleration is a whole new game. In a nutshell:
- The fastest performance for Polars is on NVIDIA GPUs, up to 10x speed up over CPUs
- The Polars GPU engine enables users to process 100s of millions of rows of data in seconds
- Access to the Polars GPU engine requires almost zero code changes to existing Polars code
You can also check out more examples and experiments by visiting the official repo.
Last, but not least, thank you to NVIDIA for giving me early access to the package and the chance to experiment with this new accelerated version of polars. Looking forward to new developments!
If you want to read or watch more content related to AI and DS, subscribe to my YouTube channel "The Data Journey":

https://www.youtube.com/@TheDataJourney42
[The dataset used in this blog post is under the CC0: Public Domain license]