Polars + NVIDIA GPU Tutorial
Dealing with massive datasets in Python has always been a challenge. The language is not tailored for handling huge amounts of data the way native SQL systems or Spark are.
The most famous library for handling 2-D datasets in Python is, without question, pandas. Although easy to use and adopted by virtually every data scientist, pandas is written in Python and C, making it a bit cumbersome and slow when performing operations on large data. If you are a data scientist, you've dealt with the pain of waiting what feels like 200 years for a group by to finish.
One of the libraries that aims to solve this is polars, an extremely efficient Python package that is able to handle large datasets, mostly for the following reasons:
- It's written in Rust
- It leverages multi-threading automatically
- It defers most calculations by using lazy evaluation
And, as of today, you can also leverage NVIDIA hardware to maximize polars' capabilities with its new GPU engine.
In this blog post, we'll see how you can leverage polars+GPU and speed up your data pipelines enormously.
Setting Up the Environment
To follow along with this blog post, you can check the accompanying notebook here.
First, let's install polars with GPU capabilities by using:
!pip install polars[gpu] --extra-index-url=https://pypi.nvidia.com
We need to bring out the big guns to test our data pipelines. We'll be using a large dataset of Simulated Transactions – a large dataset representing different monetary transactions of customers.
As I'm assuming you are using Google Colab and we want to stay on the free tier, I'll use a sample of this dataset that is only 20% of the full size:
!wget https://storage.googleapis.com/rapidsai/polars-demo/transactions-t4-20.parquet -O transactions.parquet
But, if you have your own system and resources that can handle the size of the total data, feel free to download the original file:
!wget https://storage.googleapis.com/rapidsai/polars-demo/transactions.parquet -O transactions.parquet

Finally, let's load polars:
import polars as pl
from polars.testing import assert_frame_equal
Setup is done, time to play around with our dataset!
Simple Calculations
We need to load our dataset into our polars engine by leveraging the scan_parquet method:
transactions = pl.scan_parquet("transactions.parquet")
We can take a peek at our first 5 transactions using head and collect:
transactions.head(5).collect()

The simplest use case we can think of is a simple aggregation over a numerical column. Let's sum the total amount of the transactions in our table using polars syntax:
%%time
transactions.select(pl.col("AMOUNT").sum()).collect()

CPU time for this operation was 1.36s on the 52 million rows. We'll keep track of the execution time of our code throughout our experiments for benchmarking purposes.
Let's see how this operation works on the gpu_engine:
gpu_engine = pl.GPUEngine(
    device=0,  # index of the GPU to run on
    raise_on_fail=True,  # raise instead of silently falling back to the CPU engine
)
%%time
transactions.select(pl.col("AMOUNT").sum()).collect(engine=gpu_engine)

Nice improvement! The GPU run was ~4 times faster. Will this hold for more complex calculations?
Using Group By + Aggregations
One common operation we do with two-dimensional structures is a group by followed by some type of aggregation. These aggregations typically show up when quick answers are needed for specific business questions or during particular steps of the data wrangling process.
For example, which customers are the highest spenders in our dataset? Let's calculate that using the CPU:
%%time
high_spenders_cpu = (
    transactions
    .group_by("CUST_ID")
    .agg(pl.col("AMOUNT").sum())
    .sort(by="AMOUNT", descending=True)
    .head()
    .collect()
)
high_spenders_cpu
The CPU time for this operation was (drumroll):

4.27 seconds, not bad. But what if we do this operation on our GPU?
%%time
high_spenders_gpu = (
    transactions
    .group_by("CUST_ID")
    .agg(pl.col("AMOUNT").sum())
    .sort(by="AMOUNT", descending=True)
    .head()
    .collect(engine=gpu_engine)
)
high_spenders_gpu

345 ms for an aggregation on a 52 million row dataframe. That's actually a great result!
And what's nice is that we can also rely on the polars SQL interface to achieve similar results:
sql_query = """
SELECT CUST_ID, SUM(AMOUNT) AS sum_amt
FROM transactions
GROUP BY CUST_ID
ORDER BY sum_amt DESC
LIMIT 5
"""
%time pl.sql(sql_query).collect()
%time pl.sql(sql_query).collect(engine=gpu_engine)

In conclusion, by using the GPU capabilities of polars we were able to speed up our aggregation operations by roughly 5x (and up to 10x), and it was awesome.
Standard polars already shows extremely good results in other benchmarks, but pairing it with GPU acceleration is a whole new game. In a nutshell:
- The fastest performance for Polars is on NVIDIA GPUs, up to 10x speed up over CPUs
- The Polars GPU engine enables users to process 100s of millions of rows of data in seconds
- Access to the Polars GPU engine requires almost zero code changes to existing Polars code
You can also check out more examples and experiments by visiting the official repo.
Last, but not least, thank you to NVIDIA for giving me early access to the package and the chance to experiment with this new accelerated version of polars. Looking forward to new developments!
If you want to read or watch more content related to AI and DS, subscribe to my YouTube channel "The Data Journey":

https://www.youtube.com/@TheDataJourney42
[The dataset used in this blog post is under the CC0: Public Domain license]