Practical Introduction to Polars

Author:Murphy  |  View: 26218  |  Time: 2025-03-23 11:29:37

This article is not about comparing Polars with Pandas or highlighting their differences. It's a story about how adding a new tool can be beneficial not only for Data Science professionals but also for others who work with data. I like Polars because it is multithreaded, providing strong performance out-of-the-box, and it supports Lazy evaluation with query optimization capabilities. This tool will undoubtedly enhance your data skills and open up new opportunities.

Although Polars and Pandas are different libraries, they share similarities in their APIs. Drawing parallels between them can make it easier for those familiar with the Pandas API to start using Polars. Even if you're not familiar with Pandas and want to start learning Polars, it will still be incredibly useful and rewarding.


We will look at the most common actions that, in my experience, are most often used for data analysis. To illustrate the process of using Polars, I will consider an abstract task with reproducible data, so you can follow all the steps on your computer.

Imagine that we have data from three online stores, where we register user actions, such as viewing and purchasing. Let's assume that at any given time, only one action of each type can occur for each online store, and in case of a transaction error, our data might be missing the product identifier or its quantity. Additionally, for our task, we'll need a product catalog with prices for each item.

Let's formulate the main task: to calculate a summary table with the total purchase for each online store.

I will break down this task into the following steps:

  1. Data preparation and DataFrame creation.
  2. Summary statistics of the DataFrame.
  3. Retrieving the first five records.
  4. Renaming columns.
  5. Changing column types.
  6. Filling missing values.
  7. Removing missing values.
  8. Removing duplicate records.
  9. Filtering data.
  10. Selecting the required columns.
  11. Grouping data.
  12. Merging data with another DataFrame.
  13. Calculating a new column.
  14. Creating a Pivot table.

Let's get started!

Data Preparation and DataFrame Creation

We have the following data:

  • OnlineStore – indicates the store.
  • product – stores the product ID.
  • Action type – the type of action (either a view or a purchase).
  • quantity – the amount of the purchased or viewed product.
  • Action_time – the timestamp for the action.

Requirements:

polars==1.6.0 pandas==2.0.0

from dataclasses import dataclass
from datetime import datetime, timedelta
from random import choice, gauss, randrange, seed
from typing import Any, Dict

import polars as pl
import Pandas as pd

seed(42)
base_time= datetime(2024, 8, 31, 0, 0, 0, 0)

user_actions_data = [
    {
        "OnlineStore": choice(["Shop1", "Shop2", "Shop3"]),
        "product": choice(["0001", "0002", "0003"]),
        "quantity": choice([1.0, 2.0, 3.0]),
        "Action type": ("purchase" if gauss() > 0.6 else "view"),
        "Action_time": base_time - timedelta(minutes=randrange(1_000_000)),
    }
    for x in range(1_000_000)
]

corrupted_data = [
    {
        "OnlineStore": choice(["Shop1", "Shop2", "Shop3"]),
        "product": choice(["0001", None]),
        "quantity": choice([1.0, None]),
        "Action type": ("purchase" if gauss() > 0.6 else "view"),
        "Action_time": base_time - timedelta(minutes=randrange(1_000)),
    }
    for x in range(1_000)
]

For product catalog, which in our case include only product_id and its price (price).

product_catalog_data = {"product_id": ["0001", "0002", "0003"], "price": [100, 25, 80]}

The data is ready. Now let's create DataFrames using these data with Pandas and Polars:

# Pandas
user_actions_pd_df = pd.DataFrame(user_actions_data)
corrupted_pd_df = pd.DataFrame(corrupted_data)
product_catalog_pd_df = pd.DataFrame(product_catalog_data)

# Polars
user_actions_pl_df = pl.DataFrame(user_actions_data)
corrupted_pl_df = pl.DataFrame(corrupted_data)
product_catalog_pl_df = pl.DataFrame(product_catalog_data)

Since we have user_actions_df and corrupted_df, let's concatenate them into a single DataFrame.

Python"># Pandas
user_actions_pd_df = pd.concat([user_actions_pd_df, corrupted_pd_df])

# Polars
user_actions_pl_df = pl.concat([user_actions_pl_df, corrupted_pl_df])

In this way, we have easily created DataFrames for further work.

Of course, each method has its own parameters, so it's best to have the documentation handy to avoid confusion and use them appropriately.

Summary Statistics of the DataFrame

After loading or preparing data, it's useful to quickly explore the resulting dataset. For summary statistics, the method name remains the same, but the results may differ:

# Pandas
user_actions_pd_df.describe(include='all')
       OnlineStore  product      quantity Action type          Action_time
count      1001000  1000492  1.000510e+06     1001000              1001000
unique           3        3           NaN           2               632335
top          Shop3     0001           NaN        view  2024-08-30 22:02:00
freq        333931   333963           NaN      726623                    9
first          NaN      NaN           NaN         NaN  2022-10-06 13:23:00
last           NaN      NaN           NaN         NaN  2024-08-30 23:58:00
mean           NaN      NaN  1.998925e+00         NaN                  NaN
std            NaN      NaN  8.164457e-01         NaN                  NaN
min            NaN      NaN  1.000000e+00         NaN                  NaN
25%            NaN      NaN  1.000000e+00         NaN                  NaN
50%            NaN      NaN  2.000000e+00         NaN                  NaN
75%            NaN      NaN  3.000000e+00         NaN                  NaN
max            NaN      NaN  3.000000e+00         NaN                  NaN
# Polars
user_actions_pl_df.describe()
┌────────────┬─────────────┬─────────┬───────────┬─────────────┬────────────────────────────┐
│ statistic  ┆ OnlineStore ┆ product ┆ quantity  ┆ Action type ┆ Action_time                │
│ ---        ┆ ---         ┆ ---     ┆ ---       ┆ ---         ┆ ---                        │
│ str        ┆ str         ┆ str     ┆ f64       ┆ str         ┆ str                        │
╞════════════╪═════════════╪═════════╪═══════════╪═════════════╪════════════════════════════╡
│ count      ┆ 1001000     ┆ 1000492 ┆ 1.00051e6 ┆ 1001000     ┆ 1001000                    │
│ null_count ┆ 0           ┆ 508     ┆ 490.0     ┆ 0           ┆ 0                          │
│ mean       ┆ null        ┆ null    ┆ 1.998925  ┆ null        ┆ 2023-09-19 03:24:30.981698 │
│ std        ┆ null        ┆ null    ┆ 0.816446  ┆ null        ┆ null                       │
│ min        ┆ Shop1       ┆ 1       ┆ 1.0       ┆ purchase    ┆ 2022-10-06 13:23:00        │
│ 25%        ┆ null        ┆ null    ┆ 1.0       ┆ null        ┆ 2023-03-29 03:09:00        │
│ 50%        ┆ null        ┆ null    ┆ 2.0       ┆ null        ┆ 2023-09-19 06:49:00        │
│ 75%        ┆ null        ┆ null    ┆ 3.0       ┆ null        ┆ 2024-03-11 03:01:00        │
│ max        ┆ Shop3       ┆ 3       ┆ 3.0       ┆ view        ┆ 2024-08-30 23:58:00        │
└────────────┴─────────────┴─────────┴───────────┴─────────────┴────────────────────────────┘

As you can notice, Pandas calculates statistics differently for various data types and provides unique values for all columns. Polars, on the other hand, calculates the null_count value.

Additionally, in the Polars documentation, it is stated:

We do not guarantee the output of describe to be stable. It will show statistics that we deem informative, and may be updated in the future. Using describe programmatically (versus interactive exploration) is not recommended for this reason.

Retrieving the First Five Records

When encountering data for the first time, we always want to explore it. Beyond obtaining summary statistics, it's also important to see the actual records it contains. To do this, we often look at the first five records as a sample.

# Pandas
user_actions_pd_df.head()
  OnlineStore product  quantity Action type         Action_time
0       Shop3    0001       1.0        view 2024-05-21 09:24:00
1       Shop3    0001       3.0        view 2023-03-10 15:54:00
2       Shop3    0001       3.0        view 2024-03-24 19:02:00
3       Shop1    0003       3.0        view 2024-08-11 16:16:00
4       Shop3    0001       3.0        view 2024-03-23 11:32:00
# Polars
user_actions_pl_df.head()
┌─────────────┬─────────┬──────────┬─────────────┬─────────────────────┐
│ OnlineStore ┆ product ┆ quantity ┆ Action type ┆ Action_time         │
│ ---         ┆ ---     ┆ ---      ┆ ---         ┆ ---                 │
│ str         ┆ str     ┆ f64      ┆ str         ┆ datetime[μs]        │
╞═════════════╪═════════╪══════════╪═════════════╪═════════════════════╡
│ Shop3       ┆ 0001    ┆ 1.0      ┆ view        ┆ 2024-05-21 09:24:00 │
│ Shop3       ┆ 0001    ┆ 3.0      ┆ view        ┆ 2023-03-10 15:54:00 │
│ Shop3       ┆ 0001    ┆ 3.0      ┆ view        ┆ 2024-03-24 19:02:00 │
│ Shop1       ┆ 0003    ┆ 3.0      ┆ view        ┆ 2024-08-11 16:16:00 │
│ Shop3       ┆ 0001    ┆ 3.0      ┆ view        ┆ 2024-03-23 11:32:00 │
└─────────────┴─────────┴──────────┴─────────────┴─────────────────────┘

Polars has a useful glimpse() function that provides a dense preview of the DataFrame. It not only returns the first 10 records (or any number you specify using the max_items_per_column parameter) but also shows data types and record counts.

# Polars
user_actions_pl_df.glimpse()
Rows: 1001000
Columns: 5
$ OnlineStore           'Shop3', 'Shop3', 'Shop3', 'Shop1', 'Shop3', 'Shop2', 'Shop1', 'Shop2', 'Shop1', 'Shop2'
$ product               '0001', '0001', '0001', '0003', '0001', '0003', '0001', '0001', '0002', '0003'
$ quantity              1.0, 3.0, 3.0, 3.0, 3.0, 2.0, 3.0, 1.0, 2.0, 1.0
$ Action type           'view', 'view', 'view', 'view', 'view', 'view', 'view', 'view', 'view', 'view'
$ Action_time  2024-05-21 09:24:00, 2023-03-10 15:54:00, 2024-03-24 19:02:00, 2024-08-11 16:16:00, 2024-03-23 11:32:00, 2023-01-19 14:11:00, 2024-03-27 05:08:00, 2023-11-28 08:18:00, 2023-03-18 15:01:00, 2022-10-29 09:44:00

Renaming Columns

After exploring the data, it is often necessary to edit it for further use. If the column names are not satisfactory or if your company has its own naming conventions, you can easily rename them.

# Pandas
user_actions_pd_df = user_actions_pd_df.rename(
    columns={
        "OnlineStore": "online_store",
        "product": "product_id",
        "Action type": "action_type",
        "Action_time": "action_dt",
    }
)
# user_actions_pd_df.columns
Index(['online_store', 'product_id', 'quantity', 'action_type', 'action_dt'], dtype='object')
# Polars
user_actions_pl_df = user_actions_pl_df.rename(
    {
        "OnlineStore": "online_store",
        "product": "product_id",
        "Action type": "action_type",
        "Action_time": "action_dt",
    }
)
# user_actions_pl_df.columns
['online_store', 'product_id', 'quantity', 'action_type', 'action_dt']

Changing Column Types

When working with data, optimizing their processing is often a priority, and data types are no exception. Choosing the right type not only unlocks available functions but also saves memory. In our example, I will change the column type of quantity from float to int. In Pandas, you would use the astype() method, while in Polars, you use the cast() method.

# Pandas
user_actions_pd_df = user_actions_pd_df.astype({"quantity": "Int64"})
Int64Index: 1001000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column        Non-Null Count    Dtype         
---  ------        --------------    -----         
 0   online_store  1001000 non-null  object        
 1   product_id    1000492 non-null  object        
 2   quantity      1000510 non-null  Int64         
 3   action_type   1001000 non-null  object        
 4   action_dt     1001000 non-null  datetime64[ns]
dtypes: Int64(1), datetime64[ns](1), object(3)
memory usage: 46.8+ MB
# Polars
user_actions_pl_df = user_actions_pl_df.cast({"quantity": pl.Int32})
Rows: 1001000
Columns: 5
$ online_store           
$ product_id             
$ quantity               
$ action_type            
$ action_dt     

Polars has a special method estimated_size() that returns an estimate of the total (heap) allocated size of the DataFrame. For example:

user_actions_pl_df.estimated_size("mb")
# Result: 24.91054630279541

Although the method names for changing types differ, SQL enthusiasts will appreciate the ease of transition.

Filling Missing Values

In real projects, data is rarely perfect, and we often discuss with managers, analysts, and other systems how to interpret data behavior. During data preparation, I specifically generated corrupted_data to introduce some chaos into the data. Handling missing values could easily be the subject of an entire book.

There are several strategies for filling in missing values, and the choice of method depends on the task: sometimes filling missing values with zeros is sufficient, while other times the mean value may be used. In Polars, the fill_null() method can be applied both to the DataFrame and to specific columns. To add a new column or replace values in an existing one, the with_columns() method is also used.

In our example, I will fill missing values in the quantity column with 0:

# Pandas
user_actions_pd_df["quantity"].fillna(0, inplace=True)
Data columns (total 5 columns):
 #   Column        Non-Null Count    Dtype         
---  ------        --------------    -----         
 0   online_store  1001000 non-null  object        
 1   product_id    1000492 non-null  object        
 2   quantity      1001000 non-null  Int64         
 3   action_type   1001000 non-null  object        
 4   action_dt     1001000 non-null  datetime64[ns]
dtypes: Int64(1), datetime64[ns](1), object(3)
# Polars
user_actions_pl_df = user_actions_pl_df.with_columns(pl.col("quantity").fill_null(0))
┌────────────┬──────────────┬────────────┬──────────┬─────────────┬────────────────────────────┐
│ statistic  ┆ online_store ┆ product_id ┆ quantity ┆ action_type ┆ action_dt                  │
│ ---        ┆ ---          ┆ ---        ┆ ---      ┆ ---         ┆ ---                        │
│ str        ┆ str          ┆ str        ┆ f64      ┆ str         ┆ str                        │
╞════════════╪══════════════╪════════════╪══════════╪═════════════╪════════════════════════════╡
│ count      ┆ 1001000      ┆ 1000492    ┆ 1.001e6  ┆ 1001000     ┆ 1001000                    │
│ null_count ┆ 0            ┆ 508        ┆ 0.0      ┆ 0           ┆ 0                          │
└────────────┴──────────────┴────────────┴──────────┴─────────────┴────────────────────────────┘

In Polars, you can use various strategies for filling missing values in the data, such as: {None, 'forward', 'backward', 'min', 'max', 'mean', 'zero', 'one'}. The names of these strategies are self-explanatory, so we won't delve into their details.

It's also worth noting that for filling NaN values in floating-point columns, you should use the fill_nan() method, which does not involve strategies.

Removing Missing Values

Not all missing values can be filled, so those that cannot be correctly filled and used in further calculations are best removed. In our case, this applies to the product_id column, as we cannot compute the final result without this identifier.

To remove rows with missing values in Pandas and Polars, use the following methods:

# Pandas
user_actions_pd_df.dropna(subset=["product_id"], inplace=True)
Int64Index: 1000492 entries, 0 to 999
Data columns (total 5 columns):
 #   Column        Non-Null Count    Dtype         
---  ------        --------------    -----         
 0   online_store  1000492 non-null  object        
 1   product_id    1000492 non-null  object        
 2   quantity      1000492 non-null  Int64         
 3   action_type   1000492 non-null  object        
 4   action_dt     1000492 non-null  datetime64[ns]
dtypes: Int64(1), datetime64[ns](1), object(3)
# Polars
user_actions_pl_df = user_actions_pl_df.drop_nulls(subset=["product_id"])
┌────────────┬──────────────┬────────────┬────────────┬─────────────┬────────────────────────────┐
│ statistic  ┆ online_store ┆ product_id ┆ quantity   ┆ action_type ┆ action_dt                  │
│ ---        ┆ ---          ┆ ---        ┆ ---        ┆ ---         ┆ ---                        │
│ str        ┆ str          ┆ str        ┆ f64        ┆ str         ┆ str                        │
╞════════════╪══════════════╪════════════╪════════════╪═════════════╪════════════════════════════╡
│ count      ┆ 1000492      ┆ 1000492    ┆ 1.000492e6 ┆ 1000492     ┆ 1000492                    │
│ null_count ┆ 0            ┆ 0          ┆ 0.0        ┆ 0           ┆ 0                          │
└────────────┴──────────────┴────────────┴────────────┴─────────────┴────────────────────────────┘

It's also worth noting that to remove NaN values in floating-point columns, you should use the drop_nans() method.

Removing Duplicate Records

The simplest case of duplicate records occurs when all values of one record are identical to another. In our case, duplicates might arise if the same action is recorded multiple times for the same action type in the same online store at a single point in time. I will keep only the most recent value in case duplicates are found.

To remove duplicate records in Pandas, use the drop_duplicates() method, and in Polars, the unique() method.

# Pandas
user_actions_pd_df.drop_duplicates(
    subset=["online_store", "action_type", "action_dt"],
    keep="last",
    inplace=True,
)
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   online_store  907246 non-null  object        
 1   product_id    907246 non-null  object        
 2   quantity      907246 non-null  Int64         
 3   action_type   907246 non-null  object        
 4   action_dt     907246 non-null  datetime64[ns]
dtypes: Int64(1), datetime64[ns](1), object(3)
# Polars
user_actions_pl_df = user_actions_pl_df.unique(
    subset=["online_store", "action_type", "action_dt"],
    keep="last",
)
┌────────────┬──────────────┬────────────┬──────────┬─────────────┬────────────────────────────┐
│ statistic  ┆ online_store ┆ product_id ┆ quantity ┆ action_type ┆ action_dt                  │
│ ---        ┆ ---          ┆ ---        ┆ ---      ┆ ---         ┆ ---                        │
│ str        ┆ str          ┆ str        ┆ f64      ┆ str         ┆ str                        │
╞════════════╪══════════════╪════════════╪══════════╪═════════════╪════════════════════════════╡
│ count      ┆ 907246       ┆ 907246     ┆ 907246.0 ┆ 907246      ┆ 907246                     │
│ null_count ┆ 0            ┆ 0          ┆ 0.0      ┆ 0           ┆ 0                          │
└────────────┴──────────────┴────────────┴──────────┴─────────────┴────────────────────────────┘

Filtering Data

After the data cleaning phase, we need to filter the relevant data for future calculations. In Polars, this is done using the method with a quite descriptive name, filter().

Rows where the filter does not evaluate to True are discarded, including nulls.

# Pandas
user_actions_pd_df = user_actions_pd_df.loc[
    user_actions_pd_df["action_type"] == "purchase"
]
Int64Index: 262237 entries, 11 to 995
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   online_store  262237 non-null  object        
 1   product_id    262237 non-null  object        
 2   quantity      262237 non-null  Int64         
 3   action_type   262237 non-null  object        
 4   action_dt     262237 non-null  datetime64[ns]
dtypes: Int64(1), datetime64[ns](1), object(3)
# Polars
user_actions_pl_df = user_actions_pl_df.filter(
    pl.col("action_type") == "purchase"
)
┌────────────┬──────────────┬────────────┬──────────┬─────────────┬────────────────────────────┐
│ statistic  ┆ online_store ┆ product_id ┆ quantity ┆ action_type ┆ action_dt                  │
│ ---        ┆ ---          ┆ ---        ┆ ---      ┆ ---         ┆ ---                        │
│ str        ┆ str          ┆ str        ┆ f64      ┆ str         ┆ str                        │
╞════════════╪══════════════╪════════════╪══════════╪═════════════╪════════════════════════════╡
│ count      ┆ 262237       ┆ 262237     ┆ 262237.0 ┆ 262237      ┆ 262237                     │
│ null_count ┆ 0            ┆ 0          ┆ 0.0      ┆ 0           ┆ 0                          │
└────────────┴──────────────┴────────────┴──────────┴─────────────┴────────────────────────────┘

Selecting Required Columns

After filtering the data, you may need to retain only the columns relevant for further analysis. In Polars, this is achieved using the select() method.

# Pandas
user_actions_pd_df = user_actions_pd_df[
    ["online_store", "action_type", "product_id", "quantity"]
]
# Polars
user_actions_pl_df = user_actions_pl_df.select(
    "online_store", "action_type", "product_id", "quantity"
)

Grouping Data

After preparing the data, we can aggregate it to get the sum of quantity for each online store and product. I will also retain action_type for further steps. We use the group_by() method in Polars, which is similar to the groupby() method in Pandas.

# Pandas
user_actions_pd_df = (
    user_actions_pd_df.groupby(["online_store", "product_id", "action_type"])
    .agg({"quantity": "sum"})
    .reset_index()
)
  online_store product_id action_type  quantity
0        Shop1       0001    purchase     57772
1        Shop1       0002    purchase     58015
2        Shop1       0003    purchase     58242
3        Shop2       0001    purchase     58256
4        Shop2       0002    purchase     58648
5        Shop2       0003    purchase     58458
6        Shop3       0001    purchase     57891
7        Shop3       0002    purchase     58326
8        Shop3       0003    purchase     59107
# Polars
user_actions_pl_df = (
    user_actions_pl_df.group_by(["online_store", "product_id", "action_type"])
    .agg(pl.col("quantity").sum())
)
┌──────────────┬────────────┬─────────────┬──────────┐
│ online_store ┆ product_id ┆ action_type ┆ quantity │
│ ---          ┆ ---        ┆ ---         ┆ ---      │
│ str          ┆ str        ┆ str         ┆ i32      │
╞══════════════╪════════════╪═════════════╪══════════╡
│ Shop1        ┆ 0001       ┆ purchase    ┆ 57772    │
│ Shop1        ┆ 0002       ┆ purchase    ┆ 58015    │
│ Shop1        ┆ 0003       ┆ purchase    ┆ 58242    │
│ Shop2        ┆ 0001       ┆ purchase    ┆ 58256    │
│ Shop2        ┆ 0002       ┆ purchase    ┆ 58648    │
│ Shop2        ┆ 0003       ┆ purchase    ┆ 58458    │
│ Shop3        ┆ 0001       ┆ purchase    ┆ 57891    │
│ Shop3        ┆ 0002       ┆ purchase    ┆ 58326    │
│ Shop3        ┆ 0003       ┆ purchase    ┆ 59107    │
└──────────────┴────────────┴─────────────┴──────────┘

Joining Data with Another DataFrame

To calculate the total purchases, we need to join our data with the price catalog. In Pandas, we have two methods for this, join() and merge(), which differ in their specifics and functionality. In Polars, we use only the join() method.

# Pandas
user_actions_pd_df = user_actions_pd_df.merge(product_catalog_pd_df, on='product_id')
  online_store product_id action_type  quantity  price
0        Shop1       0001    purchase     57772    100
3        Shop1       0002    purchase     58015     25
6        Shop1       0003    purchase     58242     80
1        Shop2       0001    purchase     58256    100
4        Shop2       0002    purchase     58648     25
7        Shop2       0003    purchase     58458     80
2        Shop3       0001    purchase     57891    100
5        Shop3       0002    purchase     58326     25
8        Shop3       0003    purchase     59107     80
# Polars
user_actions_pl_df = user_actions_pl_df.join(product_catalog_pl_df, on='product_id')
┌──────────────┬────────────┬─────────────┬──────────┬───────┐
│ online_store ┆ product_id ┆ action_type ┆ quantity ┆ price │
│ ---          ┆ ---        ┆ ---         ┆ ---      ┆ ---   │
│ str          ┆ str        ┆ str         ┆ i32      ┆ i64   │
╞══════════════╪════════════╪═════════════╪══════════╪═══════╡
│ Shop1        ┆ 0001       ┆ purchase    ┆ 57772    ┆ 100   │
│ Shop1        ┆ 0002       ┆ purchase    ┆ 58015    ┆ 25    │
│ Shop1        ┆ 0003       ┆ purchase    ┆ 58242    ┆ 80    │
│ Shop2        ┆ 0001       ┆ purchase    ┆ 58256    ┆ 100   │
│ Shop2        ┆ 0002       ┆ purchase    ┆ 58648    ┆ 25    │
│ Shop2        ┆ 0003       ┆ purchase    ┆ 58458    ┆ 80    │
│ Shop3        ┆ 0001       ┆ purchase    ┆ 57891    ┆ 100   │
│ Shop3        ┆ 0002       ┆ purchase    ┆ 58326    ┆ 25    │
│ Shop3        ┆ 0003       ┆ purchase    ┆ 59107    ┆ 80    │
└──────────────┴────────────┴─────────────┴──────────┴───────┘

In Polars, the how parameter supports the following values: {'inner', 'left', 'right', 'full', 'semi', 'anti', 'cross'} .

Calculating a New Column

To calculate a new column or modify an existing column, Polars uses the with_columns() method. To set an alias for a column, you can use alias().

# Pandas
user_actions_pd_df["total"] = (
    user_actions_pd_df["price"] * user_actions_pd_df["quantity"]
)
user_actions_pd_df = user_actions_pd_df[
    ["online_store", "action_type", "total"]
]
# Polars
user_actions_pl_df = user_actions_pl_df.with_columns(
    (pl.col("price") * pl.col("quantity")).alias("total")
)
user_actions_pl_df = user_actions_pl_df.select(
    "online_store", "action_type", "total"
)

Alternatively, you can calculate a new column directly within the select() method:

# Polars
user_actions_pl_df = user_actions_pl_df.select(
    "online_store",
    "action_type",
    (pl.col("price") * pl.col("quantity")).alias("total"),
)

Creating a Pivot Table

Our final step is to create a pivot table. We have already calculated the total sales for each product, and now we will easily calculate the total sales for each online store. In Pandas, we use the pivot_table() method, which allows for the application of aggregate functions. In Polars, we use the pivot() method to create the pivot table.

# Pandas
result_pd = user_actions_pd_df.pivot_table(
    columns="online_store",
    index="action_type",
    values="total",
    aggfunc="sum",
)
online_store     Shop1     Shop2     Shop3
action_type                               
purchase      11886935  11968440  11975810
# Polars
result_pl = user_actions_pl_df.pivot(
    columns="online_store",
    index="action_type",
    values="total",
    aggregate_function="sum",
)
┌─────────────┬──────────┬──────────┬──────────┐
│ action_type ┆ Shop1    ┆ Shop2    ┆ Shop3    │
│ ---         ┆ ---      ┆ ---      ┆ ---      │
│ str         ┆ i64      ┆ i64      ┆ i64      │
╞═════════════╪══════════╪══════════╪══════════╡
│ purchase    ┆ 11886935 ┆ 11968440 ┆ 11975810 │
└─────────────┴──────────┴──────────┴──────────┘

DeprecationWarning: The argument columns for pl.DataFrame.pivotis deprecated. It has been renamed toon`.

Here we are, concluding our little journey. As we can see, the results for both Pandas and Polars match. Everyone who made it to this point is great and incredibly hardworking – you will succeed!

Summary

In this article, we explored Polars using practical examples and comparisons with Pandas. I demonstrated how to handle data preparation, descriptive statistics, missing values, duplicates, filtering, column selection, grouping, merging, and pivot tables. By showcasing these tasks with both Pandas and Polars, I highlighted the ease of using Polars and transitioning to it from Pandas. This guide serves as a practical introduction to leveraging Polars for efficient data analysis.

Recommended read:

Thanks for Reading!

If you enjoyed this article and want to support my work, the best way is to follow me on Medium. Let's connect on LinkedIn if you're also interested in working with data like I am. Your claps are greatly appreciated – they help me know how useful this post was for you.

Tags: Data Science Pandas Polars Python Tips And Tricks

Comment