11 Useful Pandas Functionalities You Might Have Overlooked

Author: Murphy  |  Views: 29954  |  Time: 2025-03-23 18:51:57

I am quite sure that Pandas does not need an introduction. In the third article of the series, we will continue to explore some of the useful functionalities of pandas that you might not have heard about.

In case you want to check out the previous parts, you can find them below:

9 Useful Pandas Methods You Might Have Not Heard About

8 More Useful Pandas Functionalities For Your Analyses

Let's dive right into it!

A short setup

As we will be exploring the functionalities of pandas using simple toy examples, we do not need a lengthy list of libraries for this task.

import pandas as pd
import numpy as np

Now we are ready to explore some of the hidden gems of pandas.

1. nth

As the name suggests, this method returns the nth row (or rows). Using Python's indexing convention, nth(0) returns the first row.

While this approach might seem very similar to the head/tail methods, we can use the nth method for a more flexible selection. For example, we can use it if we need to access the first and third rows within a group:
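For instance, a minimal sketch with a toy DataFrame (the column names and values are assumptions made for illustration):

```python
import pandas as pd
import numpy as np

# A toy DataFrame with a grouping column; names are illustrative
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "x": [np.nan, 1.0, 2.0, 3.0, 4.0, 5.0],
})

# Select the first and third rows within each group
df.groupby("group").nth([0, 2])
```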

You might also notice a similarity to the first/last methods, which return the first/last non-null value within each group. The key part of the previous sentence is non-null. Using the same grouping condition, the first method returns the following values:

The difference in behavior between the nth method and the first/last methods is clearly visible for group a in the x column. By using the first method, we skip the missing value in the first row.
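That difference can be sketched as follows, assuming a toy DataFrame with a missing value in group a:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "x": [np.nan, 1.0, 2.0, 3.0],
})

# nth(0) returns the first row as-is, NaN included
df.groupby("group")["x"].nth(0)

# first() skips the NaN and returns 1.0 for group "a"
df.groupby("group")["x"].first()
```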

2. pop

This simple method removes a single column from a DataFrame and returns it as a new Series object.

If we inspect the X object after running the snippet, it will be a DataFrame with a single column called x.
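A quick sketch (the column names are assumptions):

```python
import pandas as pd

X = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# pop removes the column from X in place and returns it as a Series
y = X.pop("y")
```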

3. compare

While manipulating data, we often want to identify the differences between two DataFrames. To do just that, we can use the compare method.
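A minimal reconstruction consistent with the output described below (the column name and values are assumptions):

```python
import pandas as pd

a = pd.DataFrame({"x": [0, 1, 2, 3, 4]})
b = a.copy()
b.loc[4, "x"] = 9  # introduce a single difference in the 4th row

a.compare(b)
```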

The output shows the identified differences. The only difference occurs in the 4th row, in which the value in the first DataFrame (referred to as self) is 4, while in the second one (referred to as other) it is 9.

As the method returns only the differences, comparing a DataFrame with itself returns an empty DataFrame.

a.compare(a)

It is worth mentioning that the compare method has quite a few arguments we can use to modify its behavior. For example, we can use the keep_shape argument to keep the original shape of the DataFrames and replace all identical values with NaNs.

If we additionally specify the keep_equal argument, the NaN values used to mark the identical values will be replaced with the actual values.
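Both arguments can be sketched on the same toy DataFrames as before (the data is assumed):

```python
import pandas as pd

a = pd.DataFrame({"x": [0, 1, 2, 3, 4]})
b = a.copy()
b.loc[4, "x"] = 9

# Keep the original shape; identical values are shown as NaN
a.compare(b, keep_shape=True)

# Additionally show the identical values instead of NaN
a.compare(b, keep_shape=True, keep_equal=True)
```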

One thing to keep in mind is that we can only compare DataFrames that are identically labeled and shaped.

4. align

align is a useful method we can use to align two DataFrames on their axes by using a specified join type. It will be easier to understand what it means with an example:
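Something along these lines (the data and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical features X and target y
X = pd.DataFrame({"a": [1, 2, 3, 4, 5], "y": [0, 1, 0, 1, 0]})
y = X.pop("y")

# Filtering X breaks the alignment: X now has 3 rows, y still has 5
X = X[X["a"] > 2]

# Re-align both objects on their indices using an inner join
X, y = X.align(y, join="inner", axis=0)
```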

First, we have a DataFrame called X, from which we remove a single column called y. In the ML context, you might think of those as features and the target, but that is not the only possible interpretation.

Coming back to the example, we then manually filtered the X DataFrame, which caused the objects not to be aligned anymore – the y Series has more observations than the X DataFrame.

To once again align the objects using their indices, we use the align method. In this case, we would like to have matching indices, so we opted for an inner join.

Now, both objects have 3 rows with the same indices.

5. to_markdown

Oftentimes, we want to embed a DataFrame into a report or paper. We can easily do that using the to_markdown method.

We can use the tablefmt argument to choose one of the 30+ available table formats. For a complete list of available formats, please refer to the documentation. In the following snippet, we generate one of the LaTeX-compatible table types.

6. convert_dtypes

pandas offers many options to handle data type conversions. In my previous article, I explained how to leverage certain data types to optimize for memory usage. In that approach, we had to choose the data types manually.

Alternatively, we can use the convert_dtypes method, which converts columns to the (hopefully) best-suited data type. To manage expectations: pandas will try to figure out the best data type to represent a given column, which does not necessarily mean that it will optimize for memory usage. So if memory is the issue, we will probably still have to indicate the data types ourselves.

Coming back to the example, let's first define a DataFrame with columns containing different data types. The columns will also contain some missing values.

Then, let's inspect the data types.

Now, let's use the convert_dtypes method on our DataFrame.

By inspecting the output, we can also see that pandas now supports missing values for integer columns (using pd.NA), so they no longer need to be represented as floats.

Let's again inspect the data types. We can see that the data types were adjusted by the convert_dtypes method.
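The whole flow might look as follows, assuming a toy DataFrame with made-up column names:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "a": [1, 2, np.nan],        # integers stored as float64 because of the NaN
    "b": ["x", "y", None],      # object
    "c": [1.5, 2.5, np.nan],    # float64
    "d": [True, False, None],   # object because of the None
})
print(df.dtypes)

df_converted = df.convert_dtypes()
print(df_converted.dtypes)  # e.g. "a" becomes a nullable Int64 column
```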

The convert_dtypes method also has useful arguments which we can use to fine-tune its behavior. For example, we can use the convert_boolean argument to indicate that we prefer to keep boolean columns encoded using ones and zeros instead of the True/False values.

We can see that the d column now contains an integer representation of the boolean values.
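For instance, on a plain boolean column (a minimal sketch, with an assumed column name):

```python
import pandas as pd

df = pd.DataFrame({"d": [True, False, True]})

# With convert_boolean=False, booleans end up as nullable integers (0/1)
df.convert_dtypes(convert_boolean=False)["d"]
```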

7. ordered CategoricalDtype

Speaking of data types, we already know that we can use the category data type to store repeating string values in a more memory-efficient way, potentially saving a lot of memory.

To convert a string column into a categorical one, we can use the following snippet:

df["x"].astype("category")

To take it one step further, we can also use ordered categorical data types. For example, let's imagine that we are working for a clothing company and we have items in 5 sizes.

To encode that information while keeping the logic of the sizes, we can create a custom CategoricalDtype data type with the correct order of the categorical values and specify the ordered argument as True. Finally, we can convert a column to the newly created data type using the familiar astype method.

What is the benefit of using the ordered categorical data type? For example, we can easily sort by that category and obtain relevant sorting instead of using the alphabetical order of the string names.

Additionally, we can also use ordered categorical data type for more relevant filtering. In the following snippet, we want to keep only sizes larger than M.
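Putting the three steps together (the sizes and the column name are assumptions):

```python
import pandas as pd

# Custom ordered categorical type with the logical order of sizes
sizes = pd.CategoricalDtype(["XS", "S", "M", "L", "XL"], ordered=True)

df = pd.DataFrame({"size": ["M", "XS", "XL", "S", "L"]})
df["size"] = df["size"].astype(sizes)

# Sorting follows the logical order instead of the alphabetical one
df.sort_values("size")

# Ordered categories support comparisons: keep only sizes larger than M
df[df["size"] > "M"]
```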

8. SparseDtype

We have already discussed using the categorical data type as a potential way of reducing the memory used by pandas. Another way of doing that is by leveraging the sparse data type.

For example, we can have numerical columns that contain mostly zeroes. We can significantly reduce the memory consumption by treating such columns as sparse. To be precise, the majority value does not need to be zero; it could be NaN or any other value, as long as that single value is repeated frequently.

Under the hood, sparse objects are compressed in such a way that any data matching a specific value (0, NaN, or whichever value is in the majority) is omitted. To save space, such compressed values are not actually stored in the array.

Let's generate a massive DataFrame with the majority of values equal to zero.

df = pd.DataFrame(np.random.randint(0, 100, size=(10000000, 5)))
df[df <= 90] = 0

Then, we define a helper function used for evaluating the memory consumption of a DataFrame:

def memory_usage(df):
    """Return the DataFrame's memory usage in MB."""
    return round(df.memory_usage(deep=True).sum() / 1024 ** 2, 2)

The initial consumption is:

We can try to downcast the numeric type to the smallest available one – uint8.

Doing so resulted in an 88% reduction of the used memory. As the next step, let's use the sparse data type.

By leveraging the sparse data type, we achieved a 55% reduction compared to the previous solution (using the uint8 data type) and a 94% reduction compared to the initial DataFrame.
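The whole experiment can be sketched as follows, using a smaller DataFrame than above so the exact percentages will differ slightly:

```python
import numpy as np
import pandas as pd

# Smaller than the 10M-row DataFrame above, for illustration
df = pd.DataFrame(np.random.randint(0, 100, size=(100_000, 5)))
df[df <= 90] = 0  # ~91% of the values are now zero

def memory_usage(df):
    """Return the DataFrame's memory usage in MB."""
    return round(df.memory_usage(deep=True).sum() / 1024 ** 2, 2)

dense = memory_usage(df)                                      # int64
downcast = memory_usage(df.astype("uint8"))                   # uint8
sparse = memory_usage(df.astype(pd.SparseDtype("uint8", 0)))  # sparse uint8

print(dense, downcast, sparse)
```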

9. crosstab

Aggregating is a great way to summarize vast amounts of data into a comprehensive and informative summary. One of the pandas functions used to aggregate data is crosstab. By default, it counts the occurrences of certain combinations of values in the columns of a DataFrame.

First, let's generate a DataFrame with some categorical and numerical values.

Using the simplest form of the crosstab function generates the following table of the value counts between the region and category columns.
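For example, with randomly generated toy data (the column names and values are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "region": rng.choice(["North", "South"], size=100),
    "category": rng.choice(["A", "B", "C"], size=100),
    "sales": rng.integers(10, 1000, size=100),
})

# Count the occurrences of each region/category combination
pd.crosstab(df["region"], df["category"])
```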

We can take it a step further and aggregate over multiple categories at once.

We can also display the row-wise and column-wise totals by using the margins argument.

If we are not interested in the counts but the distribution, we can use the normalize argument to show the percentages.

We can also combine the margins and normalize arguments so that the normalization is done across rows or columns. To do that, we should pass "index" or "columns" to the normalize argument, while margins is set to True.

Lastly, we can also use crosstab to aggregate numeric data. Using the following snippet, we calculate the average sales over the regions and categories.
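The variations above can be sketched as follows, reusing the same kind of toy data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": rng.choice(["North", "South"], size=100),
    "category": rng.choice(["A", "B", "C"], size=100),
    "sales": rng.integers(10, 1000, size=100),
})

# Add row- and column-wise totals
pd.crosstab(df["region"], df["category"], margins=True)

# Show percentages instead of counts
pd.crosstab(df["region"], df["category"], normalize=True)

# Normalize within each row
pd.crosstab(df["region"], df["category"], normalize="index")

# Average sales per region/category combination
pd.crosstab(df["region"], df["category"], values=df["sales"], aggfunc="mean")
```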

10. swaplevel

To be honest, I am not a fan of working with a MultiIndex, as it always results in a small headache. However, swaplevel is one of the methods that can make working with a MultiIndex easier. It simply swaps the positions of two levels within a MultiIndex.

Let's re-use some of the code from the crosstab example and generate a simple aggregation table.

If we want to switch the position of the index levels, we can use the swaplevel method.

For more complex cases, we can provide the integer positions of the levels we want to swap, as well as the axis (rows or columns) over which we want to swap them.
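A minimal sketch, assuming a groupby aggregation (instead of the crosstab table) to obtain a two-level MultiIndex:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "region": rng.choice(["North", "South"], size=100),
    "category": rng.choice(["A", "B", "C"], size=100),
    "sales": rng.integers(10, 1000, size=100),
})

# An aggregation with a two-level MultiIndex: (region, category)
agg = df.groupby(["region", "category"])["sales"].mean()

# Swap the levels so that category becomes the outer level;
# swaplevel() is equivalent to swaplevel(0, 1) here.
# DataFrames additionally accept an axis argument, e.g. swaplevel(0, 1, axis=1)
swapped = agg.swaplevel()
```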

Naturally, this was a simplified example, and we could have just swapped the levels by changing the order of the columns in the crosstab function. But I hope you get the idea.

Tags: Data Science Education Machine Learning Pandas Python
