Level Up Your Pandas Game with These 15 Hidden Gems


What I genuinely love about Pandas is that you can spend years working with it, and there are probably still quite a few useful methods that you are not aware of. That is why, in the fourth part of the series, I will show you a few more methods you might not have heard about but which will definitely be useful for data wrangling. Before we jump into it, you might want to check out the previous parts of the series:

We will cover the methods in alphabetical order, not in terms of their usefulness. Let's jump right into it!

Setup

This time, we won't need any complex libraries, so we will just import the basics:
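A minimal version of those imports, which all the snippets below assume, might look like this:

```python
import numpy as np
import pandas as pd

pd.__version__  # the article assumes 2.2.0
```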

To ensure you can follow along, we are using pandas 2.2.0, which is the latest version available at the time of writing this article.

agg

You are probably already familiar with performing aggregations in pandas using methods such as sum or min. You have also probably used these methods in combination with groupby. Therefore, it will not come as a surprise that the agg method is used to perform one or more aggregations on a DataFrame. What is interesting is that we can use agg in a few ways, depending on the syntax we use. Let's illustrate this with some examples.

By passing a dictionary to the agg method, we indicate which aggregations (sum, mean, max, etc.) we want to calculate for each column of the DataFrame. The keys of the dictionary represent the columns on which we want to perform the aggregations, while the values represent the operations we want to execute.
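Here is a sketch of that syntax; the DataFrame df and its values are hypothetical stand-ins for the article's example data:

```python
# hypothetical example data
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [10, 20, 30, 40],
    "C": [100, 200, 300, 400],
})

# one aggregation per column: keys are columns, values are operations
df.agg({"A": "sum", "B": "mean", "C": "max"})
```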

We are not restricted to passing only a single operation per column either. In the next snippet, we indicate that we want to calculate both the sum and the mean of column A.
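A sketch of that variant, here requesting two operations for column A (and one for B, so the NaN behavior described below becomes visible):

```python
# multiple aggregations for column A, a single one for B
df.agg({"A": ["sum", "mean"], "B": ["max"]})
```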

In the resulting DataFrame, we can observe some NaN values, corresponding to combinations of columns and operations that we have not requested in our dictionary.

For the next example, let's make it a bit more complex and add a grouping column. This time, we would like to calculate the mean and sum for all three columns within each of the two groups.
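A sketch, adding a hypothetical group column to the same df:

```python
# hypothetical grouping column with two groups
df["group"] = ["g1", "g1", "g2", "g2"]

# mean and sum of A, B, and C within each group
df.groupby("group").agg(["mean", "sum"])
```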

Lastly, we use a slightly different syntax to create new columns with names of our choosing. Naturally, we also indicate which operation we want to execute on each column, all within the groups specified with the groupby clause.
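A sketch of the named-aggregation syntax; the output column names total_a and average_b are arbitrary choices:

```python
df.groupby("group").agg(
    total_a=("A", "sum"),     # new column "total_a" = sum of A per group
    average_b=("B", "mean"),  # new column "average_b" = mean of B per group
)
```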

assign

The assign method is used to create a new DataFrame with additional columns, assigning values based on existing columns or operations.
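A minimal sketch with hypothetical data:

```python
# hypothetical example data
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# returns a new DataFrame with an extra column C
df.assign(C=df["A"] + df["B"])
```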

Note that assign returns a new DataFrame: unless we assign the result to an object (or back to df), the DataFrame df is not modified in place.

I find this method most useful within chained operations, where we are not necessarily interested in the intermediate steps. For example, we would like to aggregate based on the new column, but we do not want to add it to the original DataFrame.
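For instance, a sketch along these lines (the bucketing rule is made up for illustration):

```python
(
    df
    .assign(is_large=lambda d: d["A"] > 1)  # temporary column, never stored in df
    .groupby("is_large")["B"]
    .mean()
)
```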

combine_first

The combine_first method is used to combine two Series (or columns in a DataFrame), choosing values from the first Series and filling in any missing values with the corresponding values from the second Series.

The combine_first method in pandas works similarly to the COALESCE function in SQL.
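A sketch with two hypothetical Series, where the last position is missing in both:

```python
s1 = pd.Series([1, np.nan, 3, np.nan])
s2 = pd.Series([10, 20, 30, np.nan])

# take values from s1, fill gaps from s2
s1.combine_first(s2)
```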

As we can see, the missing values from the s1 Series were filled in with the values from the s2 Series. If a value is missing from both Series, it stays missing in the result.

Naturally, this method can also be used with columns of a DataFrame. Let's convert our two Series into a DataFrame and illustrate how the combine_first method works.
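A sketch reusing the two Series from above:

```python
df = pd.DataFrame({"s1": s1, "s2": s2})

# fill missing values in column s1 from column s2
df["s1"].combine_first(df["s2"])
```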

We can also chain combine_first calls if we want to use more than two columns as potential sources for filling in the missing values.
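A sketch with a third, fully populated Series:

```python
s3 = pd.Series([100, 200, 300, 400])

# fall back to s2 first, then to s3
s1.combine_first(s2).combine_first(s3)
```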

As we can see, the last missing value is now gone, as the value is picked up from the third Series.

cumsum / cummin / cummax / cumprod

While you might have heard about the cumsum method, I'd say the chances are high that you have never used the other three. In short, the cumsum, cummin, cummax, and cumprod methods are used to calculate cumulative sums, minimums, maximums, and products for elements in a Series or DataFrame.
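A quick sketch on a hypothetical Series (expected outputs in the comments):

```python
s = pd.Series([2, 1, 3, 4])

s.cumsum()   # running sum:     2, 3, 6, 10
s.cummin()   # running minimum: 2, 1, 1, 1
s.cummax()   # running maximum: 2, 2, 3, 4
s.cumprod()  # running product: 2, 2, 6, 24
```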

Pro tip: Those methods can also be used together with groupby!
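For example, a sketch of a per-group running total:

```python
# hypothetical example data
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "value": [1, 2, 3, 4, 5],
})

# cumulative sum restarting within each group: 1, 3, 6, 4, 9
df.groupby("group")["value"].cumsum()
```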

cut / qcut

Let's start with the simpler one. By default, the cut function divides the data into bins of equal width. This means that each bin has the same range, but the number of data points in each bin may vary. We can observe this behavior below, where we divide the years of experience into three bins.
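A sketch of that example; the years of experience are hypothetical values chosen so the numbers below work out:

```python
# hypothetical years of experience
years = pd.Series([0, 3, 5, 7, 9, 10], name="years_of_experience")

# three bins of equal width
pd.cut(years, bins=3)
```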

As we can see in the output, the cut function categorized all the observations into one of three bins of equal width (approximately 3.33).

One thing that might require a bit of a refresher here is the interval notation. Let's look at some hypothetical numbers to understand the notation:

  • (0, 10): The interval is open on both sides, meaning it includes neither 0 nor 10.
  • [0, 10]: The interval is closed on both sides, meaning it includes both 0 and 10.
  • [0, 10): The interval is closed on the left side, meaning 0 is included, but 10 is not included.
  • (0, 10]: The interval is closed on the right side, meaning 0 is not included, but 10 is included.

We have already covered the simplest use case of cut. As the next step, we can also provide custom bin edges. In the snippet below, we define the three seniority levels ourselves. As you can see, when defining the edges ourselves, the resulting bins do not have to be of equal width.
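A sketch with made-up seniority cut-offs at 2 and 5 years:

```python
# custom bin edges: (0, 2], (2, 5], (5, 10]
pd.cut(years, bins=[0, 2, 5, 10])
```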

We can see one NaN value in the output. That is because the left edge of the first interval is open, so the lowest value does not fall into any bin. We can fix that by setting include_lowest to True.
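Along these lines:

```python
# the first interval now also covers the lowest value (0)
pd.cut(years, bins=[0, 2, 5, 10], include_lowest=True)
```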

And to wrap it up, we can also assign custom label names to our bins. As we split the years of experience, let's assign seniority labels to each of the bins.
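A sketch with hypothetical label names:

```python
pd.cut(
    years,
    bins=[0, 2, 5, 10],
    labels=["junior", "mid", "senior"],
    include_lowest=True,
)
```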

We have covered the simpler variant, now let's look into qcut. The qcut function divides the data into bins based on quantiles. This ensures that each bin has approximately the same number of data points. In the snippet below, we indicate that we want to split the data into three quantile-based bins.
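A sketch on the same hypothetical data:

```python
# three quantile-based bins with (roughly) equal counts
pd.qcut(years, q=3)
```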

As we can see in the output, the 3 bins created using qcut do not have the same width, but exactly 2 observations fall into each of the bins. Similar to cut, we can assign label names.
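For example:

```python
pd.qcut(years, q=3, labels=["junior", "mid", "senior"])
```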

In summary, cut divides data into bins of equal width, while qcut divides data into bins based on quantiles, ensuring roughly equal-sized bins in terms of the number of data points.

duplicated / drop_duplicates

The duplicated method returns a boolean Series indicating whether each row in a DataFrame (or each element in a Series) is duplicated (True) or not (False).

In the following example, we create a DataFrame with a total of 4 duplicated observations (2 pairs).
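A sketch of such a DataFrame; rows 0 & 2 and 1 & 4 are the two duplicated pairs:

```python
# hypothetical data with two duplicated pairs (Alice/NY and Bob/LA)
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Carol", "Bob", "Carol"],
    "city": ["NY", "LA", "NY", "SF", "LA", "NY"],
})

# True marks the second occurrence of each duplicated row
df.duplicated()
```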

Now we can use that flag to filter out duplicated rows or only keep the duplicated ones. Using the default setting (keep="first") of duplicated, we keep the first matched observation, and all the subsequent ones are marked as duplicates (with True). Alternatively, we can keep the last duplicated observation using the following snippet:
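Along these lines:

```python
# mark the first occurrence instead, keep the last
df.duplicated(keep="last")
```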

Alternatively, we can mark all the duplicated rows:
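For example:

```python
# mark every member of a duplicated pair
df.duplicated(keep=False)
```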

In this example, we checked complete rows of the DataFrame to see if they are duplicated. Alternatively, we can check for duplicates only in a selection of columns using the subset argument.
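A sketch using only the name column; note that Carol now also counts as duplicated, even though her two rows differ in the city column:

```python
df.duplicated(subset=["name"])
```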

We have already seen how to mark the duplicated rows. We can also directly drop the duplicates using the drop_duplicates method.
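For example:

```python
# keep the first occurrence of each duplicated row
df.drop_duplicates()
```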

In this method, you can also use the keep and subset arguments, just as we have explained them in the case of the duplicated method.

isin

The isin method is used to filter Series and DataFrames by selecting observations where values within a given column are present in a specified list or another Series. The method returns a boolean Series showing us whether each observation in the column is within the specified values.
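A sketch with hypothetical city data:

```python
# hypothetical example data
df = pd.DataFrame({
    "city": ["NY", "LA", "SF", "Boston"],
    "population_m": [8.5, 3.9, 0.9, 0.7],
})

mask = df["city"].isin(["NY", "SF"])  # boolean Series
df[mask]                              # keep only the matching rows
```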

merge_ordered

We can use the merge_ordered function to perform an ordered merge of two DataFrames. This function is particularly useful when we have two DataFrames with sorted keys and we want to merge them while maintaining the order of the keys. This can come in handy when working with time-series data.
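A sketch with two small, hypothetical price tables whose date columns are deliberately in descending order:

```python
# hypothetical time-series data, dates deliberately in descending order
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-03", "2024-01-01"]),
    "price_a": [105, 103, 101],
})
df2 = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-04", "2024-01-02"]),
    "price_b": [204, 202],
})

# outer join by default, sorted by the key, gaps forward-filled
pd.merge_ordered(df1, df2, on="date", fill_method="ffill")
```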

There are a few things to notice in the snippet:

  • I have deliberately reversed the order of both date columns, yet the outcome is still sorted by the date column in ascending order.
  • The default version of the join in merge_ordered is an outer join (instead of the inner join for merge).
  • We can use the fill_method argument to fill in missing values (forward-fill via "ffill" is the supported approach).

For comparison, you can see below how the combined DataFrame would have looked if we had used merge with an outer join.
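Something along these lines:

```python
# outer join without filling: the gaps stay as NaN
pd.merge(df1, df2, on="date", how="outer")
```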

pct_change

The pct_change method is used to calculate the percentage change between the current and a prior element in a Series or DataFrame. Most frequently, it is used to analyze the percentage change in values over time, especially in financial time series data, for example, stock prices.
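A sketch with hypothetical prices (expected output in the comments):

```python
# hypothetical daily closing prices
prices = pd.Series([100, 105, 103, 110])

# change relative to the previous element: NaN, 0.05, -0.019..., 0.0679...
prices.pct_change()
```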

select_dtypes

select_dtypes is used to filter columns in a DataFrame based on their data types. We can use it to select columns that have specific data types, such as numeric types (int64, float64), object types (str, object), boolean or datetime types, etc.

In the first example, we select the integer and float columns, that is, the numeric ones.
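A sketch using a hypothetical DataFrame with mixed data types:

```python
# hypothetical mixed-type data
df = pd.DataFrame({
    "int_col": [1, 2, 3],
    "float_col": [0.1, 0.2, 0.3],
    "str_col": ["a", "b", "c"],
    "bool_col": [True, False, True],
})

df.select_dtypes(include=["int64", "float64"])
```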

Alternatively, we can also select the numeric columns using the "number" keyword.
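For example:

```python
df.select_dtypes(include="number")
```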

If we want to select all object types, we can use the "object" keyword.
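Like so:

```python
df.select_dtypes(include="object")
```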

With select_dtypes, we can also specify which data types we want to exclude. In the next example, we want to select all columns that do not have an object type.
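A sketch:

```python
# everything except object columns (numeric and boolean ones remain)
df.select_dtypes(exclude="object")
```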

Yet another panda searching for more hidden gems. Image generated with Midjourney.

Wrapping up

I hope you have discovered at least one new functionality of pandas that will make your data wrangling easier and more fun!

Tags: Data Analysis Data Science Machine Learning Pandas Python
