Chaining Pandas Operations: Strengths and Limitations
PYTHON PROGRAMMING

The title of this article stresses the strengths and limitations of chaining Pandas operations – but to be honest, I will write about fun.
Why fun? Is it at all important when we have data to analyze?
I don't know what works for you, but for me, fun at work is important. During my 20+ years of experience in Data Science, I've found that the more enjoyment I derive from coding, the more satisfied I am with the task. And I do mean the process of pursuing the task, not just completing it. Of course, achieving results matters, probably the most. But trust me, if you dislike the tools you're using, all you'll want is to finish the job as quickly as possible. This can lead to mistakes, as you might work hastily and overlook important details in the data. And that's something you want to avoid.
I transitioned to Python from R, and analyzing data with R is a lot of fun – thanks to the dplyr syntax. I've always enjoyed it, and I still do. However, when I switched to Python, I found myself preferring it over R. I've never really enjoyed programming in R (note the distinction between analyzing data and programming), while programming in Python is a lot of fun to me. Sure, this is subjective, I'm aware of that. But that doesn't change a thing.
There was one exception, though: Python's Pandas versus R's dplyr. There was no comparison – dplyr won in every aspect of data analysis. I didn't like Pandas at all.
Have you noticed the past tense? "Was one exception." "Was no comparison." "Won in every aspect." "Didn't like Pandas."
I used the past tense because things changed – I actually do like Pandas now. I'm even starting to think that I prefer it over dplyr.
So, what changed?
Pandas pipes
Some time ago, I learned something new about Pandas – something that blew my mind. I learned how to chain Pandas operations in order to create pipelines of them.
It's high time I showed you some code; otherwise, dollars to donuts, I won't convince anyone. I will use a dataframe created in the following way:
>>> import pandas as pd
>>> def make_df():
...     return pd.DataFrame({
...         "x": [1.]*5 + [1.2]*5 + [3.]*5 + [1, 5, 2, 3, 3],
...         "y": [20.]*5 + [21.5]*5 + [35.]*5 + [41, 15, 22, 13, 13],
...         "group": ["a"]*5 + ["b"]*5 + ["c"]*5 + ["d"]*5,
...     })
>>> df = make_df()
I need this function in order to reuse the original dataframe in subsequent examples.
The pipe operator
Pipes are what make R's dplyr so powerful. Using the %>% operator¹, you can chain command after command, and such pipes can be very long – yet still very readable. And it's this readability that makes the difference: that's what makes dplyr code so powerful.
Typical Pandas code uses brackets and assignments. You write line after line of such bracket assignments, and the resulting code looks rather clumsy and overwhelming. And no pipes. Let's look at a simple example:
>>> df["z"] = df["x"] * df["y"]
>>> sum(df[df["group"] == "a"]["z"])
100.0
This is typical Pandas code. You will find it in millions of notebooks and scripts across the world, especially those written by data scientists.
But this is just one style of writing Pandas code. The other style is… pipes! While dplyr uses the %>% operator for chaining, Pandas uses the dot operator (.) – the typical Python operator for accessing attributes of an object. So, everyone who knows Python already knows how to use the dot operator for pipes. You just need to know a little bit about how to do so in Pandas.
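To show the general shape of such a chain on the df defined above – each method returns a dataframe, so the next call attaches directly with a dot – here is a toy illustration (the particular methods are arbitrary, chosen only to show the form):

(
    df
    .head(10)       # take the first ten rows
    .describe()     # summary statistics of that subset
    .round(2)       # rounded for display
)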
Methods for chaining
Of course, it's not only about chaining operations – it's also about the methods to chain. First of all, many pd.DataFrame methods can be chained, such as head(), sum(), isna(), and astype(). Some other methods, however, don't naturally fit the piping approach. Examples include pivot(), which reshapes the data frame, and groupby(), which groups rows for aggregation.
Some methods, however, were designed exactly for this very purpose: to be chained. They can be used in place of code that cannot be chained – that is, instead of bracket assignments. Code consisting of a chain of Pandas operations can be much more readable than a sequence of bracket assignments. Certainly not to someone who doesn't know this type of syntax, but you don't need much time to get used to it.
The methods that I use most often in place of bracket assignments are listed below; a short sketch chaining several of them follows the list:
- filter(): to subset a data frame's rows or columns according to the specified index labels
- drop(): to remove rows or columns using label names and the corresponding axis (rows or columns), or by directly specifying index or column names
- query(): to query the columns of a data frame with a boolean expression provided as a string
- assign(): to assign (create) new columns in a data frame
- squeeze(): to squeeze one-dimensional axis objects into scalars
- pipe(): to call custom functions that take a pd.Series or pd.DataFrame and return one
- where(): to replace values where a condition is False; I consider this method more difficult to grasp and to use than the ones mentioned above
- apply(): to apply an external function to a dataframe's columns
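To give you a feel for how these methods read in practice, here is a minimal sketch chaining a few of them on the df defined earlier (the column names and the threshold are arbitrary, chosen only for illustration):

(
    df
    .assign(z=lambda d: d["x"] * d["y"])    # create a new column z
    .query("z > 25")                        # keep only rows where z > 25
    .filter(["y", "z", "group"])            # keep selected columns
    .drop(columns="y")                      # ...and drop one of them again
    .pipe(lambda d: d.assign(z_scaled=d["z"] / d["z"].max()))  # a custom step
)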
Before I started chaining Pandas operations, I used only the .apply() method from this list. Be aware that these methods can also be used without piping – one after another, line after line, using assignments. A pipe, however, enables you to chain them in a single command.
For me, the above methods constitute the most important part of the Pandas piping API. We'll see examples of how to use them soon. While these methods have proven to be most useful in my projects, there are others that can be quite useful, too; for example:
- dropna()
- fillna()
- rename()
- sort_values()
- value_counts()
These are very useful methods, and I guess if you're a Pandas user, you've used at least some of them.
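Just to illustrate, these methods chain in exactly the same way. A toy example on our df (the renamed column and the sort order are arbitrary):

(
    df
    .rename(columns={"group": "grp"})        # rename the grouping column
    .dropna()                                # drop rows with missing values (none here)
    .sort_values(by="y", ascending=False)    # order by y, largest first
    ["grp"]
    .value_counts()                          # how many rows per group
)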
Learn the basics
If you'd like to learn everything that Pandas offers to create pipes, don't. Well, at least not at once. You can do it step by step.
For the moment, just learn the basics, and you'll most probably see – I hope so – how powerful such syntax can be. Once you know the basics and know that this is your thing, you can learn more advanced tools.
In this article, I will show you the basics only – but you will see that these basic chained operations can help you create very readable code, more readable than typical Pandas code.
Chaining Pandas operations
Okay, it's high time to see what I mean. Let's return to this code:
>>> df["z"] = df["x"] * df["y"]
>>> sum(df[df["group"] == "a"]["z"])
100.0
and reimplement it using piping methods. This time, we will use two statements: since we want df to keep the new column z, we first create it, and only then run the calculation on it.
>>> df = make_df()
>>> df = df.assign(z=df.x * df.y)
>>> df.query("group == 'a'").z.sum()
100.0
Note that I changed df["x"] to df.x, but only to show it makes no difference – both work in exactly the same way (if you'd like to learn the subtle differences between these two approaches, read this article).
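You can verify this equivalence directly; the comparison below is just a sanity check on our df:

>>> df["x"].equals(df.x)
True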
If we only want to calculate the sum, we can do this using one chain:
>>> df = make_df()
>>> df.assign(z=df.x * df.y).query("group == 'a'").z.sum()
100.0
Remember, however, that this way the z column is created on the fly, so df won't contain it once the operation returns:
>>> df.columns
Index(['x', 'y', 'group'], dtype='object')
The query() method filters the dataframe based on a Boolean expression; here, we want to keep the rows in which group is "a". Then we take the z column and calculate its sum.
Let's return for the moment to the assign() method. It creates a new column. Here, we used vectorized Pandas operations, which are fast. Sometimes, however, they won't work due to the complexity of the logic needed to create the column. In that case, you can use a regular Python list comprehension:
>>> df.assign(z=[
...     a + b if a < 2 and b > 20 else a * b
...     for a, b in zip(df["x"], df["y"])
... ]).iloc[[0, 6, 10]]
      x     y group      z
0   1.0  20.0     a   20.0
6   1.2  21.5     b   22.7
10  3.0  35.0     c  105.0
Above, we used a simple example of a short chain of Pandas operations. The beauty of this syntax comes with long chains. Let's consider the following code:
>>> df = make_df()
>>> filtered_df = df[df["x"] > 1].copy()
>>> filtered_df["w"] = filtered_df["x"] ** 2
>>> grouped_df = filtered_df.groupby("group").agg({"y": "mean", "w": "mean"})
>>> grouped_df = grouped_df.reset_index()
>>> result = grouped_df.sort_values(by="y", ascending=False)
>>> result
  group      y      w
1     c  35.00   9.00
0     b  21.50   1.44
2     d  15.75  11.75
Note the use of the .copy() method of the pd.DataFrame object. This method is necessary because, without it, we would be working on a view of the original dataframe rather than a separate copy. Modifying a view can lead to unexpected results, which is why Pandas issues a SettingWithCopyWarning when attempting to change values in a view. By making a copy, we ensure that our operations do not affect the original dataframe and avoid this warning.
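To make this concrete, here is a minimal sketch of both variants; whether the warning actually appears depends on your Pandas version and settings (with copy-on-write enabled, the default behavior in newer Pandas versions, the semantics differ):

subset = df[df["x"] > 1]           # may be a view of df
subset["w"] = subset["x"] ** 2     # can trigger SettingWithCopyWarning

subset = df[df["x"] > 1].copy()    # an explicit, independent copy
subset["w"] = subset["x"] ** 2     # safe: only the copy is modified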
Let's analyze the subsequent steps of this operation:
filtered_df = df[df["x"] > 1]
: Filtersdf
to include only rows wherex
is greater than 1.filtered_df["w"] = filtered_df["x"] ** 2
: Adds a new columnw
tofiltered_df
, which is the square ofx
.grouped_df = filtered_df.groupby("group").agg({"y": "mean", "w": "mean"})
: Groupsfiltered_df
by thegroup
column and calculates the mean ofy
andw
for each group.grouped_df = grouped_df.reset_index()
: Resets the index ofgrouped_df
to convert the group labels back into a column.result = grouped_df.sort_values(by="y", ascending=False)
: Sorts the resulting dataframe by the mean value ofy
in descending order.
Let's refactor this code into one chain of piped Pandas operations:
>>> result_chain = (
...     df[df["x"] > 1]
...     .assign(w=lambda df: df["x"] ** 2)
...     .groupby("group")
...     .agg({"y": "mean", "w": "mean"})
...     .reset_index()
...     .sort_values(by="y", ascending=False)
... )
>>> result_chain
  group      y      w
1     c  35.00   9.00
0     b  21.50   1.44
2     d  15.75  11.75
As you see, we got the very same result, which we can confirm in the following way:
>>> result_chain.equals(result)
True
Let's analyze the chained code:
df[df["x"] > 1]
: Filtersdf
to include only rows wherex
is greater than 1..assign(w=lambda df: df["x"] ** 2)
: Adds a new columnw
, which is the square ofx
..groupby("group")
: Groups the resulting dataframe by thegroup
column..agg({"y": "mean", "w": "mean"})
: Calculates the mean ofy
andw
for each group of the grouped dataframe..reset_index()
: Resets the index to convert the group labels back into a column..sort_values(by="y", ascending=False)
: Sorts the resulting dataframe by the mean value ofy
in descending order.
As you see, both explanations are essentially the same, showing that the two blocks of code do the same thing, applying the corresponding operations one after another. The only difference lies in chaining the operations into a single pipe in the latter version.
Note that in the long pipe code, we're creating new dataframes at each step of the pipeline without modifying the original dataframe (df). This avoids the SettingWithCopyWarning because each operation in the pipeline returns a new dataframe rather than modifying an existing view – hence no need for the .copy() method.
For filtering, I often use the .query() method instead of typical Pandas filtering. It can be slower, but it's more readable to me:
>>> (
...     df
...     .query("x > 1")
...     .assign(w=lambda df: df["x"] ** 2)
...     .groupby("group")
...     .agg({"y": "mean", "w": "mean"})
...     .reset_index()
...     .sort_values(by="y", ascending=False)
... )
  group      y      w
1     c  35.00   9.00
0     b  21.50   1.44
2     d  15.75  11.75
The only difference here is the line with the .query() method. Here, the filtering is pretty simple, but sometimes, when it gets quite complicated, formulating filtering conditions this way looks just more natural – and is easier to read – than filtering using brackets.
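For example, compare a hypothetical multi-condition filter written both ways (the thresholds below are made up purely for illustration):

# bracket filtering
df[(df["x"] > 1) & (df["y"] < 30) & (df["group"].isin(["b", "d"]))]

# the same condition expressed with .query()
df.query("x > 1 and y < 30 and group in ['b', 'd']")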
Which of the two versions do you find more readable? To me, the piped version is more readable than the typical step-by-step approach, because it allows for a clear, smooth, linear flow of operations. Each step is chained to the next in a sequence, making it easy to follow the transformations of the data from start to finish. This approach also reduces the need for intermediate variables, which can clutter the code and make it harder to understand.
Using pipes encourages writing concise and expressive code. This not only improves readability, but also facilitates maintenance and debugging, as the entire sequence of operations is visible at a glance. When you have several temporary variables, as in the non-piped example above, the reader of the code may wonder whether some of them will be used later on or are used only in this particular place.
Simply put, the piped approach reduces unnecessary code and enhances the logical flow of data operations. This makes the code more intuitive and easier to grasp.
You can mix both types of syntax
Sometimes it's just easier to use the typical Pandas syntax. If you know how to do something using this approach, go ahead and do it. You can stop a pipe, use bracket assignment, and then use the resulting object to start a new pipe. I do this from time to time, especially when I don't have enough time to find a piping solution. However, when I do find the time later, I try to revisit and find the piped solution.
This is actually how I'd suggest using pipes in Pandas. While pipes are smoother and more readable, breaking them occasionally isn't a problem. As you can see, I am not a radical.
Remember, though, that when a chained operation is available, the resulting code will likely be simpler and more readable than the equivalent sequence of regular Pandas assignments.
Readability is a great virtue of any code. But it's not the only one. As you will see in the next section, piped operations can sometimes be slower than typical Pandas code.
Performance
How should you decide which Pandas syntax to use based on performance?
That depends. If it's analysis code that no one will see – use whatever syntax you prefer; the better you know it and the more comfortable you feel using it, the better. When you're working on a data product where performance is of the highest importance, with even the slightest gains making a difference (this is often the case when working on dashboards), then you definitely should use a faster approach. In any other situation, you have to weigh the pros and cons and choose the approach that'll work best under the given circumstances.
To make such decisions, you should have at least basic knowledge about the performance of the different types of syntax. Here, we'll conduct some simple benchmarking experiments, but be aware that they will cover just some basic scenarios. When working on your code, you should make thoughtful decisions. Often, this will mean profiling the code and checking whether a particular part (e.g., a chain of Pandas operations) is a bottleneck or not.
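As a minimal sketch of such a check – with run_pipeline and df standing in for your own code and data – you can profile a chain with the standard library's cProfile and inspect where the time goes:

import cProfile
import pstats

def run_pipeline(df):
    # a placeholder chain; substitute your real pipeline here
    return (
        df.query("x > 1")
          .assign(w=lambda d: d["x"] ** 2)
          .groupby("group")
          .agg({"y": "mean", "w": "mean"})
    )

cProfile.run("run_pipeline(df)", "pipeline.stats")
pstats.Stats("pipeline.stats").sort_stats("cumulative").print_stats(10)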
To benchmark regular Pandas code and the corresponding pipe-based code, we'll use the timeit module (you can read about it here). I will run the benchmarks on a Windows 10 machine with 32 GB of RAM and 4 physical (8 logical) cores.
See the Appendix for the benchmarking code. We will change the code_regular and code_pipe variables, which contain the two pieces of code to be benchmarked. Let's start with the ones from the Appendix, that is:
code_pipe = f'df = df.assign(xy = df["x"] * 3 / (1 + df["y"]*.3))'
code_regular = f'df["xy"] = (df["x"] * 3) / (1 + df["y"]*.3)'
This code compares the performance of the pd.DataFrame.assign() method against the traditional vectorized Pandas assignment. Let's see the results:
Benchmarking with df of 100 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0004555
min_time(t) = 0.0004289
Regular code:
mean_time(t) = 0.0004164
min_time(t) = 0.0003984
Benchmarking with df of 1000 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0005324
min_time(t) = 0.000495
Regular code:
mean_time(t) = 0.0004764
min_time(t) = 0.0004467
Benchmarking with df of 10000 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0012987
min_time(t) = 0.0011558
Regular code:
mean_time(t) = 0.0006378
min_time(t) = 0.0005557
Benchmarking with df of 100000 rows and repeat = 1000
Pipe:
mean_time(t) = 0.0066507
min_time(t) = 0.0060729
Regular code:
mean_time(t) = 0.001402
min_time(t) = 0.0010576
Benchmarking with df of 1000000 rows and repeat = 10
Pipe:
mean_time(t) = 0.0694944
min_time(t) = 0.0687732
Regular code:
mean_time(t) = 0.011838
min_time(t) = 0.0114591
As we can see, the .assign() method is slower than the vectorized Pandas assignment, and the gap grows with the size of the dataframe: for small dataframes the difference is on the order of 10%, but for a million rows the piped version is several times slower.
Above, we benchmarked creating a new column. Now let's benchmark filtering columns:
code_pipe = f'df_ = df.filter(["a", "b", "c", "d", "e"])'
code_regular = f'df_ = df.loc[:, ["a", "b", "c", "d", "e"]]'
And here are the results:
Benchmarking with df of 100 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0002483
min_time(t) = 0.0002267
Regular code:
mean_time(t) = 0.0002808
min_time(t) = 0.0002459
Benchmarking with df of 1000 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0002972
min_time(t) = 0.0002681
Regular code:
mean_time(t) = 0.0003176
min_time(t) = 0.0002926
Benchmarking with df of 10000 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0004878
min_time(t) = 0.0003939
Regular code:
mean_time(t) = 0.0005025
min_time(t) = 0.0004806
Benchmarking with df of 100000 rows and repeat = 1000
Pipe:
mean_time(t) = 0.0013341
min_time(t) = 0.0011952
Regular code:
mean_time(t) = 0.0013245
min_time(t) = 0.0011956
Benchmarking with df of 1000000 rows and repeat = 100
Pipe:
mean_time(t) = 0.0094321
min_time(t) = 0.0090392
Regular code:
mean_time(t) = 0.0095467
min_time(t) = 0.0087992
These results are slightly different: for most dataframe sizes, the .filter() method was actually a touch faster than .loc; only for dataframes of 100,000 rows was its mean time marginally worse. In practice, the two approaches perform about the same here.
Now, let's filter rows:
code_pipe = f'''df_ = df.query("a >= @mean_a")'''
code_regular = f'df_ = df[df["a"] >= mean_a]'
Here are the results:
Benchmarking with df of 100 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0017628
min_time(t) = 0.0016239
Regular code:
mean_time(t) = 0.0001927
min_time(t) = 0.0001791
Benchmarking with df of 1000 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0018353
min_time(t) = 0.0017287
Regular code:
mean_time(t) = 0.0002581
min_time(t) = 0.0002381
Benchmarking with df of 10000 rows and repeat = 10000
Pipe:
mean_time(t) = 0.0024342
min_time(t) = 0.0021453
Regular code:
mean_time(t) = 0.0007169
min_time(t) = 0.0004661
Benchmarking with df of 100000 rows and repeat = 1000
Pipe:
mean_time(t) = 0.00625
min_time(t) = 0.0046697
Regular code:
mean_time(t) = 0.0035322
min_time(t) = 0.0028939
Benchmarking with df of 1000000 rows and repeat = 10
Pipe:
mean_time(t) = 0.0471684
min_time(t) = 0.041649
Regular code:
mean_time(t) = 0.0343789
min_time(t) = 0.0331603
Here, the .query() method was slower in every experiment, with the relative difference largest for small dataframes (for 100 rows it was almost ten times slower) and shrinking as the dataframe grew.
Of course, we need to remember that such benchmarks are affected by multiple factors, and in another round of experiments, we could observe somewhat different results. Therefore, we should not draw firm conclusions about exactly how the gap scales with dataframe size. We can conclude, however, that piped Pandas methods are often slower than the corresponding typical Pandas code – sometimes markedly so (as with .query() on small dataframes), sometimes negligibly or not at all (as with .filter()).
We've benchmarked the three most important piped Pandas methods, namely, .assign(), .filter(), and .query(). Indeed, when I'm creating Pandas pipes, these are the three methods I use most often.
We should treat these results with great caution. They only show that chained Pandas methods are generally slower, sometimes significantly, sometimes negligibly. Nevertheless, in real-life projects, especially when performance is at stake, you should profile your code and see how the particular set of operations performs. A general recommendation for optimizing performance is as follows: while you can make some general assumptions (e.g., chained Pandas operations will most likely be slightly slower), you should always profile your code to see if a particular set of operations creates a bottleneck or not.
These benchmarks focused solely on execution time. This is not because I ignore memory performance, but because my experiments showed that both approaches have more or less the same memory footprints. This doesn't mean that memory footprints will always be the same in the case of long chains – or even different methods. As with execution time, it's always good to double-check memory usage. For this, you can use the pd.DataFrame.memory_usage() method, one of the Python memory-profiling packages (e.g., psutil, memory_profiler, tracemalloc), or even IPython's ipython_memory_usage extension.
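For instance, here is a minimal sketch of such a check – df and the chain below are placeholders for your own data and pipeline – combining pd.DataFrame.memory_usage() with the standard library's tracemalloc:

import tracemalloc

# memory held by the dataframe itself, in bytes
print(df.memory_usage(deep=True).sum())

# memory allocated while running a chain
tracemalloc.start()
result = (
    df.query("x > 1")
      .assign(w=lambda d: d["x"] ** 2)
      .groupby("group")
      .agg({"y": "mean", "w": "mean"})
)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current} B, peak: {peak} B")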
Conclusion
We've discussed the various pros and cons of piping Pandas operations. So, should you use them instead of typical Pandas code? As often, the answer depends on the context. While making a decision, you should consider various aspects of the project, such as:
- Type of code: Is it for analysis, proof of concept (PoC), or production?
- Performance vs. readability: Is performance more crucial than readability?
- Comfort and preference: How comfortable are you with each syntax? Which do you prefer?
- Audience: Who will read the code?
These questions are interrelated, and their answers should be considered together. Here are some guidelines:
- Analysis code: Write in the style you are most comfortable with if the code won't be shared. If it will be shared (e.g., in notebooks), consider the audience (see below). There can be exceptions, however: when you're working with big data, you may need to optimize the code for performance (time, memory, or both, depending on the situation).
- Production code: Balance readability and performance.
- Readability focus: Piping can provide clearer code. On the other hand, if a code reader is unfamiliar with this syntax, they may need to spend some time learning what particular piped methods do and how they work.
- Performance focus: Prefer vectorized operations over piping. Nonetheless, remember to profile the code before making significant changes to improve performance.
- Personal preference: Use piping if it feels natural to you; avoid it if it feels unnatural – unless you want to learn it, in which case a real-life project offers a perfect opportunity to do so.
- Educational code: Consider the project's context to decide between simplicity, readability, and education.
Weigh the pros and cons for your project to decide whether it's piping time. Remember, however, that piping is integral to Python, as Pandas isn't the only framework that enables you to pipe operations. When using Python OOP, you frequently use the dot operator to chain class methods. Thus, don't treat Pandas piping as something atypical or unnatural. The truth is, piping methods is at least as natural for Python as assigning values to temporary variables in consecutive operations.
Footnotes
¹ Actually, the pipe operator, %>%, originates from the magrittr package (https://magrittr.tidyverse.org/), which is now part of the tidyverse (https://www.tidyverse.org/) environment.
Appendix
The code for the benchmarks is below. The two pieces of code to benchmark are kept in the code_regular and code_pipe variables. The script is run from the shell with two command-line arguments:
- The number of rows in the dataframe.
- The repeat value passed to timeit.repeat (kept in the number variable in the code).
For example, this call:
> python bench.py 1000 100_000
will create a Pandas dataframe of 1000 rows and will run the benchmark with a repeat value of 100,000.
Note that the experiments are run using high values of repeat, but the value of number is always 1. This may seem atypical, but I did it on purpose. We're benchmarking operations on mutable dataframes, and in some of the operations, we need to assign the result. Hence, to be consistent and make fair comparisons, we're assigning the result in all benchmarked operations.
Normally, we assign the result of such operations to the very same variable (in the code, it's df). Thus, if the number value were 2 or higher, df would be different in the first run of the operations compared to the next runs. However, if we use a number of 1 and a high value of repeat, df is recreated before each run: the setup code is run, and then the benchmarked code is run only once on the output of that setup. Hence, each run leads to the very same output, which is a requirement of good benchmarks.
Here's the code:
Python">import sys
import warnings
from timeit import repeat
warnings.filterwarnings("ignore", category=DeprecationWarning)
try:
n = int(sys.argv[1])
except IndexError:
n = 1
try:
number = int(sys.argv[2])
except IndexError:
number = 1000
print(f"Benchmarking with df of {n} rows and repeat = {number}")
setup = f"""
import pandas as pd
df = pd.DataFrame({{
letter: list(range({n}))
for letter in 'abcdefghijklmnopqrstuwxyz'
}})
mean_a = df.a.mean()
"""
import pandas as pd
df = pd.DataFrame({
letter: list(range(10))
for letter in 'abcdefghijklmnopqrstuwxyz'
})
code_pipe = f'df = df.assign(xy = df["x"] * 3 / (1 + df["y"]*.3))'
code_regular = f'df["xy"] = (df["x"] * 3) / (1 + df["y"]*.3)'
kwargs = dict(setup=setup, number=1, repeat=number)
t_pipe = repeat(code_pipe, **kwargs)
t_regular = repeat(code_regular, **kwargs)
def report(t: list[float], comment="", dig=7) -> None:
def mean_time(x): return round(sum(x)/len(x), dig)
def min_time(x): return round(min(x), dig)
print(comment)
print(
f"{mean_time(t) = }"
"n"
f"{min_time(t) = }"
)
report(t_pipe, "Pipe:")
report(t_regular, "Regular code:")