Pandas: apply, map or transform?

As someone who's been using Pandas for a few years now, I've noticed how many people (myself included) resort to using the apply function for almost everything. While this isn't an issue on smaller datasets, the performance cost becomes a lot more noticeable when working with larger amounts of data. While apply's flexibility makes it an easy choice, this article introduces other Pandas functions as potential alternatives.
In this post, we'll discuss the intended use for apply, agg, map and transform, with a few examples.
Table of contents
* map
* transform
* agg
* apply
* Unexpected behavior
An Example
Let's take a data frame comprising the scores of three students in two subjects. We'll work with this example as we go.
import numpy as np
import pandas as pd

df_english = pd.DataFrame(
    {
        "student": ["John", "James", "Jennifer"],
        "gender": ["male", "male", "female"],
        "score": [20, 30, 30],
        "subject": "english"
    }
)
df_math = pd.DataFrame(
    {
        "student": ["John", "James", "Jennifer"],
        "gender": ["male", "male", "female"],
        "score": [90, 100, 95],
        "subject": "math"
    }
)
We'll now concatenate these to create a single dataframe.
df = pd.concat(
    [df_english, df_math],
    ignore_index=True
)
Our final dataframe looks like this:
    student  gender  score  subject
0      John    male     20  english
1     James    male     30  english
2  Jennifer  female     30  english
3      John    male     90     math
4     James    male    100     math
5  Jennifer  female     95     math
We'll explore the use of each function using this dataset.
map
Series.map(arg, na_action=None) -> Series
The map method works on a Series and maps each value based on what is passed as arg to the function. arg can be a function – just like what apply could take – but it can also be a dictionary or a Series.
The na_action parameter essentially lets you decide what happens to NaN values if they exist in the series. When set to "ignore", arg won't be applied to NaN values.
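As a quick sketch of na_action on a toy series (made up for illustration):

```python
import pandas as pd

s = pd.Series(["male", None, "female"])

# With na_action="ignore", missing values are propagated as-is and the
# function is never called on them; without it, str.upper(None) would
# raise a TypeError on this object-dtype series.
encoded = s.map(str.upper, na_action="ignore")
print(encoded.tolist())
```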
For example, if you wanted to replace categorical values in your series using a mapping, you could do something like this:
GENDER_ENCODING = {
    "male": 0,
    "female": 1
}
df["gender"].map(GENDER_ENCODING)
The output is as expected: it returns a mapped value corresponding to each element in our original series.
0    0
1    0
2    1
3    0
4    0
5    1
Name: gender, dtype: int64
Though apply doesn't accept a dictionary, this can still be accomplished with it, but it's not nearly as efficient or elegant.
df["gender"].apply(lambda x:
    GENDER_ENCODING.get(x, np.nan)
)

On a simple test of encoding a gender series with a million records, map was 10x faster than apply.
import random

random_gender_series = pd.Series([
    random.choice(["male", "female"]) for _ in range(1_000_000)
])
random_gender_series.value_counts()
"""
>>>
female 500094
male 499906
dtype: int64
"""
"""
map performance
"""
%%timeit
random_gender_series.map(GENDER_ENCODING)
# 41.4 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
"""
apply performance
"""
%%timeit
random_gender_series.apply(lambda x:
GENDER_ENCODING.get(x, np.nan)
)
# 417 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Since map can take functions as well, any kind of transformation that doesn't depend on other elements – unlike an aggregation like mean, for example – can be passed.
Using things like map(len) or map(str.upper) can really make preprocessing much easier.
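For instance, on a toy series made up for illustration:

```python
import pandas as pd

names = pd.Series(["John", "James", "Jennifer"])

# Element-wise transformations that don't depend on other rows are a
# natural fit for map.
lengths = names.map(len)
upcased = names.map(str.upper)
print(lengths.tolist())  # [4, 5, 8]
print(upcased.tolist())  # ['JOHN', 'JAMES', 'JENNIFER']
```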
Let's assign this gender encoding result back to our data frame and move on to applymap.
df["gender"] = df["gender"].map(GENDER_ENCODING)

applymap
DataFrame.applymap(func, na_action=None, **kwargs) -> DataFrame
I won't spend too long on applymap since it's very similar to map. applymap works elementwise on a dataframe, just like map does, but since it's internally implemented with apply, it can't take a dictionary or a Series as input – only callables are allowed.
try:
    df.applymap(dict())
except TypeError as e:
    print("Only callables are valid! Error:", e)
"""
Only callables are valid! Error: the first argument must be callable
"""
na_action works just like it does in map.
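A minimal sketch on a toy frame; note that recent pandas versions renamed applymap to DataFrame.map, which the snippet accounts for:

```python
import pandas as pd

frame = pd.DataFrame({"a": ["x", "yy"], "b": ["zzz", "w"]})

# pandas 2.1 renamed DataFrame.applymap to DataFrame.map; pick
# whichever this pandas version provides.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap

# The callable is applied to every individual cell of the frame.
cell_lengths = elementwise(len)
print(cell_lengths)
```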
transform
DataFrame.transform(func, axis=0, *args, **kwargs) -> DataFrame
While the previous two functions worked at the element level, transform works at the column level. This means that you can make use of aggregating logic with transform.
Let's continue working with the same dataframe as before.

Let's say we wanted to standardize our data. We could do something like this:
df.groupby("subject")["score"].transform(
    lambda x: (x - x.mean()) / x.std()
)
"""
0 -1.154701
1 0.577350
2 0.577350
3 -1.000000
4 1.000000
5 0.000000
Name: score, dtype: float64
"""
What we're essentially doing is taking the score series from each group and replacing each element with its standardized value. This can't be done with map, since it requires column-wise computation while map only works element-wise.
If you're familiar with apply, you'll know that this behavior can also be implemented with it.
df.groupby("subject")["score"].apply(
    lambda x: (x - x.mean()) / x.std()
)
"""
0 -1.154701
1 0.577350
2 0.577350
3 -1.000000
4 1.000000
5 0.000000
Name: score, dtype: float64
"""
We get essentially the same thing. Then what's the point of using transform?

transform must return a result with the same length along the axis it's applied on.

What this means is that even if transform is used with a groupby operation that returns aggregate values, it assigns those aggregate values to each element.
For example, let's say we wanted to know the sum of the scores of all students for each subject. We could do this with apply like so:
df.groupby("subject")["score"].apply(sum)
"""
subject
english 80
math 285
Name: score, dtype: int64
"""
But here, by aggregating scores by subject, we've lost information on how individual students and their scores relate. If we try doing the same thing with transform, we get something a lot more interesting:
df.groupby("subject")["score"].transform(sum)
"""
0 80
1 80
2 80
3 285
4 285
5 285
Name: score, dtype: int64
"""
So though we worked at a group level, we were still able to keep track of how group-level information relates to row-level information.
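To make this concrete, here's a small sketch (rebuilding the article's toy dataframe) that uses the broadcast sums to compute each student's share of the subject total – something a plain aggregation can't do in one step:

```python
import pandas as pd

df = pd.concat([
    pd.DataFrame({"student": ["John", "James", "Jennifer"],
                  "score": [20, 30, 30], "subject": "english"}),
    pd.DataFrame({"student": ["John", "James", "Jennifer"],
                  "score": [90, 100, 95], "subject": "math"}),
], ignore_index=True)

# transform("sum") broadcasts each subject's total back onto its rows,
# so the division lines up row by row.
df["share"] = df["score"] / df.groupby("subject")["score"].transform("sum")
print(df["share"].round(3).tolist())
# [0.25, 0.375, 0.375, 0.316, 0.351, 0.333]
```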
Because of this behavior, transform throws a ValueError if your logic doesn't return a transformed series, so any kind of aggregation won't work. apply's flexibility, however, ensures that it works just fine even with aggregations, as we'll see in detail in the next section.
try:
    df["score"].transform("mean")
except ValueError as e:
    print("Aggregation doesn't work with transform. Error:", e)
"""
Aggregation doesn't work with transform. Error: Function did not transform
"""
df["score"].apply("mean")
"""
60.833333333333336
"""
Performance
In terms of performance, there's a 2x speedup when switching from apply to transform.
random_score_df = pd.DataFrame({
    "subject": random.choices(["english", "math", "science", "history"], k=1_000_000),
    "score": random.choices(list(np.arange(1, 100)), k=1_000_000)
})

"""
Transform Performance Test
"""
%%timeit
random_score_df.groupby("subject")["score"].transform(
    lambda x: (x - x.mean()) / x.std()
)
"""
202 ms ± 5.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
"""
"""
Apply Performance Test
"""
%%timeit
random_score_df.groupby("subject")["score"].apply(
    lambda x: (x - x.mean()) / x.std()
)
"""
401 ms ± 5.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
"""
agg
DataFrame.agg(func=None, axis=0, *args, **kwargs)
-> scalar | pd.Series | pd.DataFrame
The agg function is a lot easier to understand, since it simply returns an aggregate over the data that's passed to it. So regardless of how your custom aggregator is implemented, the result will be a single value for each column that's passed to it.
We'll now look at a simple aggregation – computing each group's mean over the score column. Notice how we can pass a keyword argument to agg to directly name the aggregated result.
df.groupby("subject")["score"].agg(mean_score="mean").round(2)

Multiple aggregators can be passed as a list.
df.groupby("subject")["score"].agg(
    ["min", "mean", "max"]
).round(2)

agg offers a lot more options for performing aggregations. In the previous two examples, we saw that it lets you perform multiple aggregations in a list and even named aggregations. You can also build custom aggregators and perform multiple specific aggregations over each column, like calculating the mean on one column and the median on another.
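For example, per-column aggregations can be specified with a dictionary; the hours column below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "subject": ["english"] * 3 + ["math"] * 3,
    "score": [20, 30, 30, 90, 100, 95],
    "hours": [1, 2, 3, 4, 5, 6],  # hypothetical study-hours column
})

# A dict maps each column to its own aggregation: mean for score,
# median for hours, computed in a single pass.
summary = df.groupby("subject").agg({"score": "mean", "hours": "median"})
print(summary)
```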
Performance
In terms of performance, agg is moderately faster than apply, at least for simple aggregations. Let's recreate the same dataframe from the previous performance test.
random_score_df = pd.DataFrame({
    "subject": random.choices(["english", "math", "science", "history"], k=1_000_000),
    "score": random.choices(list(np.arange(1, 100)), k=1_000_000)
})

"""
Agg Performance Test
"""
%%timeit
random_score_df.groupby("subject")["score"].agg("mean")
"""
74.2 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
"""
"""
Apply Performance Test
"""
%%timeit
random_score_df.groupby("subject")["score"].apply(lambda x: x.mean())
"""
102.3 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
"""
We see an approximately 30% boost in performance when using agg over apply. When testing multiple aggregations, we get similar results.
"""
Multiple Aggregators Performance Test with agg
"""
%%timeit
random_score_df.groupby("subject")["score"].agg(
    ["min", "mean", "max"]
)
"""
90.5 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
"""
"""
Multiple Aggregators Performance Test with apply
"""
%%timeit
random_score_df.groupby("subject")["score"].apply(
    lambda x: pd.Series(
        {"min": x.min(), "mean": x.mean(), "max": x.max()}
    )
).unstack()
"""
104 ms ± 5.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
"""
apply
For me, this was the most confusing of the ones we've discussed, mainly due to how flexible it is. As we just saw, each of the examples above can be replicated with apply.
Of course, this flexibility comes at a cost: it's noticeably slower, as demonstrated by our performance tests.
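One mode we haven't shown yet is row-wise apply. As a small sketch (the max_score column is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["John", "James"],
    "score": [20, 30],
    "max_score": [40, 40],  # hypothetical column for illustration
})

# With axis=1, apply hands each row to the function as a Series.
# Flexible, but it makes a Python-level call per row, which is why
# vectorized alternatives are usually faster.
df["pct"] = df.apply(lambda row: 100 * row["score"] / row["max_score"], axis=1)
print(df["pct"].tolist())  # [50.0, 75.0]
```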

Unexpected behavior
The other issue with apply's flexibility is that the result is sometimes surprising.
Processing the first group twice
One such issue, which has since been resolved, concerned certain performance optimizations: apply would process the first group twice – once to look for optimizations, and then again when processing each group.
I first noticed this while debugging a custom apply function I had written: when I printed out the group's information, the first group showed up twice. This behavior could lead to silent errors if the function had side effects, since any updates would happen twice on the first group.
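A way to observe this on affected pandas versions is to count invocations with a side effect; the tracked_sum helper below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"subject": ["english", "english", "math"],
                   "score": [20, 30, 90]})

calls = {"n": 0}

def tracked_sum(x):
    # Count invocations: on pandas versions affected by the issue, this
    # runs once more than the number of groups, because the first group
    # is evaluated an extra time during the optimization pass.
    calls["n"] += 1
    return x.sum()

totals = df.groupby("subject")["score"].apply(tracked_sum)
print(totals.to_dict(), "calls:", calls["n"])
```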
When there's only a single group
This issue has been plaguing pandas since at least 2014. It occurs when there's only a single group in the entire column. In such a scenario, even though the apply function is expected to return a series, it ends up yielding a dataframe.
The result is similar to an additional unstacking operation. Let's try to reproduce it. We'll use our original dataframe and add a city column. Let's assume that all three of our students – John, James, and Jennifer – are from Boston.
df_single_group = df.copy()
df_single_group["city"] = "Boston"

Now, let's run a simple group-wise operation – selecting each group's score column – for two sets of groups: one based on the subject column, and the other on city.
Grouping on the subject column, we get a multi-indexed series, as we'd expect.
df_single_group.groupby("subject").apply(lambda x: x["score"])

But when we group by the city column, which as we know has only one group (corresponding to "Boston"), we get this:
df_single_group.groupby("city").apply(lambda x: x["score"])

Notice how the result is pivoted? If we stack this, we get back the expected result.
df_single_group.groupby("city").apply(lambda x: x["score"]).stack()

As of this writing, this issue still hasn't been fixed.
Code
You can find the entire code along with the performance tests here.
Conclusion
The flexibility that apply provides makes it a very convenient choice in most scenarios, but as we saw, it's often more efficient to use something that's been designed for what you need to accomplish. This post covers only part of apply's story – there's so much more to this function – and a future post will continue from here.
This post should have given you an idea of what's possible with Pandas, and I hope this encourages you to make full use of its functionality.