Pandas: apply, map or transform?


As someone who's been using Pandas for a few years now, I've noticed how many people (myself included) resort to the apply function for almost everything. While this isn't an issue on smaller datasets, the performance problems it causes become much more noticeable when working with larger amounts of data. While apply's flexibility makes it an easy choice, this article introduces other Pandas functions as potential alternatives.

In this post, we'll discuss the intended use for apply, agg, map and transform, with a few examples.


Table of contents

* map
* transform
* agg
* apply
* Unexpected behavior

An Example

Let's take a data frame comprising the scores of three students in two subjects. We'll work with this example as we go.

import random

import numpy as np
import pandas as pd

df_english = pd.DataFrame(
    {
        "student": ["John", "James", "Jennifer"],
        "gender": ["male", "male", "female"],
        "score": [20, 30, 30],
        "subject": "english"
    }
)

df_math = pd.DataFrame(
    {
        "student": ["John", "James", "Jennifer"],
        "gender": ["male", "male", "female"],
        "score": [90, 100, 95],
        "subject": "math"
    }
)

We'll now concatenate these to create a single dataframe.

df = pd.concat(
    [df_english, df_math],
    ignore_index=True
)

Our final dataframe looks like this:

"""
    student  gender  score  subject
0      John    male     20  english
1     James    male     30  english
2  Jennifer  female     30  english
3      John    male     90     math
4     James    male    100     math
5  Jennifer  female     95     math
"""

We'll explore the use of each function using this dataset.


map

Series.map(arg, na_action=None) -> Series

The map method works on a Series and maps each value according to what's passed as arg. arg can be a function – just like what apply could take – but it can also be a dictionary or a Series.

The na_action parameter essentially lets you decide what happens to NaN values if they exist in the series. When set to "ignore", arg won't be applied to NaN values.
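
Here's a quick sketch with a made-up series containing a missing value:

s = pd.Series(["male", None, "female"])

# s.map(str.upper) would raise a TypeError on the missing value;
# with na_action="ignore", missing values pass through untouched
s.map(str.upper, na_action="ignore")

"""
0      MALE
1      None
2    FEMALE
dtype: object
"""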

For example, if you wanted to replace categorical values in your series using a mapping, you could do something like this:

GENDER_ENCODING = {
    "male": 0,
    "female": 1
}
df["gender"].map(GENDER_ENCODING)

The output is as expected: it returns a mapped value corresponding to each element in our original series.

"""
0    0
1    0
2    1
3    0
4    0
5    1
Name: gender, dtype: int64
"""
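
Since arg can also be a Series, the same encoding works with a lookup Series – values are matched against the index of the Series passed in:

gender_codes = pd.Series(GENDER_ENCODING)
df["gender"].map(gender_codes)  # same result as mapping the dictionary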

Though apply doesn't accept a dictionary, this can still be accomplished with it – but it's not nearly as efficient or elegant.

df["gender"].apply(lambda x:
    GENDER_ENCODING.get(x, np.nan)
)
The output of apply is identical to that of map

Performance

On a simple test of encoding a gender series with a million records, map was 10x faster than apply.

Python">random_gender_series = pd.Series([
    random.choice(["male", "female"]) for _ in range(1_000_000)
])

random_gender_series.value_counts()

"""
>>>
female    500094
male      499906
dtype: int64
"""
"""
map performance
"""
%%timeit
random_gender_series.map(GENDER_ENCODING)

# 41.4 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
"""
apply performance
"""
%%timeit
random_gender_series.apply(lambda x:
    GENDER_ENCODING.get(x, np.nan)
)

# 417 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Since map can take functions as well, you can pass any kind of transformation that doesn't depend on other elements – unlike an aggregation like mean, for example.

Using things like map(len) or map(str.upper) can really make preprocessing much easier.
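
For instance, mapping len over the student column gives the length of each name:

df["student"].map(len)

"""
0    4
1    5
2    8
3    4
4    5
5    8
Name: student, dtype: int64
"""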

Let's assign this gender encoding result back to our data frame and move on to applymap.

df["gender"] = df["gender"].map(GENDER_ENCODING)
Encoding the gender with map

applymap

DataFrame.applymap(func, na_action=None, **kwargs) -> DataFrame

I won't spend too long on applymap since it's very similar to map and is internally implemented using apply. applymap works element-wise on a dataframe, just like map does, but since it's internally implemented with apply, it can't take a dictionary or a Series as input – only callables are allowed. (In recent versions of Pandas, applymap has been renamed to DataFrame.map.)

try: 
    df.applymap(dict())

except TypeError as e:
    print("Only callables are valid! Error:", e)

"""
Only callables are valid! Error: the first argument must be callable
"""

na_action works just like it does in map.
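
Before we move on, here's a minimal applymap sketch – applying a function element-wise over the (numeric) score column of our dataframe:

df[["score"]].applymap(lambda x: x / 100)

"""
   score
0   0.20
1   0.30
2   0.30
3   0.90
4   1.00
5   0.95
"""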


transform

DataFrame.transform(func, axis=0, *args, **kwargs) -> DataFrame

While the previous two functions work at the element level, transform works at the column level. This means that you can make use of aggregating logic with transform.

Let's continue working with the same dataframe as before.

Our example, with the encoded gender

Let's say we wanted to standardize our data. We could do something like this:

df.groupby("subject")["score"] 
    .transform(
        lambda x: (x - x.mean()) / x.std()
    )

"""
0   -1.154701
1    0.577350
2    0.577350
3   -1.000000
4    1.000000
5    0.000000
Name: score, dtype: float64
"""

What we're essentially doing is taking the score series from each group, and replacing each element with its standardized value. This can't be done with map, since it requires column-wise computation while map only works element-wise.

If you're familiar with apply, you'll know that this behavior can also be implemented with it.

df.groupby("subject")["score"] 
    .apply(
        lambda x: (x - x.mean()) / x.std()
    )

"""
0   -1.154701
1    0.577350
2    0.577350
3   -1.000000
4    1.000000
5    0.000000
Name: score, dtype: float64
"""

We get essentially the same thing. Then what's the point of using transform?

transform must return a result with the same length along the axis it's applied on.

What this means is that even when transform is used with a groupby operation that returns aggregate values, it broadcasts those aggregate values back to every element of the group.

For example, let's say we wanted to know the sum of the scores of all students for each subject. We could do this with apply like so:

df.groupby("subject")["score"] 
    .apply(
        sum
    )

"""
subject
english     80
math       285
Name: score, dtype: int64
"""

But here, by aggregating the scores by subject, we've lost information about how individual students relate to their scores. If we try the same thing with transform, we get something a lot more interesting:

df.groupby("subject")["score"] 
    .transform(
        sum
    )

"""
0     80
1     80
2     80
3    285
4    285
5    285
Name: score, dtype: int64
"""

So even though we worked at the group level, we were still able to keep track of how group-level information relates to row-level information.
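
This broadcasting is also what makes transform so handy for feature engineering: group-level statistics can be attached directly as new columns. A small sketch (the column names here are my own, and I'm working on a copy to keep our example dataframe unchanged):

tmp = df.copy()
tmp["subject_total"] = tmp.groupby("subject")["score"].transform("sum")
tmp["score_share"] = tmp["score"] / tmp["subject_total"]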

Because of this behavior, transform throws a ValueError if your logic doesn't return a transformed series, so any kind of plain aggregation won't work. apply's flexibility, however, ensures that it works just fine even with aggregations, as we'll see in detail in the next section.

try:
    df["score"].transform("mean")
except ValueError as e:
    print("Aggregation doesn't work with transform. Error:", e)

"""
Aggregation doesn't work with transform. Error: Function did not transform
"""
df["score"].apply("mean")

"""
60.833333333333336
"""

Performance

In terms of performance, there's a 2x speedup on switching from apply to transform.

random_score_df = pd.DataFrame({
    "subject": random.choices(["english", "math", "science", "history"], k=1_000_000),
    "score": random.choices(list(np.arange(1, 100)), k=1_000_000)
})
A 1M row dataframe for testing transform's performance
"""
Transform Performance Test
"""
%%timeit
random_score_df.groupby("subject")["score"].transform(
    lambda x: (x - x.mean()) / x.std()
)

"""
202 ms ± 5.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
"""
"""
Apply Performance Test
"""
%%timeit
random_score_df.groupby("subject")["score"].apply(
    lambda x: (x - x.mean()) / x.std()
)

"""
401 ms ± 5.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
"""

agg

DataFrame.agg(func=None, axis=0, *args, **kwargs) 
    -> scalar | pd.Series | pd.DataFrame

The agg function is a lot easier to understand since it simply returns an aggregate over the data that's passed to it. So regardless of how your custom aggregator is implemented, the result will be a single value for each column that's passed to it.

We'll now look at a simple aggregation – computing each group's mean over the score column. Notice how we can pass a keyword argument to agg to directly name the aggregated result.

df.groupby("subject")["score"].agg(mean_score="mean").round(2)
Mean scores by subject using agg

Multiple aggregators can be passed as a list.

df.groupby("subject")["score"].agg(
    ["min", "mean", "max"]
).round(2)
Min, mean, and max scores by subject using agg

agg offers a lot more options for performing aggregations. In the previous two examples, we saw that it lets you perform multiple aggregations in a list and even name them. You can also build custom aggregators, as well as perform specific aggregations per column – like calculating the mean on one column and the median on another.
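
For instance, here's a sketch of named aggregation with a different function per column (the result column names are my own):

df.groupby("subject").agg(
    mean_score=("score", "mean"),
    max_score=("score", "max"),
    n_students=("student", "count"),
)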

Performance

In terms of performance, agg is moderately faster than apply , at least for simple aggregations. Let's recreate the same dataframe from the previous performance test.

random_score_df = pd.DataFrame({
    "subject": random.choices(["english", "math", "science", "history"], k=1_000_000),
    "score": random.choices(list(np.arange(1, 100)), k=1_000_000)
})
The same dataframe as before, for performance testing
"""
Agg Performance Test
"""

%%timeit
random_score_df.groupby("subject")["score"].agg("mean")

"""
74.2 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
"""
"""
Apply Performance Test
"""

%%timeit
random_score_df.groupby("subject")["score"].apply(lambda x: x.mean())
"""
102.3 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
"""

We see an approximately 30% boost in performance when using agg over apply. Testing with multiple aggregations gives similar results.

"""
Multiple Aggregators Performance Test with agg
"""
%%timeit
random_score_df.groupby("subject")["score"].agg(
    ["min", "mean", "max"]
)

"""
90.5 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
"""
"""
Multiple Aggregators Performance Test with apply
"""
%%timeit
random_score_df.groupby("subject")["score"].apply(
    lambda x: pd.Series(
        {"min": x.min(), "mean": x.mean(), "max": x.max()}
    )
).unstack()

"""
104 ms ± 5.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
"""

apply

For me, this was the most confusing of the functions we've discussed, mainly because of how flexible it is. As we just saw, each of the examples above can be replicated with apply.
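
Part of that flexibility is row-wise logic across multiple columns – something none of the other three functions are designed for. A small sketch:

df.apply(
    lambda row: f"{row['student']} scored {row['score']} in {row['subject']}",
    axis=1
)

"""
0     John scored 20 in english
1    James scored 30 in english
...
dtype: object
"""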

Of course, this flexibility comes at a cost: it's noticeably slower, as demonstrated by our performance tests.

Performance Tests: the apply function is noticeably slower, and understandably so.

Unexpected behavior

The other issue with apply's flexibility is that the result is sometimes surprising.

Processing the first group twice

One such issue, which has since been resolved, had to do with certain performance optimizations: apply would process the first group twice. The first time, it would look for optimizations; then it would process each group, so the first group ended up being evaluated twice.

I first noticed this while debugging a custom apply function I had written: when I printed out each group's information, the first group was displayed twice. This behavior leads to silent errors if there are side effects, since any updates happen twice on the first group.
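
Here's a sketch of how that looked (on current Pandas versions, the print below fires exactly once per group):

def debug_mean(group):
    # on affected versions, this line printed one extra time,
    # because the first group was evaluated twice
    print("processing a group of size", len(group))
    return group.mean()

df.groupby("subject")["score"].apply(debug_mean)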


When there's only a single group

This issue has been plaguing pandas since at least 2014. It happens when there's only a single group in the entire column. In such a scenario, even though the apply function is expected to return a series, it ends up yielding a dataframe.

The result is similar to an additional unstacking operation. Let's try and reproduce it. We'll use our original dataframe and add a city column. Let's assume that all of our three students, John, James, and Jennifer are from Boston.

df_single_group = df.copy()
df_single_group["city"] = "Boston"
Our dataframe with the additional "city" column added

Now, let's calculate the group-wise mean for two sets of groups: one based on the subject column, and the other on city.

Grouping on the subject column, we get a multi-indexed series as we'd expect.

df_single_group.groupby("subject").apply(lambda x: x["score"])
apply returns a multi-indexed series when there are multiple groups

But when we group by the city column, which as we know has only one group (corresponding to "Boston"), we get this:

df_single_group.groupby("city").apply(lambda x: x["score"])
apply returns an unstacked dataframe when there's only one group

Notice how the result is pivoted? If we stack this, we'll get back the expected result.

df_single_group.groupby("city").apply(lambda x: x["score"]).stack()
Stacking our previous result yields an expected result

As of this writing, this issue still hasn't been fixed.


Code

You can find the entire code along with the performance tests here.

BlogCode/PandasApply at main · Polaris000/BlogCode


Conclusion

The flexibility that apply provides makes it a very convenient choice in most scenarios but, as we saw, it's often more efficient to use something that's been designed for what you need to accomplish. This post covers only part of apply's story – there's so much more to this function. A future post will continue from here.

This post should have given you an idea of what's possible with Pandas, and I hope this encourages you to make full use of its functionality.

Tags: Data Science Pandas Performance Python
