How to Use Map Functions for Data Science in R

Author:Murphy | View: 21422 | Time: 2025-03-23 19:55:48

All data scientists need to repeat code. Whether you're fitting a model to multiple datasets or changing many values at once, running the same code many times over is essential.

There are many ways to repeat code. But while most programmers use loops, there are often more succinct, readable, and efficient alternatives. Enter the map family of functions from the purrr package.

In this article, I'll explain what mapping means, and how to use the map, map2, and pmap functions from the purrr package in R.

What is "Mapping", and How is it Done in R?

The kind of mapping that the purrr package does isn't the geographical type that most people are familiar with. R has great tools for geospatial analysis, but that's not what we're talking about here.

"Mapping" is a specialised term in programming that refers to applying a function repeatedly across a set of arguments.

We can use a simple example to understand this. Let's say we have a list, with each element containing 100 numbers. If we want to calculate the mean of each set of numbers, we can use map to do this in a straightforward way.

library(tidyverse)

# Set seed for reproducibility
set.seed(1234)

# Generate list of numeric values for this example
values_list <- list(rnorm(100, 100, 15), 
                    rnorm(100, 110, 20), 
                    rnorm(100, 75, 15))

# Use map to get the mean of each set of numbers
map(values_list, mean)

Here, we first load in the tidyverse, which contains the purrr package that provides our map functions. Then, we use map to get the mean for each list element.

The basic map function takes two arguments. First, we specify the list or vector we want to apply a function across; in this case, values_list. Second, we name the function we want to apply; mean. This makes map a "higher-order function" because it takes another function as an argument. Running this code gives us a list of three means, one for each element of values_list.
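As a quick check of what map hands back, the sketch below rebuilds the same example and inspects the result. The object name means is only there for illustration.

```r
library(purrr)

set.seed(1234)
values_list <- list(rnorm(100, 100, 15),
                    rnorm(100, 110, 20),
                    rnorm(100, 75, 15))

means <- map(values_list, mean)

# map always returns a list with one element per input element
is.list(means)
length(means) == length(values_list)

# Each element is the result of calling mean() on the matching input
identical(means[[1]], mean(values_list[[1]]))
```

All three checks should return TRUE.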

As well as naming the function we want to apply, we can also use a formula to define the function. This is useful when defining functions with additional arguments, or more complex expressions. In the code below, we use such a formula to get the mean of each list element even when there are missing values in the data.

# Add some missing values to the second element of the list
values_list[[2]][c(2, 24, 93)] <- NA

# Get the means with map again, this time discounting NAs
map(values_list, ~ mean(., na.rm = TRUE))

In this code, we define the function as a formula with the tilde (~) symbol. Inside mean, we refer to the list element used as an argument with the dot (.) symbol. Defining the mean function with a formula allows us to specify extra arguments like na.rm = TRUE, which tells mean to ignore the NA values in values_list. This gives a list of means like the previous example.
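The formula is not the only way to supply extra arguments. As a rough sketch on a small made-up list, the three calls below should give the same result: the formula shorthand, an ordinary anonymous function, and extra arguments passed straight through map. The \(x) shorthand assumes R 4.1 or later.

```r
library(purrr)

# Small made-up list, with one missing value
values_list <- list(c(1, 2, NA), c(4, 5, 6))

# Formula shorthand (.x is an alternative name for the dot)
m1 <- map(values_list, ~ mean(.x, na.rm = TRUE))

# Anonymous function
m2 <- map(values_list, \(x) mean(x, na.rm = TRUE))

# Extra arguments after the function name are passed on to mean
m3 <- map(values_list, mean, na.rm = TRUE)

identical(m1, m2) && identical(m2, m3)
```

Which style to use is mostly a matter of taste; the anonymous function is the most explicit about what the argument is called.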

Why Use Map Functions Instead of Loops?

If you're familiar with for-loops, you might be seeing the similarities between what map is doing and how you could use a loop to solve the problem above. Here is the same operation expressed as a loop:

# Expressing the operation above as a loop
for (i in values_list) {
  print(mean(i, na.rm = TRUE))
}

This loop is very simple. On each iteration, it takes a new element from values_list and prints the mean of the values in that element. Unlike the map example, it only prints the means; to keep them, you would need to store each one in an object as you go. There is even an R package that converts map statements into loops!
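For a closer like-for-like comparison, here is a sketch of a loop that stores its results in a list rather than printing them. The data is made up, and the preallocated means list is just one common way to write this.

```r
values_list <- list(c(1, 2, 3), c(10, NA, 30))

# Create an empty list of the right length, then fill it in
means <- vector("list", length(values_list))
for (i in seq_along(values_list)) {
  means[[i]] <- mean(values_list[[i]], na.rm = TRUE)
}

# The loop and the map call produce the same list
identical(means, purrr::map(values_list, ~ mean(.x, na.rm = TRUE)))
```

The loop needs an index, an output object, and an assignment; map handles all three for you.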

So if you can express map statements as loops, why should you learn about map at all?

There are a few advantages of mapping compared with looping:

map functions are often more concise, taking up one line rather than a minimum of three. This means they can be used inline and within other functions more easily, opening up powerful new possibilities for usage.
Unlike loops, mapping forces you to define functions. This leads to neater, more compartmentalised code that is easy to reuse.
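Relating to the last point, mapping can sometimes be faster than looping, although this isn't guaranteed. Loops in R tend to be slow when they grow an object on every iteration, whereas map allocates its whole output up front; a loop that preallocates its output usually performs similarly.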
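To illustrate the first two points, here is a minimal sketch with a hypothetical helper, mean_no_na, that is defined once and then used both on its own and inline inside another call.

```r
library(purrr)

values_list <- list(c(1, 2, NA), c(4, 5, 6), c(7, 8, 9))

# Hypothetical helper, defined once and reused
mean_no_na <- function(x) mean(x, na.rm = TRUE)

# Used on its own...
means <- map(values_list, mean_no_na)

# ...or inline, inside another call
highest <- max(unlist(map(values_list, mean_no_na)))
```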

Loops still have their place. They can be preferable when they turn out to be the fastest option, or when repeating very complex or unusual operations. That said, the map family can deal with more complex functions as well.

How To Use map Variants: map2 and pmap

The map2 and pmap functions are straightforward extensions of map.

map2

map2 lets you iterate over two inputs in parallel, applying a function that takes two arguments to each pair of elements. For example, you could use it to fit several linear models, each defined by a pair of variable names.

# Define combinations of variables to model
x <- c("mpg", "hp", "wt")
y <- c("hp", "wt", "mpg")

# Apply the lm function to these variables using map2
map2(x, y, ~ lm(get(.x) ~ get(.y), data = mtcars))

Here we make two vectors, x and y, containing the names of variables in the mtcars dataset. Inside our map2 function expression, we refer to the current element of each vector as .x and .y. The get functions simply help lm locate the columns of interest in mtcars using their names. Note that in this formula the elements of x act as the response and the elements of y as the predictor. This results in three linear models being fitted: one for each pair of variables we specified.
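An alternative to get, sketched below, is to build each formula from the variable names with base R's reformulate. The vector names responses and predictors are my own, chosen to make the role of each input explicit, and naming the models by their formula is just a convenience.

```r
library(purrr)

responses  <- c("mpg", "hp", "wt")
predictors <- c("hp", "wt", "mpg")

# Build a formula such as mpg ~ hp for each pair, then fit the model
models <- map2(responses, predictors,
               ~ lm(reformulate(.y, response = .x), data = mtcars))

# Label each model by its formula and pull out the coefficients
names(models) <- paste(responses, "~", predictors)
map(models, coef)
```

One benefit of this approach is that the coefficients are labelled with the real column names rather than get(.y).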

pmap

pmap lets you apply a function across any number of inputs, which is what you need when there are more than two. For instance, we can concatenate the corresponding values from three separate vectors for further analysis.

# Define some example values from a clinical trial
baseline <- c(101, 92.3, 98.2)
treatment <- c(103.3, 92.1, 99.8)
followup <- c(112.1, 95.4, 104.2)

# Concatenate those values together, rowwise
pmap(list(baseline, treatment, followup), ~ c(..1, ..2, ..3))

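When calling pmap, we wrap our inputs in a list. Inside the formula, each input is then referred to as ..1, ..2, ..3 and so on. The result is a list of three numeric vectors, each containing the baseline, treatment, and follow-up values from the same position in the inputs.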
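pmap can also match inputs by name rather than position, which tends to be easier to read once there are several of them. The sketch below assumes the inputs are stored in a named list (trial is a made-up name) and uses an ordinary function whose argument names match the list names.

```r
library(purrr)

trial <- list(
  baseline  = c(101, 92.3, 98.2),
  treatment = c(103.3, 92.1, 99.8),
  followup  = c(112.1, 95.4, 104.2)
)

# Arguments are matched to the names in the list, not to their order
change <- pmap(trial, function(baseline, treatment, followup) {
  followup - baseline
})
change
```

Each element of change is the difference between follow-up and baseline for one position in the inputs.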

How to Control the Output of map Functions

You might have noticed that all the examples so far have returned lists as their output. However, it's possible to get map functions to return different types of vectors and dataframes instead.

Returning to the first example in this article, we can get a list of means using map. But we can also get a flat vector of means by using the map_dbl function.

map_dbl(values_list, ~ mean(., na.rm = TRUE))

The result of the code above: a vector of means, not a list.

Appending "_dbl" to our map function name enables this. There are a few other variants like this that work with map, map2, and pmap. Here is a list of them, and what they do:

"_dbl" returns a double vector
"_lgl" returns a logical (TRUE/FALSE) vector
"_int" returns an integer vector
"_chr" returns a character vector
"_raw" returns a raw vector (deprecated since purrr 1.0.0)
"_dfr" returns a data frame or tibble where the results of each map operation are bound together as rows (superseded in purrr 1.0.0 by map followed by list_rbind)
"_dfc" returns a data frame or tibble where the results of each map operation are bound together as columns (superseded in purrr 1.0.0 by map followed by list_cbind)

Using these function variants to return vectors and tibbles unlocks lots of possibilities. I often use map_dbl and map_chr inside other tidyverse functions like mutate to create new columns based on custom functions. I've also used map_dfr and map_dfc to replace long loops with only one or two lines of code. These are great functions, and they're without a doubt useful enough to learn and remember.
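Here is a rough sketch of that pattern on made-up data: a tibble with a list-column, where each typed variant fills an ordinary column inside mutate. The column names are illustrative only. The last step binds one small tibble per element into a single tibble using map followed by list_rbind, which assumes purrr 1.0.0 or later.

```r
library(dplyr)
library(purrr)
library(tibble)

# Made-up data: a tibble with a list-column of numeric vectors
scores <- tibble(
  group  = c("a", "b", "c"),
  values = list(c(1, 2, 3), c(10, NA, 30), c(5, 5, 5))
)

# Each variant returns a plain vector, so it can fill a column directly
scores <- scores %>%
  mutate(
    mean_value  = map_dbl(values, ~ mean(.x, na.rm = TRUE)),
    n_missing   = map_int(values, ~ sum(is.na(.x))),
    has_missing = map_lgl(values, anyNA),
    value_type  = map_chr(values, typeof)
  )

# A result of the wrong type is an error, not a silent conversion
wrong_type <- tryCatch(map_int(scores$values, typeof),
                       error = function(e) "error")

# One small tibble per element, bound together as rows
ranges <- scores$values %>%
  map(~ tibble(min = min(.x, na.rm = TRUE), max = max(.x, na.rm = TRUE))) %>%
  list_rbind()
```

Because each variant either returns the type it promises or fails, a mistake in the mapped function shows up immediately instead of producing a column of the wrong type.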

Should You Use map Functions Instead of Base-R "apply" Functions?

As we've established, the purrr package provides a whole family of map functions that can do all sorts of things.

However, R also comes with a built-in family of mapping functions: the apply functions, such as lapply, sapply, and vapply. These cover much the same ground as purrr's basic map functions, with some variations in syntax and how they treat data types going in and out.

Although the base-R apply functions are widely used, I prefer the purrr map functions for a few reasons:

It's easier to understand exactly what type of data map functions will output. This isn't always the case with the base-R functions, which can sometimes behave unpredictably based on their input.
They're built to be compatible with other tidyverse packages, which is useful if they're already part of your workflow.
purrr also includes more advanced mapping functions that the apply family can't match. Learning map enables you to pick these up more easily.
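To make the first point concrete, the sketch below puts a few base-R functions next to their closest purrr counterparts. The example list x is made up.

```r
library(purrr)

x <- list(a = 1:3, b = 4:6)

# lapply and map both return lists
identical(lapply(x, mean), map(x, mean))

# vapply is the closest base equivalent of map_dbl
identical(vapply(x, mean, numeric(1)), map_dbl(x, mean))

# sapply chooses its output type from the results,
# so an empty input gives back an empty list
sapply(list(), length)

# map_int always returns an integer vector, even when it is empty
map_int(list(), length)
```

The first two comparisons should return TRUE. The last two show the difference in predictability: sapply returns list() for an empty input, while map_int returns integer(0).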

There are times when you might prefer using apply. If you're developing a package and don't want to introduce extra dependencies, built-in functions are the best option. apply functions can also be faster than their map counterparts, so they can be worth using to repeat computationally expensive operations.

Regardless, the map functions are easier for newcomers to learn, which is why I focused on them here. Besides, once you know how to use map, apply functions will be much easier to understand too.

Learning map functions in R is a great way to extend your Data Science toolset. Mapping often saves space when compared to looping, without sacrificing readability. It also applies to many tasks and makes a great standard approach for efficiently repeating code. So now you know how to use map, try it out, and enjoy one of the most powerful tools in the tidyverse.
