5 Functions is All You Need to Manage Your Data with dplyr

Author:Murphy  |  View: 26467  |  Time: 2025-03-23 19:26:39

Those who have worked on real-life tasks with real-life data know that the bigger slice of pie belongs to data wrangling.

What I mean by data wrangling covers operations done to make the data ready for other stakeholders or downstream processes.

Whether you are a data analyst, data scientist, or data engineer, you will need to perform one or more of the following operations as part of your daily routine:

  • Filtering (e.g. get me the sales data in Texas)
  • Sorting (e.g. I want to see the top 10 best seller in last week)
  • Updating (e.g. change the category of these products)
  • Summarizing (e.g. I want to see the average revenue per category)

Data analysis and manipulation tools exist in the Data Science ecosystem to provide efficient ways of performing these operations to be able to keep their existence in the ecosystem.

In this article, we will learn how to handle these tasks using one of the predominant tools in data science: dplyr.

It is a package for R Programming language and described as "a grammar of data manipulation".

We can either install dplyr by itself or use tidyverse, which is a collection of R packages for data science. I prefer the latter because it allows me to use some functions from other packages in tidyverse (e.g. read_csv from readr).

Let's start with importing the libraries and reading a dataset. We will be using a sample dataset that I prepared with mock data. You can download it from my datasets repo.

library(tidyverse)

sales <- read_csv("sales_data_with_stores.csv")

# display the first 6 rows
head(sales)

# A tibble: 6 x 8
  store  product_group product_code stock_qty   cost   price last_week_sales last_month_sales
                                                     
1 Violet PG2                   4187       498 421.    570.                13               58
2 Rose   PG2                   4195       473 546.    712.                16               58
3 Violet PG2                   4204       968 640.    855.                22               88
4 Daisy  PG2                   4219       241 870.   1035.                14               45
5 Daisy  PG2                   4718      1401  12.5    26.6               50              285
6 Violet PG4                   5630       287   5.85    7.59              24              116

The resulting object is a tibble, which is just like a DataFrame or table.


All data manipulation challenges come down to using one or more of the following 5 functions:

  • select
  • mutate
  • filter
  • arrange
  • summarise

1. Select

The select function can be used for selecting columns. It allows us to do so by keeping or dropping columns using their names and types.

Let's go over several examples to see how select works. The following example shows how to select a particular column or columns.

# select product_code and price columns
select(sales, product_code, price)

# A tibble: 1,000 x 2
   product_code   price
             
 1         4187  570.  
 2         4195  712.  
 3         4204  855.  
 4         4219 1035.  
 5         4718   26.6 

# the following does the same
select(sales, c(product_code, price))

We can also do the selection by excluding (dropping) one or more columns using the "!" as follows:

# select all but product_group column
select(sales, !product_group)

# A tibble: 1,000 x 7
   store  product_code stock_qty   cost   price last_week_sales last_month_sales
                                             
 1 Violet         4187       498 421.    570.                13               58
 2 Rose           4195       473 546.    712.                16               58
 3 Violet         4204       968 640.    855.                22               88
 4 Daisy          4219       241 870.   1035.                14               45
 5 Daisy          4718      1401  12.5    26.6               50              285

If we are selecting multiple consecutive columns, it's easier to use ":" to define the range.

# select all columns from store to cost
select(sales, store:cost)

# A tibble: 1,000 x 5
   store  product_group product_code stock_qty   cost
                            
 1 Violet PG2                   4187       498 421.  
 2 Rose   PG2                   4195       473 546.  
 3 Violet PG2                   4204       968 640.  
 4 Daisy  PG2                   4219       241 870.  
 5 Daisy  PG2                   4718      1401  12.5 

We can also use column indices so the following does the same operation as the one above:

# select all columns from store to cost
select(sales, 1:5)

We can also select columns based on their data types. For instance, the following selects the numerical columns.

# select all numeric columns
select(sales, where(is.numeric))

# A tibble: 1,000 x 6
   product_code stock_qty   cost   price last_week_sales last_month_sales
                                           
 1         4187       498 421.    570.                13               58
 2         4195       473 546.    712.                16               58
 3         4204       968 640.    855.                22               88
 4         4219       241 870.   1035.                14               45
 5         4718      1401  12.5    26.6               50              285

By using the same predicate function (i.e. where), we can select non-numeric columns as follows:

# select all non-numeric columns
select(sales, !where(is.numeric))

# A tibble: 1,000 x 2
   store  product_group
             
 1 Violet PG2          
 2 Rose   PG2          
 3 Violet PG2          
 4 Daisy  PG2          
 5 Daisy  PG2 

2. Mutate

The mutate function, as its name suggests, modifies the tibble by updating the existing columns or creating new ones.

For instance, we can increase the price values by 10 percent as follows:

# increase price by 10 percent
mutate(sales, price = price * 1.1)

# A tibble: 1,000 x 8
   store  product_group product_code stock_qty   cost   price last_week_sales last_month_sales
                                                      
 1 Violet PG2                   4187       498 421.    627.                13               58
 2 Rose   PG2                   4195       473 546.    784.                16               58
 3 Violet PG2                   4204       968 640.    940.                22               88
 4 Daisy  PG2                   4219       241 870.   1138.                14               45
 5 Daisy  PG2                   4718      1401  12.5    29.2               50              285

We can also create new columns:

# create price_updated column by increasing price by 10 percent
mutate(sales, price_updated = price * 1.1)

# A tibble: 1,000 x 9
   store  product_group product_code stock_qty   cost   price last_week_sales last_month_sales price_updated
                                                               
 1 Violet PG2                   4187       498 421.    570.                13               58        627.  
 2 Rose   PG2                   4195       473 546.    712.                16               58        784.  
 3 Violet PG2                   4204       968 640.    855.                22               88        940.  
 4 Daisy  PG2                   4219       241 870.   1035.                14               45       1138.  
 5 Daisy  PG2                   4718      1401  12.5    26.6               50              285         29.2

Multiple updates can be done in a single operation. Before doing an example to demonstrate this case, let's mention two very useful features:

  • We can combine different types of operations using the "%>%" operator to create pipes and solve complex tasks involving multiple steps.
  • The new columns are immediately available so that we can use them in the same mutate function.

Here is an example to show the two features mentioned above:

> sales %>%
+     select(cost, price) %>%
+     mutate(
+         price_updated = price * 1.1,
+         profit_updated = price_updated - cost)

# A tibble: 1,000 x 4
     cost   price price_updated profit_updated
                          
 1 421.    570.          627.           206.  
 2 546.    712.          784.           238.  
 3 640.    855.          940.           300.  
 4 870.   1035.         1138.           268.  
 5  12.5    26.6          29.2           16.7 

In the example above, we first select the price and cost columns from sales and create the price_updated column from the price column and then use this new column to create the profit_updated column.


3. Filter

The filter function lets us filter observations (i.e. rows) based on a condition or a set of conditions.

The following example filters rows with a price of more than 1000.

filter(sales, price > 1000)

# A tibble: 5 x 8
  store  product_group product_code stock_qty  cost price last_week_sales last_month_sales
                                                  
1 Daisy  PG2                   4219       241  870. 1035.              14               45
2 Violet PG1                   9692        68 1243  1500.              26               94
3 Violet PG1                   7773       602  976. 1325.              19               60
4 Daisy  PG1                   1941       213  847  1177.              18               72
5 Daisy  PG1                   4140        92  803  1202.              12               24

Let's use a more complex filter:

# rows with a price of more than 1000 and store is Daisy
filter(sales, price > 1000 & store == "Daisy")

# A tibble: 3 x 8
  store product_group product_code stock_qty  cost price last_week_sales last_month_sales
                                                 
1 Daisy PG2                   4219       241  870. 1035.              14               45
2 Daisy PG1                   1941       213  847  1177.              18               72
3 Daisy PG1                   4140        92  803  1202.              12               24

If we have a condition with multiple values, we can use the "%in%" operator.

# rows with product group of PG3, PG4, or PG5 and store is Daisy
filter(sales, product_group %in% c("PG3", "PG4", "PG5") & store == "Daisy")

# A tibble: 302 x 8
   store product_group product_code stock_qty  cost price last_week_sales last_month_sales
                                                  
 1 Daisy PG4                   5634       205 14.2   18.0              14               53
 2 Daisy PG4                   2650       239 59.4  111.               15               38
 3 Daisy PG4                   5647       352  5.85  13.3              37              108
 4 Daisy PG4                   5693       260  7.62  13.3              19               74
 5 Daisy PG4                   5696       260  7.62  13.3              29               98

4. Arrange

The arrange function sorts the rows based on the values in a column or columns. It is similar to SQL order by and Pandas sort_values function.

# order rows by price
arrange(sales, price)

# A tibble: 1,000 x 8
   store  product_group product_code stock_qty  cost price last_week_sales last_month_sales
                                                   
 1 Violet PG4                   5454       -22 0.570  0.66              31               28
 2 Violet PG5                   5621      -123 1.32   0.76              49              100
 3 Daisy  PG4                   2279       525 1.34   1.23              14               18
 4 Rose   PG2                   9372       350 1.41   1.42              16               40
 5 Rose   PG4                   2138       482 1.63   1.61              13               23

By default, the sorting is done in ascending order. We can change it to descending by adding a "-" before the column name.

# order rows by price descending
arrange(sales, -price)

# A tibble: 1,000 x 8
   store  product_group product_code stock_qty  cost price last_week_sales last_month_sales
                                                   
 1 Violet PG1                   9692        68 1243  1500.              26               94
 2 Violet PG1                   7773       602  976. 1325.              19               60
 3 Daisy  PG1                   4140        92  803  1202.              12               24
 4 Daisy  PG1                   1941       213  847  1177.              18               72
 5 Daisy  PG2                   4219       241  870. 1035.              14               45

To sort by multiple columns, we can write the column names separated by comma. The following example sorts the rows by store name and then by last_week_sales (descending).

arrange(sales, store, -last_week_sales)

# A tibble: 1,000 x 8
   store product_group product_code stock_qty  cost price last_week_sales last_month_sales
                                                  
 1 Daisy PG6                    856     52748 31.4  38.0             1883             6880
 2 Daisy PG3                   2481     10543  8.25 14.2              947             4100
 3 Daisy PG6                   3957     10090 26.9  31.3              867             2355
 4 Daisy PG6                    889     21569 13.0  16.1              808             2990
 5 Daisy PG6                   9635     26576  8.53 11.4              673             2484

5. Summarise

It summarizes groups down to one row. Together with the group_by function, we can use it for calculating aggregated values for groups. This operation is similar to SQL group by and Pandas groupby function.

# group by rows in sales by store
by_store <- group_by(sales, store)

# calculate the avg price for each store
summarise(by_store, avg_price = mean(price))

# A tibble: 3 x 2
  store  avg_price
*       
1 Daisy       69.3
2 Rose        60.5
3 Violet      67.8

There are several aggregate functions to extract insights from data and perform in-depth analysis.

Here is a list of them from the official documentation:

  • Center: [mean()](https://rdrr.io/r/base/mean.html), [median()](https://rdrr.io/r/stats/median.html)
  • Spread: [sd()](https://rdrr.io/r/stats/sd.html), [IQR()](https://rdrr.io/r/stats/IQR.html), [mad()](https://rdrr.io/r/stats/mad.html)
  • Range: [min()](https://rdrr.io/r/base/Extremes.html), [max()](https://rdrr.io/r/base/Extremes.html),
  • Position: [first()](https://dplyr.tidyverse.org/reference/nth.html), [last()](https://dplyr.tidyverse.org/reference/nth.html), [nth()](https://dplyr.tidyverse.org/reference/nth.html),
  • Count: [n()](https://dplyr.tidyverse.org/reference/context.html), [n_distinct()](https://dplyr.tidyverse.org/reference/n_distinct.html)
  • Logical: [any()](https://rdrr.io/r/base/any.html), [all()](https://rdrr.io/r/base/all.html)

The functions we have covered in this article help you get almost all your data manipulation and analysis work done. Being able to use them together in a pipe provides further flexibility.

You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you already are, don't forget to subscribe if you'd like to get an email whenever I publish a new article.

Thank you for reading. Please let me know if you have any feedback.

Tags: Artificial Intelligence Data Science Machine Learning Programming R

Comment