4 Pandas One-Liners That Solve Particular Tasks Efficiently

Author: Murphy  |  Views: 20163  |  Time: 2025-03-22 23:57:58

Third-party libraries are created and developed in response to a need. No one sits down and says "I'm gonna create a tool and wait for circumstances to arise in which others need it". Instead, they realize a problem and think of a solution to help solve it. That's how tools are created.

The same goes for adding new features to existing tools. How well and quickly new features are added depends on the popularity of the tool and the team behind it.

Pandas certainly has a highly active community, which keeps it among the most popular data analysis and cleaning libraries in the data science ecosystem.

Pandas has functions that solve very specific issues and use cases, many of which were presumably requested by the community that actively uses it.

In this article, I'll share 4 operations that you can do in one line of code with Pandas. These helped me solve particular tasks efficiently and surprised me in a good way.


1. Create a dictionary from a list

I have a list of items and I want to see their distribution. To be more specific, I want to see each unique value along with its number of occurrences in the list.

A Python dictionary is a great way to store data in this format. The items would be the dictionary keys and the numbers of occurrences would be the values.

Thanks to the value_counts and to_dict functions, this task can be completed in one line of code.

Here is a simple example to demonstrate this case:

import pandas as pd

grades = ["A", "A", "B", "B", "A", "C", "A", "B", "C", "A"]

pd.Series(grades).value_counts().to_dict()

# output
{'A': 5, 'B': 3, 'C': 2}

We first convert the list to a Pandas Series, which is the one-dimensional data structure of Pandas. Then, we apply the value_counts function to get the unique values with their frequency in the Series. Finally, we convert the output to a dictionary.


2. Create a DataFrame from a JSON file

JSON is a frequently used file format for storing and delivering data. For instance, when you request data from an API, it's highly likely to be delivered in JSON.

When we clean, process, or analyze data, we usually prefer it to be in tabular format (i.e. in a table-like data structure). We can create a Pandas DataFrame from a JSON-formatted object with a single operation thanks to the json_normalize function.

Let's say the data is stored in a JSON file called data.json. We first read it as follows:

import json

with open("data.json") as f:
    data = json.load(f)

data
# output
{'data': [{'id': 101,
   'category': {'level_1': 'code design', 'level_2': 'method design'},
   'priority': 9},
  {'id': 102,
   'category': {'level_1': 'error handling', 'level_2': 'exception logging'},
   'priority': 8}]}

If we pass this variable to the DataFrame constructor, it'll create a DataFrame as follows, which is definitely not a usable format:

df = pd.DataFrame(data)
df
# output: a single "data" column holding each nested record as a raw dict

But if we use the json_normalize function and provide the record path, we'll get a DataFrame in nice and clean format:

df = pd.json_normalize(data, "data")
df
# output: columns id, priority, category.level_1, category.level_2
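To try this without a file on disk, the same nested structure can be defined inline. The dict below mirrors the data.json contents shown above:

```python
import pandas as pd

# Same structure as the data.json example above
data = {
    "data": [
        {"id": 101,
         "category": {"level_1": "code design", "level_2": "method design"},
         "priority": 9},
        {"id": 102,
         "category": {"level_1": "error handling", "level_2": "exception logging"},
         "priority": 8},
    ]
}

# "data" is the record path: the list of records to flatten into rows.
# Nested dicts become dot-separated columns (e.g. category.level_1).
df = pd.json_normalize(data, "data")
```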

3. Explode function

Consider a case where you have a list of items that match a particular record. You need to reformat it in a way that there is a separate row for each item in that list.

In other words, a row whose list column holds n items becomes n separate rows, one per item, with the remaining column values repeated.

You can think of many different ways of solving this task. One of the simplest (may be the simplest) is the explode function. Let's see how it works.

We have the following DataFrame:

df
# a key column and a "data" column whose cells are lists of items

We'll use the explode function and specify the column name to be exploded:

df_new = df.explode(column="data").reset_index(drop=True)
df_new
# each list item now has its own row, with the key value repeated

The reset_index assigns a new integer index to the resulting DataFrame. Otherwise, the index before exploding would be preserved (i.e. all the rows with a key value of A would have an index of 0).
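As a runnable sketch (the "key" and "data" column names and values below are made-up stand-ins for the DataFrame shown in the original images):

```python
import pandas as pd

# Hypothetical data: each key maps to a list of items
df = pd.DataFrame({
    "key": ["A", "B"],
    "data": [[1, 2, 3], [4, 5]],
})

# Each list element gets its own row; reset_index(drop=True) replaces
# the duplicated pre-explode index with a fresh 0..n-1 range
df_new = df.explode(column="data").reset_index(drop=True)
```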


4. Combine first

The combine_first function serves a specific purpose, but it simplifies that task greatly.

The specific case where you'd want to use the combine_first function:

You want to extract a column from a DataFrame. If there are missing values in the column, you want to replace those missing values with a value from another column.

In this regard, it does the same thing as the COALESCE function in SQL.

Let's create a sample DataFrame with some missing values:

df = pd.DataFrame(
    {
        "A": [None, 0, 12, 5, None], 
        "B": [3, 4, 1, None, 11]
    }
)
df
# output
#       A     B
# 0   NaN   3.0
# 1   0.0   4.0
# 2  12.0   1.0
# 3   5.0   NaN
# 4   NaN  11.0

We need the data in column A. If there is a row with a missing value (i.e. NaN), we want it to be filled with the value of the same row in column B.

df["A"].combine_first(df["B"])

# output
0     3.0
1     0.0
2    12.0
3     5.0
4    11.0
Name: A, dtype: float64

As we see in the output, the first and last rows of column A are taken from column B.

If there are 3 columns we want to use, we can chain combine_first calls. The following line of code first checks column A. If there is a missing value, it takes it from column B. If the corresponding row in column B is also NaN, then it takes the value from column C.

df["A"].combine_first(df["B"]).combine_first(df["C"])
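The sample DataFrame above has no column C, so here is a self-contained sketch of the chained version with made-up three-column data:

```python
import pandas as pd

# Hypothetical frame with three columns to coalesce
df3 = pd.DataFrame({
    "A": [None, 1.0, None],
    "B": [None, 5.0, 6.0],
    "C": [7.0, 8.0, 9.0],
})

# Row 0: A and B are NaN, so the value comes from C (7.0)
# Row 1: A is present, so it is kept (1.0)
# Row 2: A is NaN, so the value comes from B (6.0)
filled = df3["A"].combine_first(df3["B"]).combine_first(df3["C"])
```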

We can also use the combine_first function on the DataFrame level. In that case, all the missing values are filled from the corresponding value (i.e. same row, same column) from the second DataFrame.
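A minimal sketch of the DataFrame-level version (the two frames below are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [None, 2.0], "B": [3.0, None]})
df2 = pd.DataFrame({"A": [10.0, 20.0], "B": [30.0, 40.0]})

# Every NaN in df1 is filled from the same row and column of df2;
# non-missing values in df1 are kept as-is
result = df1.combine_first(df2)
```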


Final words

Pandas is one of the most versatile tools I've ever used. From calculating simple statistics to highly complex data cleaning processes, Pandas has always helped me reach a quick solution for the tasks at hand. The only issue I've run into is working with very large datasets, which has long been Pandas' main shortcoming. However, there have been some recent developments to make Pandas operate more efficiently on large datasets, which I believe is good news for everyone who loves working with this great tool.

Thank you for reading. Please let me know if you have any feedback.

Tags: Data Science Machine Learning Pandas Programming Python
