A Simple Trick to Do Your Data Analysis in Seconds

Author:Murphy  |  View: 23364  |  Time: 2025-03-22 23:22:09
Image by 51581 from Pixabay

Exploratory Data Analysis (EDA) plays a crucial role in data science because it allows us to gain insights and understand the patterns within a dataset. In one of my previous articles, I introduced "Pandas GUI", an out-of-the-box Python EDA tool.

PandasGUI – The Ultimate Secret to Effortless Data Analysis

Now, let's turn our attention to "ydata-profiling," a successor to the popular "pandas-profiling" library. "ydata-profiling" offers advanced EDA capabilities and addresses the limitations of its predecessor, making it an invaluable resource for data scientists and analysts.

A Quick Start

Image by Stevenom from Pixabay

As always, before we can start to use the library, we need to install it using pip.

pip install ydata-profiling

To conduct EDA, we need a dataset. Let's use one of the most famous public datasets, the Iris dataset, for this demo. You can get it from the scikit-learn library. However, since we are not going to use scikit-learn for anything else in this demo, it is easier to use the copy I found on the datahub.io website, which can be loaded directly.

https://datahub.io/machine-learning/iris/r/iris.csv

We can easily load the data from the URL into a pandas DataFrame as follows.

import pandas as pd

df = pd.read_csv("https://datahub.io/machine-learning/iris/r/iris.csv")
df.head()
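
Before profiling, it can help to confirm that the columns came through as expected. Here is a minimal sketch, assuming the file uses the column names that appear later in this article. It reads three rows from an inline string instead of the URL so that it runs offline; `csv_text`, `sample_df` and `expected` are names made up for this illustration.

```python
import io

import pandas as pd

# Three rows in the same layout as the datahub.io file (inline so it runs offline).
csv_text = """sepallength,sepalwidth,petallength,petalwidth,class
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

# Column names assumed from the descriptions used later in this article.
expected = ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"]
columns_ok = list(sample_df.columns) == expected
missing_values = int(sample_df.isna().sum().sum())

print(columns_ok)      # True
print(missing_values)  # 0
print(sample_df.dtypes)
```

With the real data, the same two checks can be run on df right after read_csv.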

Then, we can import the ProfileReport class from the ydata-profiling library to generate the EDA report from the pandas DataFrame.

from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Pandas Profiling Report")

Next, we can render this "profile" in different ways. The title we have given will be displayed in some types of output. Now, let's see how well ydata-profiling can do.

1. Output as Notebook Widgets

This is probably the simplest and neatest output. If you just want to view the EDA report inside your notebook, this is highly recommended because it doesn't leave any footprint: everything is deallocated when the Jupyter session is terminated.

To render the output as notebook widgets, we can simply call its to_widgets() method.

profile.to_widgets()

As shown in the screenshot, it doesn't have any fancy styles, and all the graphs and statistics are folded into a small frame with tab navigation. The entire UI is compact and neat. It is enough for day-to-day data analytics work, when we want quick insights from a dataset to decide what to do next.

2. Output as iFrame Inside Notebook

Personally, I don't really recommend this type of output. It simply renders the report in HTML (an interactive web page) and embeds it in the notebook cell output. So, the elements are forced to fit the size of the cell and everything is squeezed into a small area.

However, if you would like to see what it looks like, just call the method to_notebook_iframe().

profile.to_notebook_iframe()

To navigate between sections, we have to scroll down, because all the sections are stacked in one long page inside this iframe. Alternatively, we can click the menu button in the top right corner, as shown in the screenshot below.

Probably the only good thing about this iframe style is that it has larger diagrams than the notebook widget, as shown below.

3. Output as HTML File

If you prefer better readability, HTML clearly beats the notebook widget. We can output the report to an HTML file so that anyone can open it in a web browser. That means the report can be shared easily without losing its interactive features.

To output the report as an HTML file, we can simply run the following code.

profile.to_file("your_report.html")

If you're using Jupyter Notebook like me, you will be able to use your Jupyter UI to find the file. Simply click it and you will get the report opened in a new page.
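
If you are not working in Jupyter, the same file can be opened from a script. Here is a minimal sketch using only the standard library. `report_uri` and `open_report` are hypothetical helper names for this illustration, not part of ydata-profiling.

```python
from pathlib import Path
import webbrowser


def report_uri(path):
    """Turn a (possibly relative) file path into a file:// URI."""
    return Path(path).resolve().as_uri()


def open_report(path):
    """Ask the default web browser to open the generated HTML report."""
    return webbrowser.open(report_uri(path))


uri = report_uri("your_report.html")
print(uri)

# Uncomment once the report has been written by profile.to_file(...):
# open_report("your_report.html")
```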

Now, we can see the report as follows.

This generated HTML file is self-contained, meaning that you can share it with anyone else.

Customizing Metadata

Image by Tom from Pixabay

Now, what if we want to customize the report in particular ways? Let's look at the metadata of the report first.

In terms of metadata, we can change the title, add a description to the dataset, as well as add descriptions to the columns. These configurations will improve the report and provide more necessary information to the readers. Therefore, we may want to add this information when we need to share the report.

The code below will add these metadata to the report.

profile = ProfileReport(
    df, 
    title="Iris Dataset Profiling",
    dataset={
        "description": "This is a famous public dataset.",
        "url": "https://datahub.io/machine-learning/iris/r/iris.csv",
    },
    variables={
        "descriptions": {
            "sepallength": "Length of sepal",
            "sepalwidth": "Width of sepal",
            "petallength": "Length of petal",
            "petalwidth": "Width of petal",
            "class": "Classification of Iris"
        }
    }
)

The title argument sets the title of the report, dataset adds metadata that introduces the dataset, and variables adds descriptions to the columns.
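
Since the column descriptions are keyed by column name, a mistyped key is easy to miss. Here is a minimal sketch of a check that compares the description keys with the column names. `check_descriptions` is a hypothetical helper written for this illustration, and the typo in the example dictionary is deliberate.

```python
def check_descriptions(columns, descriptions):
    """Return (missing, unknown).

    missing: columns that have no description.
    unknown: description keys that match no column.
    """
    columns = list(columns)
    missing = [c for c in columns if c not in descriptions]
    unknown = [k for k in descriptions if k not in columns]
    return missing, unknown


iris_columns = ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"]
descriptions = {
    "sepallength": "Length of sepal",
    "sepalwidth": "Width of sepal",
    "petallength": "Length of petal",
    "petal_width": "Width of petal",  # deliberate typo in the key
}

missing, unknown = check_descriptions(iris_columns, descriptions)
print(missing)  # ['petalwidth', 'class']
print(unknown)  # ['petal_width']
```

With a real DataFrame, df.columns can be passed as the first argument.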

After that, let's generate the report again.

profile.to_file("Iris_report.html")

Now, in the new report, we can see the corresponding changes.

Protect Sensitive Data

Image by Tayeb MEZAHDIA from Pixabay

What if our dataset is sensitive and needs to be kept confidential, but we also want to share the insights generated by ydata-profiling? That's easy. The library lets us either show only a limited sample of rows or redact the rows completely.

For example, if we want to show only 5 rows from the dataset, we can do as follows.

sensitive_description = "Disclaimer: the dataset is sensitive so only 5 sample rows will be shared"

profile = ProfileReport(
    df,
    title="Iris Dataset Profiling (sensitive data)",
    sample={
        "name": "Mock data sample",
        "data": df.sample(5),
        "caption": sensitive_description,
    }
)
profile.to_file("Iris_report.html")

Now, the new report will only show 5 rows as a sample set.

Here is another trick. If we pass an empty DataFrame as the sample data, no rows are revealed, so the whole dataset is redacted.

sample={
    "name": "Mock data sample",
    "data": pd.DataFrame(),  # empty DataFrame: no rows are shown
    "caption": sensitive_description,
}
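
Both variants, a small sample and a fully redacted one, can come from one small helper. Here is a minimal sketch: `build_sample_config` and its `seed` parameter are hypothetical names for this illustration, and the dictionary keys are the ones used in this article. A fixed seed makes the sampled rows reproducible between runs.

```python
import pandas as pd


def build_sample_config(df, n_rows, caption, seed=0):
    """Build the dict passed as ProfileReport(..., sample=...).

    n_rows > 0 picks that many rows reproducibly;
    n_rows == 0 returns an empty DataFrame so no real rows are exposed.
    """
    if n_rows > 0:
        data = df.sample(n=min(n_rows, len(df)), random_state=seed)
    else:
        data = pd.DataFrame()
    return {"name": "Mock data sample", "data": data, "caption": caption}


# Tiny stand-in for the real dataset.
demo = pd.DataFrame({"sepallength": [5.1, 4.9, 4.7], "class": ["a", "a", "b"]})

shared = build_sample_config(demo, 2, "Only 2 sample rows are shared")
redacted = build_sample_config(demo, 0, "All rows are redacted")
print(len(shared["data"]), len(redacted["data"]))  # 2 0
```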

Passing Matplotlib Arguments

Image by Christoph Schütz from Pixabay

All the figures in the report are actually generated using Matplotlib. ydata-profiling allows us to pass arguments to the Matplotlib objects, so we can customise the plots we are going to have in the report.

For example, you may find that all the figures are in SVG format, which makes them difficult to reshare on their own. To get PNG images at 200 dpi instead, we can pass the following arguments.

{"dpi": 200, "image_format": "png"}

So, the code can be as follows.

profile = ProfileReport(
    df,
    title="Iris Dataset Profiling (matplotlib arguments pass on)",
    plot={"dpi": 200, "image_format": "png"},
)

profile.to_file("Iris_report.html")

Then, we will find that all the figures in the report are rendered at 200 dpi and can be downloaded by right-clicking and choosing to save the image. Of course, you can also pass other arguments, such as the colours used in the correlation diagrams.

Summary

Image by Larisa Koshkina from Pixabay

In this article, I've introduced the library ydata-profiling. It is an advanced successor to pandas-profiling, which used to be a famous Python library for EDA productivity. I've demonstrated how to generate the report in different forms and how to enhance its presentation. I believe this is one of the easiest-to-use libraries for boosting our productivity in data analytics work.

Unless otherwise noted all images are by the author
