All You Need Is Statistics to Analyze Tabular Datasets


There is no need for deep learning or large language models to analyze tabular datasets. I will demonstrate how (simple) statistics and techniques such as PCA can reveal new insights and explainable results.

Photo by Dan Cristian Pădureț on Unsplash

Tabular datasets are one of the most common forms of data and consist of a mix of variables such as binary, categorical, textual, and continuous values. A well-known example is the Titanic dataset. The major challenge in such datasets is how to analyze the variables, because categorical values need different statistics and/or models than continuous values, and so on. In addition, it is key to determine multicollinearity in the dataset, because variables with statistically similar behavior can affect the reliability of models. In this blog post, I will demonstrate the steps to pre-process tabular datasets and how statistical tests, such as the hypergeometric test, can reveal relationships across variables. In addition, I will explain the importance of multiple test corrections and show how to apply Principal Component Analysis (PCA) to a tabular dataset.


The Very First Step Is a Visual Inspection.

Tabular datasets contain measurements of continuous, categorical, and discrete values. Such datasets are usually rich in information and can be used for a broad variety of questions. Various techniques can be applied, such as supervised learning to predict target variables.

The first step to approach the dataset is to use your favorite text editor and visually inspect the dataset. Use common sense to determine whether variables can be related to each other.

A visual inspection is important because variables with statistically similar behavior can affect the reliability of models and thus influence the model output. As an example, if multiple variables are similar in some sense (daily temperature, ice cream sales, beach attendance, parking spot availability), it can lead to multicollinearity.
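To make this concrete, the minimal sketch below quantifies multicollinearity among numeric variables with the variance inflation factor (VIF). The data is simulated for the temperature/ice cream example above; the statsmodels library and the variable names are my own illustration and are not used elsewhere in this post.

# Minimal sketch: quantify multicollinearity with the variance inflation factor (VIF).
# The data is simulated for illustration; statsmodels is assumed to be installed.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
temperature = rng.normal(25, 5, 200)
df = pd.DataFrame({
    'temperature': temperature,
    'icecream_sales': 10 * temperature + rng.normal(0, 5, 200),
    'beach_attendance': 50 * temperature + rng.normal(0, 100, 200),
})

# Add an intercept column and compute the VIF per variable (skipping the constant).
X = sm.add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=df.columns,
)

# A VIF well above ~5-10 signals that a variable is largely explained by the others.
print(vif)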

Multicollinearity makes it difficult to determine the individual effect of a variable on the dependent variable because they carry redundant information.

As a result, coefficients can become unstable, making the model's output less trustworthy. Although various models can handle correlated variables, such as regularization methods like Ridge or Lasso regression, these techniques are not designed to handle the mixed variable types found in tabular datasets. Before any such method can be used, you first need to bring all variables into the same dimensional space. See the next section!


The Harmonization Step Using One-hot Encoding.

A visual inspection of the variables will give you a good sense of the dataset. But when there are tens (or more) of variables, where each variable can have multiple states per category, it becomes time-consuming and error-prone to manually relate variables to each other. At this point, it becomes important to analyze all variables in a structured manner. A straightforward pre-processing step is one-hot encoding. Note: it is called one-hot encoding because categorical values are converted into binary values and set to 1 (hot) where appropriate. Bringing all variables of different types into the same dimensional space is the harmonization step. We can now use models that learn from all features in the dataset, prevent biases that can arise from differences in scale, and ensure that the relationships between variables are correctly captured. An example of one-hot encoding is as follows:

# ----------------------------------------------------------------------
# Example to one-hot encode categorical variables
# ----------------------------------------------------------------------

import pandas as pd

# Sample data
data = {'icecream': ['Strawberry', 'Kiwi', 'Vanille', 'Kiwi', 'Strawberry']}
df = pd.DataFrame(data)
print(df)
#    icecream
# 0  Strawberry
# 1  Kiwi
# 2  Vanille
# 3  Kiwi
# 4  Strawberry

# One-hot encoding
one_hot_encoded_df = pd.get_dummies(df, columns=['icecream'])

# Display the one-hot encoded data
print(one_hot_encoded_df)
#    icecream_Kiwi  icecream_Strawberry  icecream_Vanille
# 0          False                 True             False
# 1           True                False             False
# 2          False                False              True
# 3           True                False             False
# 4          False                False              True

Such an approach can also be applied to continuous variables, but then a "break-point" or cut-off needs to be set. This can be done using domain knowledge or otherwise using the underlying distribution (with confidence intervals). See both examples below:

# ----------------------------------------------------------------------
# Example to one-hot encode continuous variables using domain knowledge
# ----------------------------------------------------------------------

# Import libraries
import pandas as pd

# Function to categorize temperature using domain knowledge
def categorize_temp(temperature):
    if temperature > 15:
        return 'Hot'
    elif temperature < 0:
        return 'Cold'
    else:
        return 'Normal'

# Sample data
data = {'Temperature': [22.5, 25.0, 15, -10, -0.5]}
df = pd.DataFrame(data)

# Apply the function to the Temperature column
df['Category'] = df['Temperature'].apply(categorize_temp)

# Display the DataFrame with the new Category column
print(df)
#    Temperature Category
# 0         22.5      Hot
# 1         25.0      Hot
# 2         15.0   Normal
# 3        -10.0     Cold
# 4         -0.5     Cold

# One-hot encoding
one_hot_encoded_df = pd.get_dummies(df, columns=['Category'])
print(one_hot_encoded_df)
#    Temperature  Category_Cold  Category_Hot  Category_Normal
# 0         22.5          False          True            False
# 1         25.0          False          True            False
# 2         15.0          False         False             True
# 3        -10.0           True         False            False
# 4         -0.5           True         False            False

One-hot encoding of continuous variables using the underlying distribution can be done with the distfit library. More information can be found here:

How to Find the Best Theoretical Distribution for Your Data

pip install distfit
# ----------------------------------------------------------------------
# Example to one-hot encode continuous variables using distribution fit
# ----------------------------------------------------------------------

# Import libraries
from distfit import distfit
import pandas as pd
import numpy as np

# Function to categorize based on the confidence intervals
def discretize_X(X, th_low, th_high):
    if X < th_low:
        return 'Low'
    elif X > th_high:
        return 'High'
    else:
        return 'Medium'

# Variable X
X = np.random.normal(163, 10, 10000)

# Fit the most popular theoretical distributions.
dfit = distfit(distr='popular', bound='both', alpha=0.05)
results = dfit.fit_transform(X)

# Get the thresholds for the confidence intervals
th_low = results['model']['CII_min_alpha']
th_high = results['model']['CII_max_alpha']

# Apply the function to variable X
df = pd.DataFrame(X, columns=['Value'])
df['Category'] = df['Value'].apply(discretize_X, args=(th_low, th_high))

# One-hot encoding
one_hot_encoded_df = pd.get_dummies(df, columns=['Category'])

Determine confidence intervals for continuous variables using the distfit library. Image by author.

Detect Similar Behaving Variables.

At this point, we have converted all variables (categorical, continuous) in our dataset using one-hot encoding. All new features can now be processed and analyzed by various algorithms. From a statistical point of view, there are many ways to test for similarly behaving variables, a.k.a. association testing. Think of the Chi-square test, Fisher's exact test, the hypergeometric test, etc. These tests are typically used when one or both of the variables are ordinal or nominal. I will elaborate on the hypergeometric test, as it can be used to test whether two variables have more overlapping one-hot values than you would expect by chance.
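As a quick illustration of one of these alternatives, the sketch below runs a chi-square test of independence on a toy 2x2 contingency table; the counts are made up for illustration. The hypergeometric test is demonstrated on a real dataset in the next section.

# Minimal sketch: chi-square test of independence on a toy 2x2 contingency table.
# The counts are made up for illustration.
from scipy.stats import chi2_contingency

#              currency USD   other currency
observed = [[900,  50],    # company in US
            [100, 450]]    # company not in US

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)  # a small P-value suggests the two variables are not independent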

For demonstration purposes, I will import the data science salary dataset, which is derived from ai-jobs.net [3]. The dataset contains 11 features for 4134 samples. The variables are shown in the code section below.

# Load datazets library
import datazets as dz
# Get the Data Science salary dataset
df = dz.get('ds_salaries.zip')

# The features are as following
df.columns

# 'work_year'          > The year the salary was paid.
# 'experience_level'   > The experience level in the job during the year.
# 'employment_type'    > Type of employment: Part-time, full time, contract or freelance.
# 'job_title'          > Name of the role.
# 'salary'             > Total gross salary amount paid.
# 'salary_currency'    > Currency of the salary paid (ISO 4217 code).
# 'salary_in_usd'      > Converted salary in USD.
# 'employee_residence' > Primary country of residence.
# 'remote_ratio'       > Remote work: less than 20%, partially, more than 80%
# 'company_location'   > Country of the employer's main office.
# 'company_size'       > Average number of people that worked for the company during the year.

# Make the categorical variables easier to understand.
df['experience_level'] = df['experience_level'].replace({'EN':'Entry-level', 'MI':'Junior Mid-level', 'SE':'Intermediate Senior-level', 'EX':'Expert Executive-level / Director'}, regex=True)
df['employment_type'] = df['employment_type'].replace({'PT':'Part-time', 'FT':'Full-time', 'CT':'Contract', 'FL':'Freelance'}, regex=True)
df['company_size'] = df['company_size'].replace({'S':'Small (less than 50)', 'M':'Medium (50 to 250)', 'L':'Large (>250)'}, regex=True)
df['remote_ratio'] = df['remote_ratio'].replace({0:'No remote', 50:'Partially remote', 100:'>80% remote'}, regex=True)
df['work_year'] = df['work_year'].astype(str)

df.shape
# (4134, 11)

When we visually inspect the variables, common sense tells us that a variable such as company_location is likely related to salary_currency. Such assumptions can now be tested with the hypergeometric test. In other words, we can compute the probability of obtaining a certain number of successes when drawing without replacement from a specific sample size. The null hypothesis is as follows:

There is no relation between company_location and salary_currency.

Equation: the hypergeometric test, P(X = x) = [C(K, x) · C(N−K, n−x)] / C(N, n), where C(·,·) denotes the binomial coefficient, N is the population size (4134), K is the number of success states in the population (3346), n is the sample size/number of draws (3573), and x is the number of successes in the sample (3337).

There are three parts to the hypergeometric test, corresponding to the binomial coefficients in the equation above (a small worked example follows this list):

  1. The number of unique combinations in which the company location is the US.
  2. The number of unique combinations in which the salary currency is USD.
  3. All remaining combinations.
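To make the equation more tangible, the small sketch below evaluates the hypergeometric PMF directly from its three binomial coefficients and compares it against scipy. The numbers are made-up toy values for illustration only and do not correspond to the dataset.

# Minimal sketch: the hypergeometric PMF computed from its three binomial
# coefficients, on made-up toy numbers, compared with scipy.
from math import comb
from scipy.stats import hypergeom

N, K, n, x = 50, 20, 15, 10   # toy population size, successes, draws, overlap

# P(X = x) = C(K, x) * C(N - K, n - x) / C(N, n)
p_manual = comb(K, x) * comb(N - K, n - x) / comb(N, n)
p_scipy = hypergeom.pmf(x, N, K, n)

print(p_manual, p_scipy)  # both give the same probability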

If we now statistically test company_location against salary_currency, we can reject the null hypothesis under alpha=0.05, and we can therefore speak of a statistically significant association between company_location=US and salary_currency=USD. A demonstration of the computation is as follows:

# Import scipy library
from scipy.stats import hypergeom

# Collect the counts
N = df.shape[0] # Total number of samples
K = sum(df['company_location']=='US') # Number of successes in the population
n = sum(df['salary_currency']=='USD')
x = sum((df['company_location']=='US') & (df['salary_currency']=='USD'))

# Compute the association using the hypergeometric test.
# The survival function sf(x-1, ...) gives P(X >= x).
P = hypergeom.sf(x-1, N, n, K)

print(P)
# 0.0

In other words, when we see US, we also see USD. This is great to know, but we only tested two variables and there are many more in the dataset. In the next section, I will demonstrate how to test across all variables in the dataset.


Detect Similar Behaving Variables Across the Entire Dataset.

So far, we have tested whether a particular feature in one category is associated with another feature in a different category. To determine the behavior between features, we need to cross-analyze the entire dataset. From this point, it may seem straightforward to create a for-loop that iterates through the entire dataset and performs all statistical tests. However, there are a few more challenges if we do so:

  1. We also want to automate the processing of continuous variables in the analysis.
  2. We need to be efficient in the number of computations and exclude mutually exclusive features, mainly because of the multiple test correction but also because of the computational burden.
  3. We need to take multiple test corrections into account.
  4. We need to normalize the results in the presence of missing values.
  5. We want to visualize the results in a manner that allows us to interpret them.

All these steps are combined in a library named HNet. I created this method because it allows a fast way of getting insights into tabular datasets. The architecture of HNet is a multi-step process (figure below), roughly: processing, statistical testing, and then multiple test correction.

Schematic overview of HNet. Image by Author.

The final output of HNet is an adjacency matrix that contains the corrected P-values, which represent the associations between pairs of vertices. The adjacency matrix is ideal for examination in the form of a network or heatmap.


Multiple Test Correction.

When performing hundreds or thousands of statistical tests, we need to apply a multiple test correction to prevent spurious results. The default multiple test method in HNet is Holm. Various other false discovery rate (FDR) and family-wise error rate (FWER) methods are available as options. Finally, we can filter for significant edges by declaring an edge significant under alpha=0.05 (default).
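As a minimal sketch of what such a correction does, the example below applies the Holm method to a handful of raw P-values using statsmodels; the P-values are made up and statsmodels is not part of HNet itself.

# Minimal sketch: Holm correction on made-up raw P-values using statsmodels.
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.27, 0.60])

# method='holm' is a FWER method; e.g., 'fdr_bh' would be an FDR alternative.
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05, method='holm')

print(pvals_corrected)  # adjusted P-values
print(reject)           # True where the test remains significant under alpha=0.05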


Hands-on Demonstration Using The Salary Dataset.

Let's import the salary dataset, which is a tabular dataset. We can easily load the entire dataset into HNet, and it will automatically pre-process the input features, perform typing (categorical, continuous, etc.), and create a one-hot dense array. HNet works in an unsupervised manner and computes the statistical associations across all variables and subcategories. After the automated multiple test correction step, all results are stored and can be visualized. See the code section below.

pip install hnet
# Import libraries
from hnet import hnet
import datazets as dz

# Get the data science salary dataset
df = dz.get('ds_salaries.zip')

# Initialize
hn = hnet(alpha=0.05,
          y_min=10,
          multtest='holm',
          dtypes=['cat', 'cat', 'cat', 'cat', 'num', 'cat', 'num', 'cat', 'cat', 'cat', 'cat'])

# Perform the analysis
results = hn.association_learning(df, verbose=4)

# [df2onehot] >Set dtypes in dataframe..
# [df2onehot] >Total onehot features: 84
# [hnet] >Association learning across [84] categories.
# 100%   > 84/84 [00:14<00:00,  5.73it/s]
# [hnet] >Multiple test correction using holm
# [hnet] >Dropping salary
# [hnet] >Dropping salary_in_usd
# -----------------------------------------------------
# [hnet] >Total number of associatons computed: [23256]
# -----------------------------------------------------
# [hnet] >Computing category association using fishers method..
# 100%   > 11/11 [00:00<00:00, 175.45it/s][hnet] >Fin.

This results in a matrix of 84 one-hot encoded features. By default, HNet removes one-hot encoded features that are seen 10 times or fewer (y_min=10).
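As a minimal sketch of what this filtering amounts to (not HNet's internal code), dropping rarely occurring one-hot columns from any boolean one-hot DataFrame could look as follows; the tiny demo DataFrame is made up for illustration.

# Minimal sketch (not HNet's internal code): drop one-hot columns that are
# "hot" in y_min or fewer samples, mirroring the y_min setting.
import pandas as pd

df_demo = pd.DataFrame({'city': ['A'] * 12 + ['B'] * 3})
one_hot_df = pd.get_dummies(df_demo, columns=['city'])

y_min = 10
counts = one_hot_df.sum(axis=0)
one_hot_filtered = one_hot_df.loc[:, counts > y_min]

print(one_hot_filtered.columns.tolist())  # 'city_B' is dropped (only 3 occurrences)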

# The output is a dict that contains the following keys:
# simmatP    : Adjacency matrix containing P-values between variable associations.
# simmatLogP : -log10(P-value) of the simmatP.
# labx       : Labels that are analysed.
# dtypes     : dtypes that are set for the labels.
# counts     : Relative counts for the labels based on the number of successes in the population.
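As a minimal sketch (not part of the HNet API itself), the -log10(P) adjacency matrix can be queried directly with pandas to list the strongest associations, assuming simmatLogP is a square pandas DataFrame as described above.

# Minimal sketch: list the strongest associations from the -log10(P) adjacency
# matrix, assuming results['simmatLogP'] is a square pandas DataFrame.
import numpy as np

adjmat = results['simmatLogP']

# Keep the upper triangle to avoid listing each (i, j) pair twice.
upper = adjmat.where(np.triu(np.ones(adjmat.shape, dtype=bool), k=1))

# Stack into an edge list and sort by association strength.
edges = upper.stack().rename('minus_log10_P').reset_index()
edges = edges.sort_values('minus_log10_P', ascending=False)

# Edges above -log10(0.05) are significant under alpha=0.05.
print(edges[edges['minus_log10_P'] > -np.log10(0.05)].head(10))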

Deep Examination Using Networks.

All relationships across the variables are stored in simmatP and simmatLogP and can be used to examine the relationships between variables and to remove features that are highly associated (those with a low P-value or a high -log10(P) value). We can explore the results using the interactive, stand-alone network created with D3blocks. The network can be examined by breaking edges on their statistical association value using a slider. The slider is based on the -log10(P) value, with a (default) minimum of -log10(0.05). More details about D3blocks can be found in the following blog:

D3Blocks: The Python Library to Create Interactive and Standalone D3js Charts.

# Make static plot
hn.plot(summarize=False, figsize=(50, 40))

# Make interactive network plot
hn.d3graph(summarize=False)

# Create interactive heatmap
hn.d3heatmap(summarize=False, fontsize=8)

Increasing the slider stepwise preserves the most significant relationships. These are often the variables with more similar behavior.

A network of statistically associated variables is created using D3Blocks. Moving the slider to the right breaks edges on the -log10(P) value and thus preserves the strongest associations between the variables. The interactive demo can be found here.


Interactive heatmap.

The use of a heatmap is a good choice when the network becomes a giant hairball. The effectiveness of a matrix diagram depends on the order of rows and columns: if related nodes are placed close to each other, it is easier to identify clusters and bridges. However, path-following is more difficult in a matrix view than in a node-edge network. In this example, the colors depict clusters computed by a community-detection algorithm. This heatmap is also created using [D3Blocks](https://towardsdatascience.com/d3blocks-the-python-library-to-create-interactive-and-standalone-d3js-charts-3dda98ce97d4).

Heatmap created using D3Blocks. Image by Author.

Addressing Feature Importance.

With HNet, the variables can also be ranked by their feature importance. The -log10(P) values of the one-hot encoded variables per category are divided by the number of significant edges per node. A high score indicates that a variable is strongly associated with many nodes in the network. Here we see that company_location, employee_residence, and salary_currency are the top three variables.
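A minimal sketch of this ranking idea, computed directly from the -log10(P) adjacency matrix, is shown below; it is an approximation of the described scoring, not HNet's exact implementation.

# Minimal sketch: approximate the ranking by summing -log10(P) per node and
# dividing by the number of significant edges per node (not HNet's exact code).
import numpy as np

adjmat = results['simmatLogP']
threshold = -np.log10(0.05)

significant = adjmat > threshold
score = adjmat[significant].sum(axis=1) / significant.sum(axis=1).clip(lower=1)

print(score.sort_values(ascending=False).head(5))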

Exclude Variables and Save Computation Time.

When a dataset contains hundreds of variables with subcategories, the analysis can become a computational burden. To speed up the computation, input variables such as identifiers and names can be excluded from the analysis. Alternatively, we can black-list variables to exclude them from the analysis, or white-list variables to limit the search to only those edges. The resulting model will then only contain edges that involve the white-listed variables.

# Four methods of filtering are possible, which can save computation time.

# black_list : Exclude the listed nodes from the analysis or plots.
# white_list : Only include the listed nodes in the analysis or plots.
# threshold  : Associations (edges) are filtered on -log10(P) > threshold. The threshold should range between 0 and the maximum value of -log10(P).
# min_edges  : Nodes are only shown if they contain at least a minimum number of edges.
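A hypothetical usage sketch is shown below; the keyword names follow the list above, but check the HNet documentation for the exact signatures and which functions accept them.

# Hypothetical usage sketch; the keyword names follow the list above, but
# check the HNet documentation for the exact signatures.
hn.d3graph(summarize=False,
           black_list=['job_title'],   # exclude this variable from the plot
           threshold=10,               # only keep edges with -log10(P) > 10
           min_edges=2)                # hide nodes with fewer than 2 edges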

PCA Analysis On Tabular Data Using One-Hot Encoding.

A second type of analysis that can be used for tabular datasets is PCA. It helps to determine how samples are related to each other based on the loadings, and it also allows visual inspection. More details about PCA can be found here:

What are PCA loadings and how to effectively use Biplots?

See the code section below on how to perform the PCA analysis on a tabular dataset:

# Install libraries
pip install df2onehot
pip install pca
# Import libraries
from df2onehot import df2onehot
from pca import pca
import datazets as dz

# Load dataset
df = dz.get('ds_salaries.zip')

# Store salary in separate target variable.
y = df['salary_in_usd']

# Remove redundant variables
df.drop(labels=['salary_currency', 'salary', 'salary_in_usd'], inplace=True, axis=1)

# Make the categorical variables easier to understand.
df['experience_level'] = df['experience_level'].replace({'EN':'Entry-level', 'MI':'Junior Mid-level', 'SE':'Intermediate Senior-level', 'EX':'Expert Executive-level / Director'}, regex=True)
df['employment_type'] = df['employment_type'].replace({'PT':'Part-time', 'FT':'Full-time', 'CT':'Contract', 'FL':'Freelance'}, regex=True)
df['company_size'] = df['company_size'].replace({'S':'Small (less than 50)', 'M':'Medium (50 to 250)', 'L':'Large (>250)'}, regex=True)
df['remote_ratio'] = df['remote_ratio'].replace({0:'No remote', 50:'Partially remote', 100:'>80% remote'}, regex=True)
df['work_year'] = df['work_year'].astype(str)

# One-hot encoding and removing multicollinearity to prevent the dummy variable trap.
dfhot = df2onehot(df,
                  remove_multicollinearity=True,
                  y_min=5,
                  verbose=4)['onehot']

# Initialize
model = pca(normalize=False)
# Fit model using PCA
model.fit_transform(dfhot)

# Make biplot
model.biplot(labels=df['job_title'],
             s=y/500,
             marker=df['experience_level'],
             n_feat=10,
             density=True,
             fontsize=0,
             jitter=0.05,
             alpha=0.8,
             color_arrow='#000000',
             arrowdict={'color_text': '#000000', 'fontsize': 32},
             figsize=(40, 30),
             verbose=4,
             )

The PCA results and the loadings of the features are shown in the biplot below. In total, there are 4134 samples in the 2D space, and the distance between two points describes their similarity based on the 115 one-hot encoded features. The closer two points are, the more similar they are in some way. A density map with a red glow is shown in the background of the plot, which emphasizes high- and low-density areas. The size of each point reflects the salary: higher salaries correspond to larger points. The color is based on job_title, and the marker is set to experience_level.

Biplot of the data science salary dataset. Samples are colored on the job title, markers are set to experience level, and dot size is based on the salary in USD. (image by the author).

The 1st principal component contains 18.9% of the explained variance, and the 2nd principal component accounts for 14.7%. This indicates that the first two components capture a substantial portion of the variance in the data, making them of interest for further interpretation of sample relationships and loadings. The loadings (black arrows) illustrate which original features contribute to the variance in the principal components. The largest loading comes from the category work_year, which separates the data into two cigar-like shapes (from the bottom left to the top right), while experience level and company size create the long tail of samples extending towards the top right corner. A more in-depth analysis, which also uses t-SNE and clustering approaches, can be found here:

Create and Explore the Landscape of Roles and Salaries in Data Science


Final Words.

Analyzing a tabular dataset can be a challenge, but first creating a one-hot dataset opens up the possibility of various analyses. I demonstrated how to detect statistically significant associations in tabular datasets using HNet and how to analyze them with PCA. Both methods are available open source and can be used to quickly gain insights into tabular datasets. The great part of HNet is that it handles all the steps, from pre-processing the dataset to statistical testing and creating interactive plots. Getting insights into the statistical associations makes it easier to understand the multicollinearity that can occur in the dataset. This means we get a better intuition of whether two or more independent variables behave very similarly, making it challenging to determine the individual effect of each variable on the dependent variable.

Stay Safe. Stay Frosty.

Cheers, E.


If you found this article helpful, you are welcome to follow me because I write more about data science! If you are thinking of taking a Medium membership, you can support my work by using my referral link. It is the same price as a coffee but this allows you to read unlimited articles every month!



References

  1. Taskesen, E., HNet: Graphical Hypergeometric Networks, arXiv:2005.04679, 2020.
  2. Taskesen, E., HNet: Graphical Hypergeometric Networks [Computer software], 2019.
  3. https://ai-jobs.net/salaries/download/ (CC0: Public Domain)
  4. Taskesen, E., D3Blocks: The Python Library to Create Interactive and Standalone D3js Charts, Medium, 2022.
  5. Taskesen, E., Creating beautiful stand-alone interactive D3 charts with Python, Medium, 2022.
  6. Taskesen, E., What are PCA loadings and how to effectively use Biplots?, Medium, 2022.
  7. Shwartz-Ziv, R., Armon, A., Tabular Data: Deep Learning is Not All You Need, arXiv:2106.03253.
