How strongly associated are your variables?

Introduction

Feature selection is an important step in any Data Science project. If you're not new to the field, you've probably heard this a million times, and if you're a newbie, I'm sure you've heard it at least once, but I will say it again: if you feed your model garbage, you will get garbage back.

Ok, now that we got that off our chest, let's move on. There are a couple of good ways to select the best features for your model, like running a Random Forest model and then checking the feature_importances_ attribute, using sklearn's SelectKBest, performing statistical tests separately, and other techniques.

These automated tests from sklearn are very handy and excellent options for performing our feature selection quickly. They are an automated way to apply statistical tests like the F-test, correlation, and chi-squared, and to quickly run hypothesis tests so we can choose variables based on the results.

When we're dealing with categorical variables, for example, we can run SelectKBest with the scoring function chi2 to find out whether the p-values are under the threshold for statistical significance, indicating dependency between the variables.

Ho = The variables are independent

Ha = The variables are not independent

A common significance level used is 0.05.

However, the returned values are only the p-value and the test statistic. This means the tool aims to give us just a quick result, such as a p-value under the threshold, confirming we have evidence to reject the null hypothesis and pointing to the two categorical variables being dependent. But it won't tell you how strong that association is.

An easy workaround is to perform Cramer's V test, presented in this post.

Before we continue, let me present the dataset used for the examples in this post. It's the diamonds dataset, an open sample dataset from the Seaborn package.

Python">import seaborn as sns

# Load the dataset
df = sns.load_dataset('diamonds')
Diamonds Dataset from seaborn. Image by the author.
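If you're coding along, you can take a quick look at the first rows yourself (an optional check, showing roughly what the image above displays):

# Peek at the first rows and columns of the dataset
df.head()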

This dataset has observations of diamonds' cut, color, size, carat, and price. Our intention is to check the association between the categorical variables (cut, color, clarity) and the price.

Feel free to import the packages to code along.

import numpy as np
import pandas as pd
import scipy.stats as scs
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

Using Select K Best

Ok, now that we have introduced the dataset and the test to be performed by the SelectKBest tool, let's see how it works and what results it gives.

First, as we will check the association between two categorical variables, let's make the price variable categorical. To do that, I will separate the diamond prices into bins:

  • cheaper: From 0 to 20% under the average
  • on_average: From Avg – 20% to the average
  • high_price: From the Average to Avg + 20%
  • expensive: From Avg + 20% to the max value
# Create a price bin variable
df['price_bins'] = pd.cut(
    df['price'],
    bins= [0, df.price.mean()*0.8, df.price.mean(), df.price.mean()*1.2, np.inf],
    labels= ['cheaper', 'on_average', 'high_price', 'expensive']
    )

This is how it looks:

price_bins variable added. Image by the author.
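If you want to sanity-check the binning, counting the observations per bin is a quick optional step (not part of the original walkthrough):

# Count how many diamonds fall into each price bin
df['price_bins'].value_counts()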

Next, we split the data into X (explanatory) and y (explained).

# Split X and y
X = df.drop(['price', 'x', 'y', 'z', 'depth', 
             'table', 'carat', 'price_bins'], axis=1)
y= df.price_bins

Next, we select the categorical variables and encode their values.

# Select categorical variables
categorical_vars = X.select_dtypes(include='category').columns.to_list()

# Encode the categorical variables
X[categorical_vars] = X[categorical_vars].apply(lambda x: x.cat.codes)
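If you want to keep track of what each integer code means, you can build a mapping from the original categories. This is an optional check, assuming the columns are pandas Categorical, which is how seaborn loads them:

# Optional: map each integer code back to its original category label
code_maps = {col: dict(enumerate(df[col].cat.categories)) for col in categorical_vars}
print(code_maps['cut'])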

With this, we'll be ready to fit the data and extract the results.

# Instance of SelectKBest
fsel= SelectKBest(score_func=chi2, k=3)

# Fit
fsel.fit(X, y)

# Show a dataframe of the results
(
    pd.DataFrame({
    'variable': X.columns,
    'chi2_stat': fsel.scores_,
    'p_value': fsel.pvalues_})
    .sort_values(by='p_value', ascending=False)
)

The resulting data frame is in the next picture.

Results of the SelectKBest tool. Image by the author.
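As a side note, if you had more candidate columns and a smaller k, you could also list which features survived the selection with get_support(). This is just an illustrative usage; here all three columns are kept since k=3:

# Boolean mask of the selected columns, aligned with X.columns
selected = X.columns[fsel.get_support()]
print(selected.to_list())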

That is great! Now we can see that the 3 features are associated with the price bins at the 5% significance level (p-values < 0.05; reject Ho). But how can we tell whether those associations are strong or not?

Let's perform the Cramer's V test and find out.

Cramer's V test

If we do a quick search, we will find out that:

Cramer's V is a measure of association between two categorical variables that returns a value between 0 (weak) and 1 (strong).

To perform it in Python, we need pandas to create a contingency table and scipy to run the Chi² test, which will lead to the final calculation of the V value.

Ok, so from our dataset, let's create a contingency table between cut and our recently created variable price_bins.

# Creating a contingency table
cont_table = pd.crosstab(index= df['cut'], 
                         columns= df['price_bins'])

Next is the resulting contingency table. This is nothing more than a count of how many observations fall into each pair of categories. For example, there are 14,181 diamonds with an "Ideal" cut and a "cheaper" price, and 735 diamonds with a "Premium" cut and a price "on_average".

Contingency table Cut x Price Bins. Image by the author.

To perform the Chi² test now, we will use the function chi2_contingency from scipy, passing it the contingency table. It returns a lot of interesting information: (1) the Chi² statistic; (2) the p-value; (3) the degrees of freedom; and (4) the expected values. As we only need the Chi² statistic for this test, we take the first index: chi_stat = X2[0].

# Chi-square value
X2 = scs.chi2_contingency(cont_table)
chi_stat = X2[0]

# Print X2
X2

[OUT]
(1603.5199669055353,
 0.0,
 12,
 array([[12378.84006303,  1318.4705228 ,  1509.44898035,  6344.24043382],
        [ 7921.51562848,   843.72080089,   965.93248053,  4059.8310901 ],
        [ 6939.87033741,   739.16573971,   846.23277716,  3556.73114572],
        [ 2817.9940304 ,   300.14460512,   343.6200964 ,  1444.24126808],
        [  924.77994067,    98.49833148,   112.76566555,   473.95606229]]))
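As a side note, recent SciPy versions also expose these values as named attributes on the returned object, which can read more clearly than positional indexing (an optional variation; older SciPy releases only return the plain tuple):

# Same values, accessed by name instead of position
res = scs.chi2_contingency(cont_table)
chi_stat = res.statistic    # Chi² statistic
p_value = res.pvalue        # p-value
dof = res.dof               # degrees of freedom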

Now, let's calculate the Cramer's V value. The formula for V is:

V = √[ (X² / N) / (k − 1) ], where X² is the chi-squared statistic, N is the sample size, and k is the minimum between the number of categories in the rows and in the columns.

In Python, the calculation is performed with the next code snippet.

# Size of the sample
N = len(df)
# Minimum dimension: min(number of categories in rows, columns) - 1
minimum_dimension = (min(cont_table.shape)-1)

# Calculate Cramer's V
result = np.sqrt((chi_stat/N) / minimum_dimension)

# Print the result
print(result)

[OUT]
0.09954537514956

Cramer's V strength value for the association between the cut and the price bins is 0.099, or 9.9%, which can be understood as something between a small and a medium effect (see the interpretation table here) for 3 degrees of freedom (minimum_dimension = 3).

For the clarity variable, here's the result:

# Creating a contingency table
cont_table = pd.crosstab(index= df.clarity, 
                         columns= df['price_bins'])

# Chi-square value
X2 = scs.chi2_contingency(cont_table)
chi_stat = X2[0]

# Performing Cramer's V calculation

# Size of the sample
N = len(df)
# Minimum dimension
minimum_dimension = (min(cont_table.shape)-1)

# Calculate Cramer's V
result = np.sqrt((chi_stat/N) / minimum_dimension)

# Print the result
print(result)

[OUT]
0.18476912508901078

For clarity, V = 0.18, which is around medium strength. For color, V = 0.115, also in the small-effect range.
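If you want to compute all three values at once, a short loop repeats the same steps for each variable (a compact sketch of the calculation above):

# Compute Cramer's V for each categorical variable against price_bins
for col in ['cut', 'color', 'clarity']:
    cont_table = pd.crosstab(index=df[col], columns=df['price_bins'])
    chi_stat = scs.chi2_contingency(cont_table)[0]
    minimum_dimension = min(cont_table.shape) - 1
    v = np.sqrt((chi_stat / len(df)) / minimum_dimension)
    print(f'{col}: V = {v:.3f}')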

Before You Go

This subject caught my attention because, indeed, many times we just go for automated solutions like sklearn's SelectKBest and take their p-value for granted as the single criterion to decide which variables to use in our model.

Now that you got to the end of this post, you have another statistical tool to select the best variables for a model.

In summary:

  1. Select two categorical variables
  2. Create a contingency table with pd.crosstab()
  3. Run the scs.chi2_contingency(contingency_table) to collect the Chi² statistic.
  4. Calculate Cramer's V: np.sqrt((chi_stat/N) / minimum_dimension)
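If you plan to reuse this, the four steps fit nicely in a small helper function. This is a minimal sketch; cramers_v is a name introduced here, not a library function:

def cramers_v(x, y):
    """Cramer's V association between two categorical series."""
    cont_table = pd.crosstab(index=x, columns=y)
    chi_stat = scs.chi2_contingency(cont_table)[0]
    n = cont_table.to_numpy().sum()
    minimum_dimension = min(cont_table.shape) - 1
    return np.sqrt((chi_stat / n) / minimum_dimension)

# Example usage
cramers_v(df['cut'], df['price_bins'])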

Code:

Studying/Cramers_V.ipynb at master · gurezende/Studying

If you liked this content, follow my blog for more. Find me on Linkedin.

References

Cramér's V – Wikipedia

How to Interpret Cramer's V (With Examples) – Statology

Contingency Tables, Chi-Squared and Cramer's V

scipy.stats.chi2_contingency – SciPy v1.10.1 Manual

Cramér's V – Beginners Tutorial

Tags: Data Science Feature Selection Python Statistical Test Statistics
