A Guide to 21 Feature Importance Methods and Packages in Machine Learning (with Code)

Image created by the author with DALL-E

"We are our choices." -Jean-Paul Sartre

We live in the era of artificial intelligence, driven largely by the remarkable advances in Large Language Models (LLMs). As important as it is for an ML engineer to learn about these new technologies, it is equally important to master the fundamental concepts of model selection, optimization, and deployment. Just as important is the input to all of the above: the data features. Data, like people, have characteristics, called features. With people, you must understand their unique characteristics to bring out the best in them; the same principle applies to data. Specifically, this article is about feature importance, which measures the contribution of a feature to the predictive ability of a model. We need to understand feature importance for several essential reasons:

  • Time. Having too many features slows down model training and also model deployment. The latter is particularly important in edge applications (mobile, sensors, medical diagnostics).
  • Overfitting. If our features are not carefully selected, the model might overfit, i.e., learn the noise, too.
  • Curse of dimensionality. Many features mean many dimensions, and that makes data analysis exponentially more difficult. For example, k-NN classification, a widely used algorithm, is greatly affected by increasing dimensionality.
  • Adaptability and transfer learning. This is my favorite reason and actually the reason for writing this article. In transfer learning, a model trained on one task can be used on a second task with some fine-tuning. Having a good understanding of your features in the first and second tasks can greatly reduce the fine-tuning you need to do.

We will focus on tabular data and discuss twenty-one ways to assess feature importance. One might wonder: "Why twenty-one techniques? Isn't one enough?" It is worth discussing all twenty-one because each one has unique characteristics that are well worth learning about. I will flag why a particular technique is worth knowing in two ways: (a) sections titled "Why this is important" and (b) highlighting the word unique, to indicate that I am talking about a special, distinguishing characteristic.

The techniques we will discuss come from two distinct areas of machine learning: interpretability and feature selection. Specifically, we will discuss the following:

Interpretability Python packages. These libraries help to make a model's decision-making process more transparent by providing insights into how input features affect the model's predictions. We will discuss the following: OmniXAI, Shapash, Dalex, InterpretML, and Eli5.

Feature selection methods. These methods focus on reducing the model's features by identifying the most informative features, and they generally fall into the filter, embedded, and wrapper categories. The characteristics of each category will be discussed in the next section. From each category, we will discuss the following:

  • Wrapper methods: Recursive Feature Elimination, Sequential Feature Selection, Boruta algorithm.
  • Embedded methods: Logistic Regression, Random Forest, LightGBM, CatBoost, XGBoost, SelectFromModel.
  • Filter methods: Mutual information, MRMR algorithm, SelectKBest, Relief algorithm.
  • Other: Featurewiz package, Selective package, PyImpetus package.

Data

To demonstrate the above feature-importance-computation techniques, we will use tabular data related to heart failure prediction from Kaggle: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/data

The dataset has 918 rows and 12 columns corresponding to the following features:

  • ‘Age', ‘Sex' (M, F), ‘ChestPainType' (TA, ATA, NAP, ASY), ‘RestingBP',
  • ‘Cholesterol', ‘FastingBS' (0, 1), ‘RestingECG' (Normal, ST, LVH), ‘MaxHR',
  • ‘ExerciseAngina' (Y, N), ‘Oldpeak', ‘ST_Slope' (Up, Flat, Down),
  • ‘HeartDisease' (0, 1). This is the target variable: 0 indicates the absence of heart disease, and 1 indicates its presence.

The dataset has no missing values, and the target variable is relatively balanced, with 410 ‘0' instances and 508 ‘1' instances. Five of the features are categorical: ‘Sex', ‘ChestPainType', ‘ExerciseAngina', ‘RestingECG', and ‘ST_Slope'. These features are encoded with the pandas one-hot-encoding method:

<script src="https://gist.github.com/theomitsa/ef8e81c6ddc1bf00b6bcd31515b75675.js"></script>

Then, the data is split into training and test sets. Finally, scikit-learn's StandardScaler is fitted on the training set and applied to the numerical features of both the training and test sets. Now, we are ready to proceed to feature importance assessment.
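For reference, here is a minimal sketch of the preprocessing just described (the file name, test-set fraction, and random seed are assumptions, not necessarily the values used in the article's gists):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed file name for the Kaggle heart-failure CSV
df = pd.read_csv("heart.csv")

# One-hot encode the five categorical features with pandas
categorical = ["Sex", "ChestPainType", "ExerciseAngina", "RestingECG", "ST_Slope"]
df = pd.get_dummies(df, columns=categorical)

X = df.drop(columns="HeartDisease")
y = df["HeartDisease"]

# Assumed 80/20 split; stratify to preserve the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale only the truly numerical columns; fit on train, transform both
numeric = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]
scaler = StandardScaler()
X_train[numeric] = scaler.fit_transform(X_train[numeric])
X_test[numeric] = scaler.transform(X_test[numeric])
```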

Feature Importance Assessment

A. Interpretability Packages

Why they are important

In recent years, the interpretability of machine learning algorithms has attracted significant attention. Machine learning algorithms have found use in many areas, such as finance, medicine, and environmental modeling. This broad use of ML algorithms by people who are not necessarily ML experts calls for more transparency because of:

  • Trust issues. Black boxes make people nervous and unsure as to whether they should trust them.
  • Regulatory and ethical concerns. Governments around the world are increasingly concerned about AI use and are passing legislation to ensure that AI systems make their decisions in a fair way, without any biases. Understanding how ML systems work under the hood is an important prerequisite to fair and unbiased AI.

Interpretability packages are based on the model-independent interpretation frameworks SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) [2]. SHAP uses a game-theoretic approach; its per-prediction (local) Shapley values can also be aggregated into global explanations. LIME, on the other hand, produces local explanations by fitting a simple surrogate model around each prediction. Both offer transparency through the idea of "explainers." An explainer is a wrapper-type object: it wraps around a model and provides a window into the internal intricacies of the model.
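To make the explainer idea concrete, here is a minimal, self-contained SHAP sketch on synthetic data (the article's gists apply the same pattern to the heart data; the class-indexing line hedges against differences between shap versions):

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)        # the explainer 'wraps' the model
shap_values = explainer.shap_values(X)       # local, per-sample contributions
# Older shap versions return a list (one array per class); newer ones a 3-D array
sv = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
global_importance = np.abs(sv).mean(axis=0)  # aggregating local values yields global importance
print(global_importance)
```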

A.1 Shapash Package

Shapash [1] is one such interpretability package. Below, we see the implementation of Shapash's ‘SmartExplainer', which takes as input an ML model of type RandomForestClassifier and the feature names. Then, Shapash's ‘compile' function is invoked, which is the ‘workhorse' of the whole process: (a) it binds the model to the data, (b) it computes the feature importances, and (c) it prepares the data for visualization. Finally, we launch the interactive web application.

<script src="https://gist.github.com/theomitsa/e637d60bde4f333f1e9d13b928700390.js"></script>

Figure 1, from Shapash's application interface, shows the feature importances. The length of the horizontal bar corresponds to the importance of the feature. Thus, ‘ST_Slope_up' is the most important feature, and ‘chest_pain_typeTA' is the least important.

Figure 1. Feature importances

Figure 2 shows an important type of plot provided by Shapash, the feature contribution plot. The feature examined in the plot is ‘ST_Slope_up'. The upper part contains cases where the contribution of ‘ST_Slope_up' is positive, whereas the bottom part contains cases where its contribution is negative; the upper part corresponds to cases where ‘ST_Slope_up' is 0, and the bottom to cases where it is 1. When we click on one of the circles in the middle of the displayed structures, the following information is shown: the case number, the ‘ST_Slope_up' value, the predicted class, and the contribution of ‘ST_Slope_up'.

Figure 2. Feature contributions

Figure 3 shows the local explanations for instance 131, where the predicted class is 1 with a probability of 0.8416. Bars to the right show a positive contribution, and bars to the left show a negative contribution to the result. ‘ST_Slope_up' has the highest positive contribution, while ‘max_heart_rate' has the highest negative contribution.

Figure 3. Local explanations

In summary, Shapash is a very useful package to know because (a) it offers a great interface where the user can gain a deep understanding of global and local explanations, and (b) it offers the unique feature of displaying feature contributions across cases.

A.2 The OMNIXAI Package

OMNIXAI (Open-source eXplainable AI) [3], like Shapash, also offers visualization tools, but its unique strength lies in the significant breadth of its explanation techniques. Specifically, it offers methods to explain predictions for various data types: tabular data, text, and images. Some of its unique features are (a) the NLPExplainer, (b) the bias examination module, (c) Morris sensitivity analysis for tabular data, (d) the VisionExplainer for image classification, and (e) counterfactual explainers.

The code below shows the creation of an OMNIXAI explainer. The essential steps are: (a) creation of an OMNIXAI-specific data type (‘Tabular') to hold the data; (b) data pre-processing through the ‘TabularTransform'; (c) data splitting into training and test sets; (d) training of an XGBClassifier model; (e) inversion of the data back to their original format; (f) setting up a ‘TabularExplainer' of the XGBClassifier with both SHAP and LIME methods, applied to ‘test_instances' [130–135]; and (g) generation and display of the predictions.

<script src="https://gist.github.com/theomitsa/3c44903aeac2f190589ccf7078f9a5ac.js"></script>

Figure 4 shows the aggregate local explanations for instances [130:135] using LIME. The green bars on the right show positive contributions to class 1, whereas the red bars on the left show negative contributions to class 1. The longer the bar, the more significant the contribution.

Figure 4. LIME explanations

Figure 5 shows the aggregate local explanations for instances [130:135] using SHAP. The meaning of the green/red bars is the same as in the graph above.

Figure 5. SHAP explanations

A.3 The InterpretML Package

The InterpretML package [4] has the unique feature of ‘glassbox models', which are inherently explainable models.

The implementation of such an inherently explainable model, the ‘ExplainableBoostingClassifier', is shown in the code snippet below. Global explanations, as well as local explanations for instance 43, are also set up.

<script src="https://gist.github.com/theomitsa/d1e12672a783d7acc271700a1f1f2665.js"></script>

Figure 6 shows the computed global feature importances.

Figure 6. Global feature importances

Figure 7 shows the computed local explanations for instance 43. Most features contribute positively to the prediction of class 1; only ‘Cholesterol' and ‘FastingBS' contribute negatively.

Figure 7. Local explanations

A.4 The Dalex Package

The Dalex package [5] is a library designed to explain and understand machine learning models. Dalex stands for "Descriptive mAchine Learning EXplanations." It has the following unique characteristics:

  • It is compatible with both R and Python.
  • The Aspects module. This allows us to explain a model taking into account feature inter-dependencies.
  • The Fairness module. It allows us to evaluate the fairness of a model.

The code snippet below shows the implementation of Dalex's ‘Explainer.'

<script src="https://gist.github.com/theomitsa/ad27fc281e5859758faa99b89002d7e9.js"></script>

The feature importances produced by Dalex are shown below in Figure 8.

Figure 8. Feature importances

A.5 The Eli5 Package

The final interpretability package we will discuss is Eli5 [6]. It has the following unique features:

  • The permutation importance measure. In this technique, the values of each feature are randomly shuffled, and the resulting drop in model performance is measured; the bigger the drop, the more important the feature (a from-scratch sketch of this idea follows the list).
  • It works with text data. Specifically, it provides a ‘TextExplainer' that can explain predictions of text classifiers.
  • It is compatible with Keras.
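Here is a from-scratch sketch of the permutation importance idea (illustrative code, not Eli5's implementation; it assumes a fitted classifier and NumPy arrays):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Average drop in accuracy when each column of X is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])      # break the link between feature j and the target
            drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)    # bigger drop => more important feature
    return importances
```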

In the code snippet below, the ‘PermutationImportance' method is applied to the Support Vector Classification (‘svc') estimator.

<script src="https://gist.github.com/theomitsa/1ac80d2d0383bad3808c43118a2d3903.js"></script>

Figure 9 shows the computed feature importances for the ‘svc' estimator.

Figure 9. Feature importances

B. Feature Selection Techniques

Wrapper Methods

As the name suggests, these algorithms wrap the feature selection process around a machine learning algorithm. They repeatedly evaluate subsets of features until they find the subset that yields the best performance according to a criterion, which can be model accuracy, the number of selected features, information gain, etc.

Why they are important

The very nature of these algorithms (criterion optimization, comprehensive search) means they can perform very well at selecting the best features. Another very useful characteristic is that they consider feature interactions. However, that same nature also makes them computationally intensive and prone to overfitting. So, if you do not have computational limitations and accuracy is essential, they are a good choice.
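The following toy sketch shows the wrapper idea in its purest (exhaustive) form, which also makes the computational cost obvious: with n features there are 2^n - 1 subsets to score (illustrative code on synthetic data):

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000)

# Score every non-empty subset of features with 5-fold cross-validation
best_score, best_subset = -np.inf, None
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        score = cross_val_score(model, X[:, subset], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, round(best_score, 3))
```

The methods discussed next replace this exhaustive search with greedy or iterative strategies.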

B.1 Sequential Feature Selection

Sequential Feature Selection (SFS) evaluates feature subsets in two modes: forward selection, which starts with no features and adds them iteratively, and backward elimination, which starts with all features and removes them one by one.

The code snippet below shows the implementation of SFS wrapped around a ‘KNeighborsClassifier' model. It also shows how to output the selected features and their names.

<script src="https://gist.github.com/theomitsa/623fe0b01be2f7234655e41a78598741.js"></script>

The selected features are:

<script src="https://gist.github.com/theomitsa/0ed369c390082538b4d9a0240ca78983.js"></script>

B.2 The Boruta Algorithm

Boruta is one of the most effective feature selection algorithms, and, most impressively, it does not require any input from the user [7]. It is based on the brilliant idea of ‘shadow features': randomized (shuffled) duplicates of all original features. A random forest classifier is then applied to assess the importance of each real feature against these shadow features, and the process is repeated until all important features are identified.
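A from-scratch sketch of a single Boruta-style round makes the shadow-feature idea concrete (BorutaPy repeats such rounds and adds statistical testing on top; the code below is illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def boruta_round(X, y, seed=0):
    """One round: do real features beat their shuffled 'shadow' copies?"""
    rng = np.random.default_rng(seed)
    X_shadow = rng.permuted(X, axis=0)   # independently shuffle each column
    rf = RandomForestClassifier(random_state=seed).fit(np.hstack([X, X_shadow]), y)
    imp = rf.feature_importances_
    n = X.shape[1]
    return imp[:n] > imp[n:].max()       # a 'hit': more important than the best shadow

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
print(boruta_round(X, y))                # boolean 'hit' mask over the real features
```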

The snippet below shows the implementation of Boruta using the BorutaPy package and the selected features.

<script src="https://gist.github.com/theomitsa/efd538e061ab561428b088f56e5bffe6.js"></script>

The selected features from Boruta are:

<script src="https://gist.github.com/theomitsa/fcb9e079685036f5815775dff62db3e6.js"></script>

B.3 The RFECV Algorithm

RFECV (Recursive Feature Elimination with Cross-Validation) is a feature selection technique that iteratively removes the least important features from a model, using cross-validation to find the best subset of features. The code implementation is shown in the snippet below.

<script src="https://gist.github.com/theomitsa/d4a766ae8e63df5e1bfdfdf79eb9ec0d.js"></script>

The selected features are:

<script src="https://gist.github.com/theomitsa/ba4bcd3b78d0b72eb3b4d145c2edf161.js"></script>

Embedded Methods

These are algorithms with a built-in ability to compute feature importances or select features, such as Random Forest and lasso regression, respectively. An important note about these methods: most of them do not directly select features. Instead, they compute feature importances, which can be used in a post hoc process to choose features. One such post hoc process is ‘SelectFromModel', discussed in section B.9.

Why they are important

High-dimensional data are very common today in the form of unstructured text, images, and time series, especially in bioinformatics, environmental monitoring, and finance. The greatest advantage of embedded methods is their ability to handle high-dimensional data. The reason is that they do not have separate modeling and feature-selection steps: feature selection and modeling are combined into a single step, which leads to a significant speed-up.

B.4 Logistic Regression

Logistic regression is a statistical method used for binary classification. The coefficients of the model relate to the importance of the features: each weight indicates the direction (positive or negative) and the strength of a feature's effect on the log odds of the target variable. A larger absolute value of a weight indicates that the corresponding feature is more important in predicting the outcome. The code snippet below shows the creation of the logistic regression model. The hyperparameters ‘C' (regularization strength) and ‘max_iter' are tuned with scikit-learn's ‘GridSearchCV.'

<script src="https://gist.github.com/theomitsa/f7c243c348b338c2aa0f1ec474ac9239.js"></script>

The logistic regression coefficients are shown below.

<script src="https://gist.github.com/theomitsa/be3f93430509a01672a2912bdf238cb4.js"></script>

B.5 Random Forest

Random Forest is an ensemble machine learning method used for classification and regression. It works by building many decision trees and merging their results. It uses the bagging technique, where sampling-with-replacement is applied to the dataset and each sample is used to train a separate decision tree. A significant feature of Random Forest is its ability to compute feature importances during the training process. One way is permutation-based: randomize a feature (while keeping all other features constant) and check how much the error increases. The most common criterion, however, and the one behind scikit-learn's built-in importances, is the mean decrease in impurity (MDI) when a feature is used to split a node [8]. The code snippet below shows the scikit-learn ‘RandomForestClassifier', where the hyperparameters have been determined, as above, using scikit-learn's ‘GridSearchCV.'

<script src="https://gist.github.com/theomitsa/d2f91a726e1c1e404451cf4157cead6d.js"></script>

The code for the computation and display of feature importances is shown below. The computed feature importances are shown in Figure 10.

<script src="https://gist.github.com/theomitsa/5defb7ee2309e267e96cc814ad99aee7.js"></script>
Figure 10. Feature importances

B.6 The LightGBM algorithm

LightGBM (Light Gradient Boosting Machine) is a gradient-boosting algorithm that combines speed and performance. Developed by Microsoft, it is known for handling large datasets and for its efficiency in terms of memory and speed. Some of its unique features are (a) Gradient-based One-Side Sampling (GOSS), the ability to filter out data instances with small gradients and focus on more critical instances, and (b) Exclusive Feature Bundling (EFB): LightGBM reduces the number of features by bundling mutually exclusive features (those that are very infrequently non-zero at the same time). In this way, the algorithm handles high-dimensional data more efficiently [9].

The snippet below shows the implementation of LightGBM. The hyperparameters (‘learning_rate', ‘max_depth', and ‘n_estimators') were chosen using scikit-learn's ‘GridSearchCV.' The feature importances computed by LightGBM are shown in Figure 11.

<script src="https://gist.github.com/theomitsa/76d0359acd1fa899b1934c25269db6eb.js"></script>
Figure 11. Feature importances

B.7 The XGBoost Algorithm

XGBoost, which stands for eXtreme Gradient Boosting, is an advanced implementation of gradient boosting. It has the following unique characteristics:

  • It can effectively use all available CPU cores or clusters to build trees in parallel, and it utilizes cache optimization.
  • Compared to LightGBM, XGBoost grows trees depth-wise (level-wise), while LightGBM grows trees leaf-wise. This makes XGBoost less efficient on large datasets.

The code snippet below shows the implementation of XGBoost, where the hyperparameters [10] listed below were chosen using Bayesian optimization implemented in the ‘hyperopt' package. These hyperparameters are:

  • ‘gamma' (min loss reduction for a split),
  • ‘min_child_weight' (min required sum of weights of all observations in a child)
  • ‘max_depth' (max tree depth)
  • ‘reg_lambda' (L2 regularization term)

Finally, the hyperparameter ‘reg_alpha,' which controls L1 regularization, was set manually after experimentation.
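A minimal sketch of how such a hyperopt search can be set up is shown next (the search ranges and the ‘reg_alpha' placeholder are illustrative, not the values behind Figure 12; ‘X_train' and ‘y_train' come from the split in the Data section). The actual implementation follows in the gist below.

```python
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Illustrative search space for the four tuned hyperparameters
space = {
    "gamma": hp.uniform("gamma", 0, 5),
    "min_child_weight": hp.uniform("min_child_weight", 0, 10),
    "max_depth": hp.choice("max_depth", [3, 4, 5, 6, 7]),
    "reg_lambda": hp.uniform("reg_lambda", 0, 5),
}

def objective(params):
    # reg_alpha (L1) is fixed manually, as described above; 0.5 is a placeholder
    model = XGBClassifier(**params, reg_alpha=0.5, eval_metric="logloss")
    score = cross_val_score(model, X_train, y_train, cv=5).mean()
    return {"loss": -score, "status": STATUS_OK}   # hyperopt minimizes the loss

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=Trials())
print(best)
```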

<script src="https://gist.github.com/theomitsa/3bd6d4bd14969d85ffd5e4b9893b16de.js"></script>

Figure 12 shows the feature importances. Note that some importances are set to zero because of L1 regularization.

Figure 12. Feature importances

B.8 The CatBoost Algorithm

CatBoost [11] is a high-performance, open-source gradient-boosting library, particularly well-suited for categorical data. It does not require any pre-processing of categorical variables, such as label encoding or one-hot encoding; instead, it handles categorical variables natively. CatBoost employs symmetric trees as its base predictors and supports GPU acceleration. Regarding the CatBoost implementation in Python, it is important to note that all non-numeric features must be declared as type ‘category.' Then, as shown in the snippet below, the categorical features are passed to the model's fit function.
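The pattern looks roughly like this (a sketch assuming the raw, non-encoded dataframe ‘df' from the Data section; the full version is in the gist below):

```python
from catboost import CatBoostClassifier

# Declare the non-numeric features as type 'category'
cat_cols = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]
df[cat_cols] = df[cat_cols].astype("category")

X_raw = df.drop(columns="HeartDisease")
y = df["HeartDisease"]

# Pass the categorical columns to fit; no one-hot encoding needed
model = CatBoostClassifier(verbose=0)
model.fit(X_raw, y, cat_features=cat_cols)
print(dict(zip(X_raw.columns, model.get_feature_importance())))
```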

Figure 13 shows the feature importances computed by CatBoost. Note that the feature names are those of the original dataset (not the one-hot-encoded ones): because CatBoost handles categorical data natively, the input to the algorithm was the original, non-encoded data.

<script src="https://gist.github.com/theomitsa/00f5008cd136ce62efd1fef2064dbe81.js"></script>
Figure 13. Feature importances

B.9 The SelectFromModel Method

‘SelectFromModel' is offered by scikit-learn's feature_selection module. Its unique characteristic is that it is a meta-transformer that can be used with any model that assigns importances to features, either through a ‘coef_' or a ‘feature_importances_' attribute.

In contrast to the previous embedded methods we discussed, which just computed feature importances, ‘SelectFromModel' actually selects features. The snippet below shows the code for feature selection using this method.

<script src="https://gist.github.com/theomitsa/7e09a573c774cd5d07da8c4d1a73b340.js"></script>

The selected features are:

<script src="https://gist.github.com/theomitsa/d25fb5004662a435eff812845659a710.js"></script>

Filter Feature Selection Methods

These methods are independent of any machine learning model. They typically evaluate each feature with statistical measures, such as the correlation or mutual information between the target and predictor variables.

Why they are important

Filter methods are straightforward and very easy to compute and, therefore, are used as an initial feature selection step in many fields with large amounts of data, such as bioinformatics [12], environmental studies, and healthcare research [13].

B.10 Mutual Information

Mutual information measures the reduction in uncertainty (entropy) in one variable, given knowledge of the other. The mutual information between the predictors and the target variable is computed using scikit-learn's ‘mutual_info_classif.' The mutual information score of each predictor is shown in Figure 14.
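For reference, the quantity computed for each predictor X and the target Y is

$$I(X;Y) = H(Y) - H(Y \mid X) = \sum_{x,y} p(x,y)\,\log \frac{p(x,y)}{p(x)\,p(y)},$$

which is zero when X and Y are independent and grows as X becomes more informative about Y.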

<script src="https://gist.github.com/theomitsa/9341a773867af52fda2edcb50086d1e9.js"></script>
Figure 14. Mutual information scores.

B.11 The MRMR Algorithm

MRMR stands for Maximum-Relevance-Minimum-Redundancy. As the name indicates, the MRMR algorithm selects features that are (a) maximally relevant, i.e., strongly correlated with the target variable, and (b) minimally redundant, i.e., highly dissimilar to one another. Redundancy can be computed using correlation or mutual information measures, and relevance can be calculated using the F-statistic or mutual information [15]. MRMR is a minimal-optimal method because it selects a group of features that, together, have maximum predictive power [14]. This is in contrast to the Boruta algorithm, discussed in section B.2, which is an all-relevant method because it identifies all features relevant to the model's prediction.
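To make the relevance/redundancy trade-off concrete, here is a from-scratch sketch of greedy MRMR selection using the F-statistic for relevance and mean absolute correlation for redundancy (an illustrative variant, not necessarily the mrmr library's exact scheme):

```python
import pandas as pd
from sklearn.feature_selection import f_classif

def mrmr_select(X: pd.DataFrame, y, k=5):
    """Greedy MRMR: maximize F-statistic relevance / mean |correlation| redundancy."""
    relevance = pd.Series(f_classif(X, y)[0], index=X.columns)
    corr = X.corr().abs()
    selected = [relevance.idxmax()]                # start with the most relevant feature
    while len(selected) < k:
        remaining = [c for c in X.columns if c not in selected]
        redundancy = corr.loc[remaining, selected].mean(axis=1)
        score = relevance[remaining] / redundancy  # high relevance, low redundancy wins
        selected.append(score.idxmax())
    return selected
```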

The code snippet below shows the implementation of MRMR with the ‘mrmr' Python library.

<script src="https://gist.github.com/theomitsa/2c4a851a1abf9b5c0e146846b3f81776.js"></script>

The minimal-optimal set of selected features is shown below:

<script src="https://gist.github.com/theomitsa/a3605ce6bd0584ba156b2fb12bf8c732.js"></script>

B.12 The SelectKBest Method

As the name suggests, this algorithm selects the K best features according to a user-defined score, where K is also user-defined. The algorithm can be applied to both classification and regression tasks, and it offers a variety of scoring functions. For classification, for example, the user can apply (a) ‘f_classif', which computes the ANOVA F-value, (b) ‘mutual_info_classif', which computes mutual information, and (c) ‘chi2', which computes the chi-squared statistic between the predictors and the target variable [16]. The code snippet below shows the computation of SelectKBest for K=5 and score function ‘f_classif'.

<script src="https://gist.github.com/theomitsa/605120fdd7769b47b505fc4ea794147a.js"></script>

Figure 15 below shows the scores (importances) of the features according to the scoring function ‘f_classif.' Note that although we chose K=5, Figure 15 displays the scores for all features.

Figure 15. Feature importances.

B.13 The Relief Algorithm

Relief's unique characteristic is the following idea: for a data sample, find its closest neighbor in the same class (the ‘near hit') and its closest neighbor in the other class (the ‘near miss'). Features are weighted according to how similar their values are at the ‘near hit' and how different they are at the ‘near miss.' Relief is particularly useful in biomedical informatics because of its sensitivity to complex feature associations [17]. Here, we used an extension of the original Relief algorithm, the ReliefF algorithm, which can be applied to multi-class classification; the original Relief algorithm applies only to binary classification.
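To make the weighting rule concrete, here is a from-scratch sketch of the original binary Relief (illustrative code assuming NumPy arrays, not the kydavra implementation):

```python
import numpy as np

def relief(X, y, n_iter=100, seed=0):
    """Reward features that match the near hit and differ from the near miss."""
    rng = np.random.default_rng(seed)
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # scale to [0, 1]
    w = np.zeros(X.shape[1])
    for i in rng.integers(0, len(X), n_iter):
        dists = np.abs(X - X[i]).sum(axis=1)                   # L1 distance to sample i
        dists[i] = np.inf                                      # exclude the sample itself
        hit = np.argmin(np.where(y == y[i], dists, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(y != y[i], dists, np.inf))   # nearest other-class sample
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter
```

The snippet below shows the invocation of the ‘ReliefFselector' from the ‘kydavra' Python package.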

<script src="https://gist.github.com/theomitsa/4c2a79738b56c1142a01273e6faa5c23.js"></script>

The selected features from the algorithm are shown below.

<script src="https://gist.github.com/theomitsa/f2733cdb933dbc54e9fab10f1129e973.js"></script>

Misc Feature Selection Techniques

In this final category, we will discuss the Featurewiz, Selective, and PyImpetus packages.

Why they are important

Each package is important for its own unique reasons: (a) Featurewiz is a very convenient AutoML package that selects features with one line of code; (b) the Selective package offers a wide variety of filter and embedded feature selection methods that can be invoked with one line of code; (c) the PyImpetus package is based on an idea very different from all the other feature selection techniques: the Markov blanket.

B.14 The Featurewiz Package

This is an automated feature selection tool [18][19]. Its invocation is as simple as shown in the code snippet below. Under the hood, it uses the ‘SULOV' algorithm (Searching for Uncorrelated List Of Variables), which builds on the MRMR algorithm described in section B.11: ‘SULOV' selects the features with the highest mutual information scores and the smallest correlation among them. Then, the features are passed recursively through XGBoost to find the best subset.

<script src="https://gist.github.com/theomitsa/cf9275895ecdd0b09d6d3579c087db9e.js"></script>

The features selected from Featurewiz are shown below.

<script src="https://gist.github.com/theomitsa/a31e541dbb2cd5a16bbd452eb4f13944.js"></script>

B.15 The Selective Feature Selection Library

This library provides numerous feature selection methods for classification and regression tasks [20]. Some of the methods offered are correlation, variance, statistical analysis (ANOVA F-test, chi-square, etc.), linear methods (linear regression, lasso and ridge regularization, etc.), and tree-based methods (Random Forest, XGBoost, etc.). An example of this library's usage is shown below.

<script src="https://gist.github.com/theomitsa/b6d32fdfc8c66e7243fd172907c6b38a.js"></script>

The selected features using the ‘TreeBased' method are:

<script src="https://gist.github.com/theomitsa/65b60560b50a1d06f9cfdf81e3c3040c.js"></script>

B.16 The PyImpetus Package

The unique idea behind this package is the Markov blanket, the minimal feature set needed to predict the target variable [21][22]. It can be used for both classification and regression tasks. Its implementation for classification is shown below.

<script src="https://gist.github.com/theomitsa/15078a348191133ad55de217938b03c8.js"></script>

Figure 16 shows the selected features and their relative importance.

Figure 16. Selected features and their relative importance

Discussion and Conclusion

In this article, we discussed a broad spectrum of feature importance assessment techniques from two distinct realms: interpretability and feature selection. Given the diversity of the discussed algorithms, a question arises naturally: "How similar are the features selected as most important by the various algorithms?"

Let us take a look at Table 1 below. The table has two columns, corresponding to the features ‘ST_Slope_up' and ‘ST_Slope_flat'. The rows correspond to the algorithms and packages we used in the article. The numbers 1, 2, and 3 indicate whether the feature was selected as the best, second best, or third best by the algorithm.

<script src="https://gist.github.com/theomitsa/fef3c8725d75f00b327b2896cad446c5.js"></script>

As discussed in the article, some algorithms simply output a set of features without any ordering. In those cases, an X in the table indicates that the algorithm selected the feature. A gap in the table means that the feature was not among the three best features selected by the corresponding algorithm. For logistic regression, the absolute values of the coefficients were considered. For CatBoost, we assigned a 1 to both ‘ST_Slope_up' and ‘ST_Slope_flat' because CatBoost selected ‘ST_Slope' as the most important feature. Finally, the OMNIXAI results were not included because they provided local explanations for only a few rows.

An interesting fact emerges from Table 1. Except for LightGBM, the feature ‘ST_Slope_up' had the highest or second-highest importance in all the algorithms that report feature importances. It was also selected by most of the algorithms that report selected features without importances. The feature ‘ST_Slope_flat' also performed quite well: for most algorithms, it was either among the three highest-importance features or in the selected feature group.

Now, let us delve into another interesting insight: these two features also had the highest and second-highest mutual information scores. As we saw in section B.10, mutual information is a straightforward measure computed with a single call to ‘mutual_info_classif'. So, with one line of code, we gained insight into the most important features of our data, in agreement with the other, significantly more computationally complex algorithms.

This article discussed twenty-one packages and methods that compute feature importance, a measure of a feature's contribution to a model's predictive ability. For further reading, I recommend [23], which discusses another role of features: their error contribution to a model.

The entire code can be found at https://github.com/theomitsa/Feature_importance/tree/main.

Thank you for reading!

References

  1. The Shapash package, https://shapash.readthedocs.io/en/latest/
  2. Molnar, C., Interpretable Machine Learning, 2023. https://christophm.github.io/interpretable-ml-book/
  3. The OMNIXAI package, https://opensource.salesforce.com/OmniXAI/latest/omnixai.html
  4. The InterpretML package, https://interpret.ml/
  5. The Dalex package, https://dalex.drwhy.ai/
  6. The Eli5 package, https://eli5.readthedocs.io/en/latest/index.html
  7. Mazzanti, S., Boruta Explained Exactly How You Wished Someone Explained to You, Medium: Towards Data Science, March 2020.
  8. Scornet E., Trees, Forests, and Impurity-Based Variable Importance, 2021, ffhal-02436169v3f, https://hal.science/hal-02436169v3/file/importance_variable.pdf
  9. Ke, G. et al., LightGBM: A Highly Efficient Gradient Boosting Decision Tree, NIPS Conference, pp. 3149–3157, December 2017.
  10. Banerjee, P., A Guide on XGBoost Hyperparameters Tuning, https://www.kaggle.com/code/prashant111/a-guide-on-xgboost-hyperparameters-tuning
  11. Prokhorenkova, L. et al., CatBoost: Unbiased Boosting With Categorical Features, NIPS'18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 6639–6649, December 2018.
  12. Urbanowicz, R.J. et al., Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining, Journal of Biomedical Informatics, vol.85, pp.168–188, Sept. 2018.
  13. Raju, S.K., Evaluation of Mutual Information and Feature Selection for SARS-CoV-2 Respiratory Infection, Bioengineering (Basel), vol. 10, no. 7, July 2023.
  14. Mazzanti, S., "MRMR" Explained Exactly How You Wished Someone Explained to You, Medium: Towards Data Science, February 2021.
  15. Radovic, M. et al., Minimum Redundancy Maximum Relevance Feature Selection Approach for Temporal Gene Expression Data, BMC Bioinformatics, January 2017.
  16. Kavya, D., Optimizing Performance: SelectKBest for Efficient Feature Selection in Machine Learning, Medium, February 2023.
  17. Urbanowicz, R. J. et al., Relief-Based Feature Selection: Introduction And Review, Journal of Biomedical Informatics, vol. 85, pp. 189–203, Sept. 2018.
  18. The Featurewiz package, https://github.com/AutoViML/featurewiz
  19. Sharma, H., Featurewiz: Fast Way to Select the Best Features in Data, Medium: Towards Data Science, Dec. 2020.
  20. The Selective Feature Selection Library, https://github.com/fidelity/selective
  21. The PyImpetus package, https://github.com/atif-hassan/PyImpetus
  22. Hassan, A. et al., PPFS: Predictive Permutation Feature Selection, https://arxiv.org/pdf/2110.10713.pdf
  23. Mazzanti, S., Your Features Are Important? It Doesn't Mean They Are Good, Medium: Towards Data Science, August 2023.
