Feature Subset Selection

TL;DR
Feature subset selection is important in supervised machine learning not just because it results in better models but also because of the insight it provides. This is particularly important now with the emphasis on interpretability in machine learning (ML).
The challenge for practitioners is that there is a bewildering array of Feature Selection methods available. In this post I provide a brief overview of this landscape and propose a strategy that will work in most cases. This strategy uses a Wrapper for the final selection process and, where necessary, permutation importance as an initial filter.
Introduction
In data analysis, objects described using multiple features may sometimes be described using a subset of these features without loss of information. Identifying these feature subsets is termed feature selection, variable selection or feature subset selection and is a key process in data analysis. This post provides a brief overview of feature subset selection (FSS) methods and also proposes a strategy that will work in most scenarios. This post is based on a tutorial paper available on arXiv [1]. Python code for the methods presented in that paper is available on GitHub.
Feature selection receives a lot of attention in ML because it can deliver a number of benefits:
- Better classifiers: The obvious benefit of feature selection is that it can improve accuracy, because redundant or noisy features can damage accuracy. Perhaps surprisingly, improvements in accuracy can be quite limited because powerful ML techniques are designed to be robust against noise.
- Insight: Perhaps the most enduring benefit of feature selection is the insight it provides. Identifying influential features and features that are not useful teaches us a lot about the data.
- Data Gathering: In domains where data comes at a cost (e.g. Medical Diagnosis, Manufacturing), identifying a minimal set of features for a classification task can save money.

The main strategies for FSS are summarised in Figure 1. Other surveys of feature selection [2,3] divide feature selection methods into three categories and we follow the same structure:
- Wrappers are feature selection methods where the classifier is wrapped in the feature selection process (see Figure 2). This wrapping allows classification performance to drive the feature selection process. This has the advantage of tying the feature selection to classifier performance but this comes with a significant computational cost as very many classifier variants will be evaluated during the selection.
- Filters cover methods that use criteria other than classifier performance to guide feature selection. Typically a filter provides a feature ranking and then a selection policy uses this ranking to select a feature subset.
- Embedded methods refer to any method where the feature selection emerges as a by-product of the classifier training process. For instance, training a decision tree will almost always select a subset of the available features to build a tree.

Methodology
When evaluating the performance of feature selection strategies we typically want to know how things will generalise to unseen data. As a proxy for this we hold back some data for testing (option (b) in Figure 3). If we wish to assess a few different feature selection alternatives as part of the model development then these should be tested within the confines of training data and cross validation is the most effective way to do this (option (c)). It should be remembered that if the objective is to perform feature selection as part of the deployment of an ML system then all the available data can be used for feature selection (option (a) in Figure 3).

Before proceeding we need to introduce the notation that will be used. Assume we have a dataset D made up of n data samples. D = ⟨X, y⟩ where y are the class labels. The examples are described by a set of features F where p = |F| so there are n objects described by p features. So X has dimension n×p and y is a vector of length n. The objective is to identify a subset S ⊂ F that captures the important information in the dataset. After feature selection the dataset is reduced to X′ with dimension n×k where k = |S|.
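As a minimal illustration of this notation, the sketch below (with made-up dimensions and feature indices) shows X being reduced to X′ by keeping only the columns indexed by a selected subset S:

```python
# Tiny illustration of the notation: X is n×p and selecting S ⊂ F gives X′ (n×k).
import numpy as np

n, p = 6, 4                       # n objects described by p features
X = np.arange(n * p).reshape(n, p)
S = [0, 2]                        # indices of the selected features, so k = 2
X_prime = X[:, S]                 # the reduced dataset X′
print(X.shape, X_prime.shape)     # (6, 4) (6, 2)
```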
Some summary statistics of the datasets used in this tutorial are shown in Table 1. These datasets are available in the GitHub repository.

Wrappers
If |F| is small we could in theory try out all possible subsets of features and select the best subset. In this case ‘try out' would mean training and testing a classifier using the feature subset. This would follow the protocol presented in Figure 3 (c) where cross-validation on the training data would identify a good feature subset and then this could be tested on the test data. However the number of possibilities is 2ᵖ so exhaustive search quickly becomes impossible – for instance if p=20 there are over 1 million possibilities to consider.
Nevertheless, this is how a Wrapper feature selection strategy works, with the important modification that the search can be greedy or stochastic rather than exhaustive. The general idea is shown in Figure 2(a): the classifier is wrapped in the feature selection process, i.e. classifiers trained using candidate feature subsets are used in the search process. The feature subsets are evaluated using hold-out testing or cross-validation testing on classifiers built using the data. The main search strategies used with Wrappers are:
- Exhaustive Search evaluates every possible feature subset. If the number of features to be considered is small it will be possible to consider all feature combinations. However, if p > 20 there will be millions of feature subsets to be considered and an exhaustive search will not be practical.
- Sequential Forward Selection (SFS) starts with no features selected and all classifiers incorporating a single feature are considered (see Figure 4 (a)). The best of these is selected and then two-feature combinations including this feature are evaluated. This process proceeds, adding the winning feature at each step, until no further improvements can be made.
- Backward Elimination (BE) proceeds in the opposite direction to SFS: it starts with all features selected, considers the options with one feature deleted, selects the best of these and continues to eliminate features (see Figure 4 (b)). Again, the process is terminated when no improvements can be made.
- Stochastic Search methods such as genetic algorithms or simulated annealing can readily be applied to Wrapper feature selection. Each state can be defined by a feature mask on which crossover and mutation can be performed [4]. Given this convenient representation, the use of a stochastic search for feature selection is quite straightforward although the evaluation of the fitness function (classifier accuracy as measured by cross-validation) is expensive.

Our exploration of Wrappers will focus on SFS and BE. These are greedy strategies that explore the search space of possible feature subsets as shown in Figure 4. SFS starts with an empty set and proceeds forward, considering classifiers built on single features. The best of these is selected and then pairs of features incorporating this feature are considered. The process terminates when the addition of a new feature doesn't result in any improvement. As the name suggests, Backward Elimination works in the opposite direction. It starts with a full set of features (Figure 4(b)) and eliminates the least useful feature at each step. For both SFS and BE, the feature subsets are evaluated using cross-validation on the training data. The evaluation is done on the Segmentation dataset and the classifier used is k-Nearest Neighbor (k-NN) as it is quite sensitive to noisy or redundant features. The Python notebook is available on GitHub.
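A Wrapper search along these lines can be sketched with scikit-learn's SequentialFeatureSelector; the Wine data and the 3-NN classifier below are stand-ins rather than the exact Segmentation setup used in the notebook:

```python
# Sketch of SFS and BE using scikit-learn's SequentialFeatureSelector.
# The dataset and classifier are illustrative stand-ins.
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)

# SFS: start empty, keep adding the best feature while CV accuracy improves.
sfs = SequentialFeatureSelector(knn, direction='forward',
                                n_features_to_select='auto', tol=1e-3, cv=5)
sfs.fit(X_train, y_train)
print("SFS selected:", list(X_train.columns[sfs.get_support()]))

# BE: start full, keep removing features while CV accuracy drops by < 0.001.
be = SequentialFeatureSelector(knn, direction='backward',
                               n_features_to_select='auto', tol=-1e-3, cv=5)
be.fit(X_train, y_train)
print("BE selected:", list(X_train.columns[be.get_support()]))
```

Here n_features_to_select='auto' together with a small tol approximates the "stop when no further improvement" rule described above.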
Both methods have their own advantages and disadvantages, as can be seen in Figure 5. SFS tends to require less computation as the models being evaluated are smaller; a classifier with a small number of features will typically take less time to train and test. SFS is also inclined to select fewer features (see Figure 5(a)); this parsimony is typically an advantage. On the other hand, because BE starts with larger feature sets, it can do a better job of assessing how features work in combination.
The overall results for SFS and BE are shown in Figure 5(b). SFS selects seven features while BE selects 11. Both feature subsets result in improved accuracy on the training data but only the SFS subset results in better accuracy on the test data. The gap between train and test accuracy for BE is evidence of overfitting – the selection process has fitted too closely to the characteristics of the training data at the cost of generalisation accuracy. Indeed, overfitting is recognised to be a problem with Wrapper-based feature selection [4].

Filters
Figure 2(a) shows how Wrapper strategies use the classification algorithm in the feature selection process. Figure 2(b) shows that Filter strategies do not use the classifier for feature selection; instead a separate evaluation function is used. The fact that Filters are independent of the classifier is a mixed blessing. It means that Filters can be much faster than Wrappers but the selected features may not be in tune with the inductive bias of the classifier.
A Filter will entail a feature scoring mechanism and then a selection strategy based on these scores. The scoring mechanism needs to quantify how much information the feature has about the outcome. The selection strategy might be:
- Select the top ranked k features,
- Select top 50%,
- Select features with scores > 50% of the maximum score,
- Select features with non-zero scores,
- A hybrid Filter/Wrapper strategy whereby features are ranked using a filter and then the performance of subsets based on this ranking is evaluated.
We will now look at three Filter strategies – the Chi-square statistic, information gain and permutation feature importance.
The Chi-square statistic is a measure of independence between a feature and the class label. If samples are organised into a contingency table as shown in Figure 6, how different are the cell counts from what would be expected if the feature and the class were independent? The data in Figure 6(a) suggests that handedness is independent of gender because the proportions are the same. The data in (b) suggests that gender is predictive of handedness.

The Chi-square statistic allows us to quantify this:

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ
The statistic is a sum over the m cells. For each cell we consider the difference between the observed count Oᵢ and the expected count Eᵢ if the feature and the class were independent. In Figure 6(a) this difference would be zero because the feature and the class are independent. In (b) there would be a difference so the statistic would be positive. In general, the greater the dependence the larger the statistic. If the feature values are numeric rather than categorical then the feature values can be binned to enable the construction of the contingency table [5].
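For a concrete version of the handedness example, scipy's chi2_contingency computes the statistic directly from a contingency table; the counts below are illustrative rather than the actual values behind Figure 6:

```python
# Chi-square test on contingency tables in the spirit of Figure 6.
from scipy.stats import chi2_contingency

# rows: gender (male, female); columns: handedness (right, left)
proportions_same = [[45, 5], [45, 5]]     # like Figure 6(a): independent
proportions_differ = [[43, 9], [24, 24]]  # like Figure 6(b): dependent

for table in (proportions_same, proportions_differ):
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p-value = {p:.4f}")
```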
Information gain is an alternative, information-theoretic, measure quantifying the information a feature contains about a class [6]. In Figure 6(b), by knowing the gender we gain information about handedness. In a binary classification scenario, let's assume the probabilities of a positive and a negative outcome are respectively p and q. Then the entropy of a dataset based on these proportions is:

H(D) = −p·log₂(p) − q·log₂(q)
Then the information gain for any feature f in the dataset in terms of the class label is:

I-Gain(D, f) = H(D) − Σᵥ (|Dᵥ|/|D|)·H(Dᵥ)

where the sum runs over the values v of feature f and Dᵥ is the subset of D for which f takes the value v.
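A hypothetical worked example, again loosely based on the handedness table, shows how I-Gain is computed in practice (the counts below are made up):

```python
# Entropy and information gain for a handedness-style table (illustrative counts).
from math import log2

def entropy(pos, neg):
    total = pos + neg
    p, q = pos / total, neg / total
    return -sum(x * log2(x) for x in (p, q) if x > 0)

# Class: right/left handed; feature f: gender with values male and female.
male = (43, 9)       # (right-handed, left-handed) counts for f = male
female = (24, 24)    # counts for f = female

h_dataset = entropy(male[0] + female[0], male[1] + female[1])
n = sum(male) + sum(female)
h_after = (sum(male) / n) * entropy(*male) + (sum(female) / n) * entropy(*female)
print(f"I-Gain(handedness | gender) = {h_dataset - h_after:.3f}")
```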
As with the Chi-square statistic, information gain (I-Gain) allows us to rank features for the purpose of feature selection. This is illustrated in Figure 7. This shows the Segmentation features ranked by both measures. The Python notebook is available at this GitHub <link>. The plot shows the scores sorted by I-Gain score. It is clear that the scores are well correlated (Pearson correlation score of 0.86) so feature subsets selected based on these scores should be reasonably similar.

This does prove to be the case when we look at the performance of classifiers built with feature subsets based on these rankings. In Figure 8 we see the results of a range of top k selection policies (k = 3, 6, 10, 15). At k = 10 both scores select a feature subset that produces accuracy on the test set equivalent to that obtained with the full feature set. The evaluation strategy here conforms to pattern (b) in Figure 3, the feature scoring is done using the training data and then tested on the test set.
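A top-k Filter of this kind can be sketched in scikit-learn as follows; mutual_info_classif stands in for information gain and the Wine data is a generic stand-in for the Segmentation setup:

```python
# Top-k Filter selection with Chi-square and mutual information scores.
# chi2 requires non-negative feature values, hence the MinMaxScaler.
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for score_func in (chi2, mutual_info_classif):
    for k in (3, 6, 10):
        pipe = make_pipeline(MinMaxScaler(),
                             SelectKBest(score_func, k=k),
                             KNeighborsClassifier(n_neighbors=3))
        pipe.fit(X_train, y_train)
        print(score_func.__name__, k, round(pipe.score(X_test, y_test), 3))
```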

Permutation Feature Importance is based on the principle that if we want to find out how important something is to a process we can break it and see what happens. To score the importance of a feature we can permute the values for that feature in a test set and measure the impact on the overall accuracy. If the error increases significantly when a variable is noised in this way, that variable is important. If the error does not increase, that variable is not useful for the classification. The overall process is as follows:
- Fit a classifier on the data
- Calculate a baseline accuracy
- Shuffle feature values and calculate accuracy again
- Measure increase in error against error without shuffling
This process is typically repeated multiple times (say 10) to get more stable feature importance scores. Permutation importance is used as the first stage in the proposed strategy presented at the end of this article.
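This procedure is available off the shelf in scikit-learn; the sketch below uses a stand-in dataset and classifier rather than the exact notebook setup:

```python
# Minimal sketch of permutation feature importance with scikit-learn.
from sklearn.datasets import load_wine
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Shuffle each feature 10 times and record the mean drop in accuracy.
result = permutation_importance(knn, X_test, y_test, n_repeats=10,
                                random_state=42)
for name, score in sorted(zip(X_test.columns, result.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```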
In Figure 9 we see the permutation importance scores for k-NN and Gaussian Naive Bayes on the Segmentation dataset (notebook <here>). We can see that the rankings are reasonably well correlated but by no means the same. This difference is down to the different classifiers ‘preferring' different features and to an inherent instability in feature selection methods.

Embedded Methods
In this section we cover feature selection methods that emerge naturally from the classification algorithm or arise as a side effect of the algorithm. We will see that with Decision Trees and Logistic Regression feature selection can be an integrated part of the model building process.
Decision Trees: The construction of a Decision Tree from a data set will very often entail feature selection as some of the features will not appear in the tree. Features not included in the tree are effectively selected out. We show an example of this on the Penguins dataset in Figure 10.

In this example the dataset has been divided 50:50 into train and test sets. This tree has been trained on the training data and has 93% accuracy on the test data (see notebook <here>). This dataset has four features, flipper length, bill length, bill depth and body mass. It is clear from the tree in Figure 10 that three of the four features are selected, body mass is not selected.
This tree has been constructed with the default scikit-learn parameters so there is no pruning. It is normal in Decision Tree learning to constrain (i.e. prune) the size of the tree to prevent overfitting. Pruning pushes the feature selection further, as even fewer features will be selected in smaller trees.
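A quick way to see this embedded selection in code is to inspect a fitted tree's feature importances: features with zero importance never appear in the tree. The Iris data below is a stand-in for the Penguins dataset used in the figure:

```python
# Sketch of the 'embedded' selection a Decision Tree performs.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Test accuracy:", round(tree.score(X_test, y_test), 3))

# Features with zero importance were never used in a split, i.e. deselected.
used = X_train.columns[tree.feature_importances_ > 0]
print("Features appearing in the tree:", list(used))
```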
Logistic Regression, Lasso: In multivariate linear models such as linear regression or logistic regression, feature selection can be achieved as a side effect of regularization. In ML, regularization refers to mechanisms designed to simplify models in order to prevent overfitting. Thus regularization can cause features to be deselected. Elastic net and Lasso are popular regularization methods for linear models. Here we will provide an overview of how Lasso works [7] and present examples of Lasso in operation. Starting with the basics, a multivariate regression works as follows:

y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ
The dependent variable y is a linear function of the input features; for each feature xᵢ the weight of that feature is determined by the corresponding βᵢ parameter. For binary classification problems ({0,1} labels) we can use logistic regression, where the dependent variable is the log odds that the outcome is 1. If pr is the probability that the label is 1 then the odds are pr/(1−pr) and the model is:

log(pr/(1−pr)) = β₀ + β₁x₁ + … + βₚxₚ
So logistic regression provides a class probability:

pr = 1 / (1 + e^−(β₀ + β₁x₁ + … + βₚxₚ))
Regularization prevents overfitting by limiting model capacity; this is done by limiting the size of the weights. The two options are L₁ or L₂ regularization, which constrain the fitted weights as follows:

L₁: Σᵢ|βᵢ| ≤ t        L₂: Σᵢβᵢ² ≤ t
So the β parameters are fitted to the training data subject to these L₁ or L₂ constraints. It transpires that when L₁ regularization is used the weaker weights will go to zero, i.e. those features will be deselected. There is an excellent explanation of why this happens in the original Lasso paper by Tibshirani [7].
To demonstrate this on our sample datasets we reduce them to binary classification problems to make the overall process more transparent (notebook <here>). However, feature selection using Lasso also works with multiclass problems. The results are shown in Figures 11 and 12. Because the datasets have been reduced to just two classes (Cement and Window for Segmentation and Adelie and Chinstrap for Penguins) the accuracies are higher than for the multi-class scenario.


The extent of the feature reduction with Lasso is controlled by the regularization parameter C. Results are included for two levels of regularization, C=10 and C=1. C=10 results in less regularization so more features are retained. For both datasets the default regularization (C=1) results in too much feature reduction and generalization accuracy is reduced. For the Penguins dataset just two features are retained while three are retained for the Segmentation dataset (see Figure 15). The milder regularization (C=10) retains more features, resulting in no loss of generalization accuracy.
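The sketch below shows the same effect using scikit-learn's L1-penalised logistic regression at the two levels of regularization; the binary dataset here is a stand-in for the reduced Segmentation and Penguins problems:

```python
# L1-regularized (Lasso-style) logistic regression zeroing out weak features.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler().fit(X_train)
for C in (10, 1):   # smaller C means stronger regularization
    lr = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    lr.fit(scaler.transform(X_train), y_train)
    kept = X_train.columns[np.abs(lr.coef_[0]) > 0]
    acc = lr.score(scaler.transform(X_test), y_test)
    print(f"C={C}: {len(kept)} features retained, test accuracy {acc:.3f}")
```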


Proposed Strategy
As stated at the beginning, our proposed strategy will use a Wrapper for the final selection process and, where necessary, permutation importance as an initial filter (GitHub <link>). The example we present here uses the Ionosphere dataset from the UCI repository. This dataset has 34 features so there are over 17 billion possible feature subsets. Even with a greedy search strategy (e.g. SFS or BE), applying a Wrapper to the full feature set would be very computationally expensive. So we use the two-step strategy shown in Figure 13; we use permutation importance to reduce the 34 features to a candidate set of 18 and then employ a Wrapper.

The classifier used in this example is k-NN and we hold back 50% of the data for final testing. The scores from the permutation feature importance stage are shown in Figure 14. We drop the features that don't have a positive score, leaving us with 18 features to pass to the Wrapper stage.

We then run a Wrapper search considering these 18 features, using the training data only. We use backward elimination as described earlier and it selects 16 features. In Figure we see accuracy estimates for the three feature sets. The estimates on the training data use cross validation and the hold-out estimate shows accuracy on the 50% held back from the feature selection process.
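A sketch of the two-stage pipeline, using a stand-in dataset in place of Ionosphere (which the notebook reads from file) and computing the permutation scores directly on the training data for brevity, might look like this:

```python
# Two-stage strategy: permutation importance as a filter, then a Wrapper
# (backward elimination) on the surviving features.
from sklearn.datasets import load_breast_cancer   # stand-in for Ionosphere
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=42)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Stage 1: keep only features with a positive permutation importance score.
pi = permutation_importance(knn, X_train, y_train, n_repeats=10,
                            random_state=42)
candidates = X_train.columns[pi.importances_mean > 0]

# Stage 2: backward elimination over the candidate features only.
be = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=3),
                               direction='backward',
                               n_features_to_select='auto', tol=-1e-3, cv=5)
be.fit(X_train[candidates], y_train)
selected = candidates[be.get_support()]
print("Final subset:", list(selected))
```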

We see that both feature selection stages produce accuracy improvements on the training data but for the Wrapper stage we see no further improvement on the test data. This is in line with other research that shows that in-depth feature selection effort can result in overfitting.
Conclusion
So that concludes our overview of state-of-the-art feature subset selection methods. One of the challenges is that there are so many alternative methods to choose from; the strategy we propose is:
- If you have plenty of training data, the two-stage strategy presented above will be worthwhile.
- If training data is scarce and there is a risk of overfitting you could stop feature selection after the permutation importance step.
Python implementations (using scikit-learn) for the methods covered here are available in this GitHub repository.
References
- Cunningham, P., Kathirgamanathan, B. and Delany, S.J. (2021). Feature Selection Tutorial with Python Examples. arXiv preprint arXiv:2106.06437.
- Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
- Molina, L.C., Belanche, L.A. and Nebot, M.À. (2002). Feature selection algorithms: a survey and experimental evaluation. In 2002 IEEE International Conference on Data Mining (ICDM 2002), pages 306–313.
- Loughrey, J. and Cunningham, P. (2004). Overfitting in wrapper-based feature subset selection: the harder you try the worse it gets. In International Conference on Innovative Techniques and Applications of Artificial Intelligence, pages 33–43. Springer.
- Jin, X. et al. (2006). Machine learning techniques and chi-square feature selection for cancer classification using SAGE gene expression profiles. In International Workshop on Data Mining for Biomedical Applications, pages 106–115. Springer.
- Kelleher, J.D., Mac Namee, B. and D'Arcy, A. (2020). Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. MIT Press.
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.