How can XGBoost natively support categories?

XGBoost and other decision tree-based methods trained with Gradient Boosting rely on comparisons to make decisions, and it is not trivial to mathematically define a comparison operator for categories.
In this article, we explain the options available, detail their pros and cons, and focus on the native support for categorical features that has recently been introduced in XGBoost (and LightGBM).
If you're interested in Gradient Boosting and its application to Decision Tree training, please consider my book:
Practical Gradient Boosting: A deep dive into Gradient Boosting in Python
Decision trees
As shown in the figure below, decision trees base their decisions on comparisons:

For instance, in this simple example, if the input is a row of data with two columns, A=21 and B=111, then the output value will be the weight 4.
The limitation of this kind of decision tree is that it can only handle numbers as features.
Standard ways of handling categorical features
However, it is ubiquitous in data science to be confronted with features that are categorical. Let's see what options are at hand.
One hot encoding
One commonly encountered way to handle categorical values is one hot encoding. The idea is to convert non-numerical categorical values into numerical values by creating a column for each possible category.
When the category applies to the current row of the dataset, the corresponding column is set to 1, and to 0 otherwise.
The snippet of code below shows the standard way to perform one hot encoding with pandas:
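A minimal sketch, using pd.get_dummies on a toy dataframe (the column name and values are illustrative):

```python
import pandas as pd

# Toy dataframe with a single categorical column (illustrative values)
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# pd.get_dummies creates one 0/1 column per distinct category
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)
print(one_hot)  # columns: color_blue, color_green, color_red
```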
One of the main limitations of one hot encoding is that it adds as many columns to your dataset as there are distinct categorical values.
It's also important to note that you must have the same distinct values when training and predicting, otherwise some columns will be missing during prediction.
GLMM encoding
Another option is to use an encoder, and more specifically GLMM encoding. The idea here is to convert non-numerical categories to numbers using a model.
The conversion is done using GLMM, aka Generalized Linear Mixed Models.
Here, generalized means that GLMMs are simply a generalization of linear models where the target variable can be transformed using a function like a logarithm. Hence, using the same mathematical framework, standard linear regression as well as logistic regression can be modeled.
And mixed indicates that these models combine fixed and random effects, i.e. the mean of the variable to predict can be either fixed or random for a given set of observations.
Applied to the case of categorical encoding, GLMMs capture the random effect of each category by fitting this kind of model:
Y_i = μ + RE_ci

where Y_i is the value predicted by the mixed model for the target Y and the category i, μ is the global mean of Y (the fixed effect), and RE_ci is the random effect due to the category i.
Hence RE_ci captures the effect of the category, and this is the encoded value of the category.
Using pandas and the category_encoders library, all this theory boils down to these few lines of code:
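As the original snippet is not reproduced here, below is a minimal sketch of what it looks like, assuming the GLMMEncoder class of category_encoders and an illustrative color column with a numerical target y:

```python
import pandas as pd
import category_encoders as ce

# Illustrative data: one categorical feature and a numerical target
df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red", "blue"]})
y = pd.Series([1.0, 2.0, 3.0, 2.5, 1.5, 3.5])

# GLMMEncoder fits a mixed model with one random effect per category
# and replaces each category by its estimated random effect
encoder = ce.GLMMEncoder(cols=["color"])
encoded = encoder.fit_transform(df, y)
print(encoded)
```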
And if you're curious, the important line in the category_encoders library is:
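That line is not reproduced here; the sketch below shows the kind of call that is made internally, using statsmodels to fit a random-intercept model per category (the exact code in category_encoders may differ):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data: the target and the categorical feature
data = pd.DataFrame({
    "y": [1.0, 1.2, 2.0, 2.1, 3.0, 3.2, 1.1, 2.2],
    "color": ["red", "red", "green", "green", "blue", "blue", "red", "green"],
})

# Random-intercept model: y ~ mu + RE_color, i.e. the formula above
result = smf.mixedlm("y ~ 1", data, groups=data["color"]).fit()

# The estimated random effects are the encoded values of the categories
print(result.random_effects)
```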
This trains exactly the kind of model described by the formula above.
XGBoost Native support
As seen above, it is possible to handle categorical values either by converting them to real values (GLMM encoding) or by changing the dataset's structure and introducing a column for each possible category (one hot encoding).
But neither of these seems to be the best way to integrate support for categories into the core structure used by Gradient Boosting: the decision tree.
Inclusion instead of comparison
Indeed, it would seem natural to replace comparison with inclusion as the decision operation. That is, instead of checking whether a value is greater or smaller than a given threshold, we check whether the value belongs to a given set of categories.
This means that we can replace the comparison with a threshold (feature < threshold) by an inclusion test (feature ∈ {set of categories}).
Based on the code of my previous article, this can be done by updating the method that creates the node condition function so that it supports inclusion instead of comparison:
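As the original snippet is not embedded here, below is a minimal sketch of what such a method can look like; the function name _create_condition and its arguments are hypothetical, not the exact code of the previous article:

```python
def _create_condition(feature, threshold_or_categories, is_categorical):
    """Return a function that routes a row to the left child when True.

    feature: column name used by the node
    threshold_or_categories: a numerical threshold, or a set of categories
    is_categorical: whether the feature is categorical
    """
    if is_categorical:
        categories = set(threshold_or_categories)
        # Inclusion test: the row goes left when its category is in the set
        return lambda row: row[feature] in categories
    # Comparison test: the row goes left when its value is below the threshold
    return lambda row: row[feature] < threshold_or_categories
```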
Exhaustive generation of possible lists of categories
In the standard method, which uses a value as the splitting condition, the candidate splits are generated by ordering the values of the column being considered and keeping the unique values.
For each candidate splitting value, the gain is then evaluated for the left and right nodes resulting from the split.
When using categories, ordering is no longer relevant, and candidate conditions for splitting are no longer a single value, but a list of categories.
As we have no prior knowledge of the best way to group categories into sets that would maximize the gain, one option is to generate all the possible combinations of categories.
Hence, again based on the code of my previous article, we extract the code that creates threshold-based splits for numerical values into _numerical_split, and we introduce a new method _categorical_split to generate the possible combinations:
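Here is a minimal sketch of both, written as standalone functions for readability; the names follow the spirit of the article's code but are not the exact implementation:

```python
import itertools
import pandas as pd

def _numerical_split(column: pd.Series):
    """Yield (mask, candidate) pairs for threshold-based splits."""
    for threshold in sorted(column.unique()):
        # Boolean pandas mask: True for rows going to the left node
        yield column < threshold, threshold

def _categorical_split(column: pd.Series):
    """Yield (mask, candidate) pairs for inclusion-based splits."""
    categories = list(column.unique())
    # Generate every non-empty strict subset of the categories
    for size in range(1, len(categories)):
        for subset in itertools.combinations(categories, size):
            # Boolean pandas mask: True for rows whose category is in the subset
            yield column.isin(subset), set(subset)
```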
Note that pandas masks are a very convenient way to handle both cases similarly.
Putting it all together
The complete code, which implements Gradient Boosting for decision trees from scratch and supports categorical as well as numerical data, can be found below:
Except for the two main changes presented above, it is mostly the same code as in my previous article, made slightly more generic so that both numerical and categorical data are supported.
In the basic example provided, categorical and numerical splitting give the same results. The code is simply slower when using categorical support, as the number of combinations explored is large:
Note that we use the internal pandas dtype category to distinguish between numerical and categorical data.
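A small illustration of this convention, with a hypothetical dataframe mixing both kinds of columns:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [21, 7, 35],                   # numerical feature
    "color": ["red", "green", "blue"],  # categorical feature
})

# Declare the categorical column explicitly with the pandas "category" dtype
df["color"] = df["color"].astype("category")

# The dtype is then used to choose between numerical and categorical splitting
for name, column in df.items():
    if isinstance(column.dtype, pd.CategoricalDtype):
        print(name, "-> categorical split")
    else:
        print(name, "-> numerical split")
```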
Performance issues
The implementation presented above works as expected and does support categorical values.
However, as we exhaustively explore all the possible combinations of categories using Python's itertools.combinations, the code becomes very slow as soon as there are more than a few unique categories.
Indeed, the number of combinations of k categories out of a set of n categorical values is given by the binomial coefficient:

C(n, k) = n! / (k! (n − k)!)
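To get a feeling for how fast this grows with the exhaustive enumeration sketched above, a quick check with Python's math.comb (the values of n are illustrative):

```python
import math

# Total number of non-empty strict subsets of n distinct categories
for n in (5, 10, 20):
    total = sum(math.comb(n, k) for k in range(1, n))
    print(f"n={n}: {total} candidate splits to evaluate")
# n=5: 30, n=10: 1022, n=20: 1048574 (roughly 2^n)
```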
This is why XGBoost and LightGBM use a heuristic to reduce this number drastically. More details can be found here:

Conclusion
I strongly advise using the internal support for categorical data when using XGBoost or LightGBM. Not only is it a more memory- and computation-efficient method, it also provides good accuracy.
It also simplifies the data preparation pipeline, and it natively handles missing values, even when the training and prediction sets are disjoint.
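For reference, here is a minimal sketch of how this native support can be enabled with XGBoost's scikit-learn API (recent versions accept enable_categorical together with the hist tree method; the categorical columns must use the pandas category dtype; the data is illustrative):

```python
import pandas as pd
import xgboost as xgb

# Illustrative training data: one numerical and one categorical feature
X = pd.DataFrame({
    "A": [21, 7, 35, 12],
    "color": pd.Categorical(["red", "green", "blue", "green"]),
})
y = [1.0, 0.0, 1.0, 0.0]

# Native categorical support: no one hot or GLMM encoding needed
model = xgb.XGBRegressor(tree_method="hist", enable_categorical=True)
model.fit(X, y)
print(model.predict(X))
```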
If you're interested in Gradient Boosting and its application to Decision Trees, please consider my book:
Practical Gradient Boosting: A deep dive into Gradient Boosting in Python