Exploring Hierarchical Blending in Target Encoding

What neighborhood do you live in? What drug were you prescribed? Why did you cancel your streaming subscription? These days, there's a code for that, stored in databases by whichever government agencies, businesses, and other organizations you interact with. If you work in data, you probably encounter many such codes. When they can take many possible values, such codes are called "high-cardinality categorical features".
Some high-cardinality categoricals have a hierarchical structure. Figure 1 depicts such a structure, the North American Industry Classification System (NAICS), which is used by the US government to classify businesses [1].

Many code sets can be represented as a hierarchy. For example, US geographic regions can be divided into many small areas (zip codes) or a few very large ones (US Census regions, e.g. "West"). Similarly, the American Medical Association defines ~475 areas of provider specialization, which roll up into classifications, groupings, and sections.
Although high-cardinality categoricals (hierarchical or not) have too many degrees of freedom for direct incorporation into machine learning models, encoding or embedding methods can leverage information in these features. Target encoding (also called "mean encoding" or "impact encoding") is a popular choice for tree-based models. Neural networks often use entity embeddings that map codes to a vector of lower dimensionality. However, these techniques do not incorporate information from hierarchical code structures.
There are some exciting research methods for hierarchical categoricals in machine learning, for example treating high-level groupings in a manner analogous to random effects in mixed modeling [2]. But these are not yet in widespread use.
A simpler option relevant for tree-based models is suggested by a Towards Data Science article by Daniele Micci-Barreca, which involves blending general group information in high-cardinality categoricals into target encodings [3].
Hierarchical blending is relevant for unseen or low volume codes, which may be important in certain contexts. For instance, maybe a company is expanding into a new market, or an existing drug is being used to treat a different disease.
Hierarchical blending tries to improve target mean estimates for unseen or low volume codes by using information from more general groups. For example, if we don't have data from bagel shops, perhaps the average behavior of "Limited-Service Eating Places" provides a starting point.
This sounds plausible, and I set off to test this blending strategy for NAICS codes in a public dataset.
Happily, using the full NAICS hierarchy improved the model performance for unseen codes in this dataset. However, results were less favorable when I used only general levels of the hierarchy. Some code structures were counter-productive.
In this blog post, I compare results for standard target encoding and entity embeddings with various blended target encodings. I also try some constructed groups using a graph neural network technique.
I find that hierarchical blending can lead to overfitting for some groupings. There appears to be a bias-variance tradeoff, and models can amplify systematically incorrect estimates of target means. In addition, groupings that are too broad may be less effective. Given these results, it may be advisable to test a code structure to make sure it works with hierarchical blending.
In the following sections, I try hierarchical blending for NAICS in a public small business loans dataset. Results are compared to other methods of dealing with hierarchical categorical features. I explore possible limitations of hierarchical blending by trying different levels of the NAICS "pyramid", as well as alternative groupings derived from the data using a graph neural network technique.
Dataset and Methods
This blog post focuses on North American Industry Classification System (NAICS) industry codes [1] in the U.S. Small Business Administration (SBA) loans dataset [4–5]. The models predict loan defaults among small businesses. Code can be found on GitHub [6].
Kaggle models for the SBA loans dataset often show strong performance, with accuracies and AUC scores in the high 90s. A highly important feature is loan term [7, 8]; I suspect term reflects the creditworthiness of a business, based on information not made public. The SBA apparently does a pretty good job assessing risk, as this feature is a strong predictor. However, I am interested in NAICS; therefore, I leave out loan term, and model performance becomes much more modest. Leveraging data exploration done by others [7–9], I select 8 predictors, which include NAICS, number of employees, jobs created by the loans, loan amounts, and franchise status.
The SBA loans dataset contains loans with activity in 1987–2014. I use loans approved after 1990 and drop rows with missing NAICS, so my dataset has 688,081 rows and 1,311 unique 6-digit NAICS codes (likely drawn from several NAICS vintages). Because target encoding blending focuses on low-volume or unseen categories, I set aside a random 10% sample of NAICS codes; this holdout set contains 131 codes and 93,454 rows. For the remaining rows, I do a 70/15/15 train/validation/test split. Results will be shown for the test split and the holdout set, neither of which is used in training or hyperparameter tuning.
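The code-level holdout can be sketched as follows; this is an illustrative stand-in for the split logic in the repo [6], with made-up function names:

```python
import random

def split_by_code(rows, code_of, holdout_frac=0.10, seed=42):
    """Set aside a random fraction of *codes* (not rows) as a holdout,
    then split the remaining rows 70/15/15 into train/val/test.
    `rows` is any list of records; `code_of` extracts a record's code."""
    rng = random.Random(seed)
    codes = sorted({code_of(r) for r in rows})
    held = set(rng.sample(codes, int(len(codes) * holdout_frac)))
    holdout = [r for r in rows if code_of(r) in held]
    rest = [r for r in rows if code_of(r) not in held]
    rng.shuffle(rest)
    n = len(rest)
    return (rest[:int(0.70 * n)],               # train
            rest[int(0.70 * n):int(0.85 * n)],  # validation
            rest[int(0.85 * n):],               # test
            holdout)                            # unseen-code holdout
```

The key point is that the holdout is sampled by code, so every row belonging to a held-out NAICS is unseen during training.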
XGBoost Modeling
Baseline Models
Let's start by measuring performance without NAICS information, using XGBoost, a boosted tree model. About 20% of loans default, so classes are slightly imbalanced. If this were a real-life application, detecting defaults would likely be a priority, so I report precision-recall AUC (PR-AUC) scores; ROC-AUC trends are similar.
In addition to a no-NAICS model, I look at one-hot encoding. The top of the NAICS "pyramid" is the set of 20 sectors (Figure 1). I use Scikit-Learn's [10] OneHotEncoder, keeping levels with at least 5% of rows; 9 of the 20 sector values are encoded, and the rest are combined into an "other" group.
Table 1 shows performance for the no-NAICS baseline and one-hot encoding models. One-hot sector encoding provides a small performance boost for the standard test set.

I use images of tables for readability in this blog; raw data is available in the repo [6].
Target Encoding Models
Target encoding is a common strategy for high-cardinality categoricals. It essentially replaces categorical levels with their mean response; to prevent overfitting on low-volume or unseen codes, the overall mean is "blended" into the category means. The degree of blending is high for infrequent codes, while frequent codes are less affected.
Many packages make target encoding easy; however, for better comparison I create my own encoder, which uses a sigmoidal blending function with its midpoint at 25 loans and a width of 20. The same blending function will also be used for hierarchical blending.
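A minimal sketch of such an encoder, assuming the same sigmoid parameters (midpoint 25 loans, width 20) but not reproducing the repo's exact implementation:

```python
import math
from collections import Counter, defaultdict

def blend_weight(n, midpoint=25, width=20):
    """Sigmoidal weight: near 0 for rare codes, near 1 for frequent ones."""
    return 1.0 / (1.0 + math.exp(-(n - midpoint) / width))

def target_encode(codes, targets, midpoint=25, width=20):
    """Blend each code's target mean with the overall mean."""
    overall = sum(targets) / len(targets)
    sums, counts = defaultdict(float), Counter()
    for c, y in zip(codes, targets):
        sums[c] += y
        counts[c] += 1
    enc = {}
    for c in counts:
        lam = blend_weight(counts[c], midpoint, width)
        enc[c] = lam * (sums[c] / counts[c]) + (1 - lam) * overall
    return enc, overall  # unseen codes fall back to `overall`
```

A frequent code keeps essentially its own mean; a code seen only twice is pulled most of the way toward the overall rate.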
The first teal row in Table 2 below shows model performance with this standard target encoding. For the usual randomly selected test set, I see a large performance boost over baseline! Of course, the unseen codes don't receive this benefit, as they are simply assigned the mean overall default rate (gray shaded cell).

Micci-Barreca [3] suggests incorporating hierarchical category information in target encoding. Instead of being blended with the overall mean response, low-volume or unseen codes shrink towards the mean for the next level up in the code hierarchy, if available. The blending is repeated for the next level of the pyramid, as necessary, with the contributions from each hierarchy level weighted according to population.
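The idea can be sketched as a sequential blend from the broadest group down to the code itself, reusing the same sigmoid. This is my simplified reading of [3], not the reference implementation:

```python
import math

def blend_weight(n, midpoint=25, width=20):
    """Sigmoidal weight: near 0 for rare groups, near 1 for populous ones."""
    return 1.0 / (1.0 + math.exp(-(n - midpoint) / width))

def hierarchical_estimate(chain_stats, overall, midpoint=25, width=20):
    """Blend down a hierarchy, broadest level first.
    chain_stats: (count, mean) pairs from the broadest group (e.g. sector)
    down to the 6-digit code; levels absent from training are omitted."""
    estimate = overall
    for n, mean in chain_stats:
        lam = blend_weight(n, midpoint, width)
        estimate = lam * mean + (1 - lam) * estimate
    return estimate
```

For a frequent code, the final level's weight is near 1, so the code's own mean dominates; for an unseen code, the chain ends at the deepest group with data, and the estimate shrinks toward that group rather than toward the global mean.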
When I use the full NAICS hierarchy, I see performance improvement for holdout NAICS (row 4 in Table 2)!
But what if I start blending at a different level in the pyramid? The remaining rows in Table 2 show what happens when hierarchical blending starts higher up. Starting at "Subsector", I see less benefit, and for sector-level blending, performance on the holdout set falls below baseline.
It appears that the details of the hierarchy matter. Different code sets would have different organizations, with different granularities. Could there be categoricals where hierarchical blending will not work at all?
Neural Network Models
In addition to XGBoost, I tried hierarchical blending in neural network models. Results are compared to standard techniques (Table 3):

The baseline model without NAICS performs similarly to XGBoost (the tanh activation is helpful for this data). As another baseline, I try entity embeddings, a standard method for high-cardinality features in neural networks (row 2 in Table 3). Entity embeddings are outputs from an intermediate neural network layer that maps the high-cardinality feature into a lower-dimensional numeric space (I use 8 dimensions).
Entity embeddings show a huge performance improvement for the random test set, but fare terribly on the unseen codes. My data has no missing NAICS, so unseen codes are mapped to a value the model never sees in training and does not optimize for. When I randomly assign 10% of the training data to this "unknown" value, holdout performance is more like the baseline (row 3 in Table 3).
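The unseen-code handling can be sketched as an index lookup with a reserved "unknown" slot plus random masking during training; the function names and numpy-based setup are illustrative, not from the repo:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_index(train_codes):
    """Reserve index 0 for unknown/unseen codes; known codes start at 1."""
    return {c: i + 1 for i, c in enumerate(sorted(set(train_codes)))}

def lookup(codes, index, mask_frac=0.0):
    """Map codes to embedding-table indices. During training, randomly
    send `mask_frac` of rows to the unknown slot so its vector gets trained."""
    idx = np.array([index.get(c, 0) for c in codes])
    if mask_frac > 0:
        idx[rng.random(len(idx)) < mask_frac] = 0
    return idx
```

Without the masking step, row 0 of the embedding table stays at its random initialization, which is why unseen codes score so poorly.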
If, instead of entity embeddings, I include target-encoded NAICS as a feature, I see results similar to the entity embedding with random missing values. Using hierarchical target encoding improves performance on unseen codes, much as it did in the XGBoost models.
In sum, neural network results for target encodings are qualitatively similar to XGBoost. An additional observation is that entity embeddings can be problematic for unseen codes. If generalizing a model to unseen codes is important, I'd probably avoid entity embeddings; to me, the random missing solution feels like more hassle than target encoding.
In the next section, I'll explore some embedding methods to understand how the specific code hierarchies affect results. I start by constructing an alternate hierarchy based on data.

Deep Graph Infomax Embeddings
Earlier, I used different parts of the standard NAICS hierarchy to look for limitations of hierarchical target encoding. Here, I try alternative groupings, hoping to get more insights about the effectiveness of this method. I will use Deep Graph Infomax (DGI) to construct a hierarchy from features in the data.
DGI is an unsupervised graph machine learning technique [11]. Node representations are learned by contrasting real graph neighborhoods with corrupted ones, extracting feature relationships associated with the true graph structure.
I use DGI to get a lower-dimensionality summary of information that distinguishes one NAICS from another. My graph is simple; loan nodes are connected to NAICS nodes. I again fit embeddings of size 8. See the GitHub repository [6] for more methodology details. Because DGI is based on predictors, it can generate embeddings for unseen codes. (Here, I train DGI only on the training and validation slices, but it wouldn't be wrong to use the whole dataset, depending on the use case.) Figure 2 shows a t-SNE [12] visualization of the results.

NAICS sector doesn't seem strongly related to the embeddings; the silhouette score is -0.14. However, Figure 2 shows that the embeddings have some relationship with loan default rates.
To group codes, I use k-means clustering. The best clustering occurs at k=3 (silhouette score 0.60). Figure 3 shows results for k-means clustering of embeddings. At higher k, the groups are not necessarily well-separated, but are adjacent and can be seen as a segmentation.
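Selecting k by silhouette score can be sketched as below; the blob data is a toy stand-in for the 8-dimensional DGI embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy stand-in for the 8-dimensional DGI embeddings: three separated blobs.
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 8)) for c in (0, 3, 6)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)  # highest silhouette wins
```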

For comparison with NAICS codes, I use levels of similar granularity for DGI (based on counts in my dataset). Cluster counts of k=834, 353, 106 and 20 are used to construct groups analogous to industry, industry group, subsector, and NAICS sector, respectively. Table 4 shows results from hierarchical blending using this "pyramid":

The DGI groupings perform very poorly! Most groupings are counter-productive.
From the above, I first conclude that hierarchical blending doesn't always work. Notably, the DGI groupings are not inherently bad: they correlate with target rates, and when k=10 clusters are compared to the 9+1 unknown-NAICS groups in one-hot encoding, the DGI groups perform better than standard NAICS groups for the holdout codes (not shown, see [6]).
What makes the DGI groupings ineffective in hierarchical encoding, despite working well enough when used in other ways? The encoded values themselves seem to generalize poorly compared to the NAICS-based ones. Target encodings are themselves a (simple) model of loan defaults; therefore, their performance can be measured (Table 5).

Hierarchical blending is a better model of mean response than "regular" target encoding, for both data sets and both groupings. However, although DGI-based hierarchical encodings have similar results as NAICS-based hierarchical encodings on the random test data, they perform worse for holdout codes.
When mean encoded values are compared to actual target rates, RMSEs are similar for standard target encoding and both types of hierarchical encodings. The actual values of the DGI-based encodings show a tendency towards lower values (mean of 17.5% vs. 19.6% for NAICS-based encodings).
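Treating an encoding as a tiny model and scoring it can be sketched as below; the function and inputs are illustrative, reporting both RMSE and the mean encoded value (which exposes systematic bias of the kind described above):

```python
import math

def score_encoding(actual_rates, encoded, fallback):
    """Score an encoding as a simple model of per-code target rates.
    `actual_rates` and `encoded` map code -> rate; codes missing from
    `encoded` get the `fallback` value. Returns (RMSE, mean prediction)."""
    preds = [encoded.get(c, fallback) for c in actual_rates]
    errs = [(p - a) ** 2 for p, a in zip(preds, actual_rates.values())]
    rmse = math.sqrt(sum(errs) / len(errs))
    return rmse, sum(preds) / len(preds)
```

A mean prediction well below the true overall rate signals the kind of downward bias seen in the DGI-based encodings.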
SHAP dependence plots for encodings in XGBoost models show a similar response curve for the standard target encodings and both types of hierarchical encodings (Figure 4A). PDP results are very similar (not shown, see [6]). The similarity of the response curves makes sense, as most codes have similar values for all three encoding schemes.

However, the actual values of the DGI-based encodings are lower, and further from the training mean default rate of 20.5% (Figure 4B). Because the model response is stronger away from the mean, the errors in the DGI encodings are amplified; the model responds too strongly and overfits.
Final Thoughts
All generalities are false, including this one – Mark Twain
Hierarchical blending is only useful if a classification system is relevant to the problem and at the correct level of detail.
Hierarchical blending complicates data prep and may not improve performance for all code sets. But the technique may be invaluable when a hierarchy is suitable and unseen codes are important to your purpose.
Even when rare or new codes are crucial, not all code groupings will be useful. Hierarchical blending may not work when codes are unrelated to the target response, are at the wrong level of granularity, or bias mean estimates.
Before using hierarchical encoding, it may be wise to test performance for holdout codes to assess whether a code set is suitable for this method. Sometimes using general groups with one-hot encodings may be safer. In other lucky cases, like that of standard NAICS codes in this dataset, hierarchical blending can provide a nice performance improvement.
In my day job, I see hierarchical categoricals all the time, and I often feel that these are not leveraged to their full potential. Therefore, I'm excited that hierarchical encoding has potential to help some models! In future work, I hope to examine information in entity and DGI embeddings, and possibly to explore more encoding methods.
Thank you so much to Daniele Micci-Barreca for inventing this methodology, and for very helpful suggestions!
References
[1] United States Census, North American Industry Classification System.
[2] B. Avanzi, G. Taylor, M. Wang and B. Wong, Machine Learning with High-Cardinality Categorical Features in Actuarial Applications (2023), arXiv:2301.12710v1.
[3] D. Micci-Barreca, Extending Target Encoding (2020), Towards Data Science.
[4] M. Li, A. Mickel and S. Taylor, Should This Loan be Approved or Denied?: A Large Dataset with Class Assignment Guidelines (2018), Journal of Statistics Education 26 (1).
[5] M. Toktogaraev, Should This Loan be Approved or Denied? (2020), Kaggle.
[6] V. Carey, GitHub Repository, https://github.com/vla6/Blog_gnn_naics.
[7] R. Colindres, Loan Default Prediction & Loan Parameter Optimizer (2023), Kaggle.
[8] M. B., Loans Risk Assessment (acc. 94.5%) (2021), Kaggle.
[9] R. Aryo, SBA Loan Approval Model (2023), Kaggle.
[10] F. Pedregosa et al, Scikit-learn: Machine Learning in Python (2011), JMLR 12, 2825–2830.
[11] P. Veličković et al., Deep Graph Infomax (2018) arXiv:1809.10341.
[12] L. van der Maaten and G. Hinton, Visualizing Data using t-SNE (2008), Journal of Machine Learning Research 9: 2579–2605.