SHAP vs. ALE for Feature Interactions: Understanding Conflicting Results
Model Explainers Require Thoughtful Interpretation

In this article, I compare model explainability techniques for feature interactions. In a surprising twist, two commonly used tools, SHAP and ALE, produce opposing results.
I probably should not have been surprised. After all, explainability tools measure specific responses in distinct ways. Interpretation requires understanding test methodologies, data characteristics, and problem context. Just because something is called an explainer doesn't mean it generates an explanation, if you define an explanation as a human understanding how a model works.
This post focuses on explainability techniques for feature interactions. I use a common project dataset derived from real loans [1] and a typical model type (a boosted tree). Even in this everyday situation, explanations require thoughtful interpretation.
If methodology details are overlooked, explainability tools can impede understanding or even undermine efforts to ensure model fairness.
Below, I show disparate SHAP and ALE curves and demonstrate that the disagreement between the techniques arises from differences in the measured responses and in the feature perturbations performed by the tests. But first, I'll introduce some concepts.
Feature Interactions
Feature interactions occur when two variables act in concert, resulting in an effect that is different from the sum of their individual contributions. For example, the impact of a poor night's sleep on a test score would be greater the next day than a week later. In this case, a feature representing time would interact with, or modify, a sleep quality feature.
In a linear model, an interaction is expressed as the product of two features. Nonlinear machine learning models typically contain numerous interactions. In fact, interactions are fundamental to the logic of advanced machine learning models, yet many common explainability techniques focus on the contributions of isolated features. Methods for examining interactions include 2-way ALE plots, Friedman's H, partial dependence plots, and SHAP interaction values [2]. This blog explores two of these: ALE and SHAP.
ALE Plots
Accumulated Local Effects (ALE) is a technique for measuring feature effects without the distortions that can result from correlated features or unlikely feature combinations. Feature interactions can be visualized using two-way ALE plots. Two-way ALE plots are generated by first measuring the change in model output when one feature (i) is perturbed, for a fixed value of a second feature (j). Then, a similar measurement is taken at a slightly different value of j. The difference of the two measurements reveals how perturbing j affects the model response to changes in i. To reduce effects from unlikely feature combinations, measurements use only observations with values in a small window around the selected values of i and j.
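To make the mechanics concrete, here is a minimal sketch of the double difference computed for a single grid cell, using predicted default probability as the response. The function and variable names are mine, not PyALE's, and the package additionally accumulates and centers these local effects across the whole grid:

```python
def local_interaction_effect(model, X, i, j, i_lo, i_hi, j_lo, j_hi):
    """Second-order local effect for one (i, j) grid cell.

    Uses only observations whose values of features i and j fall inside
    the cell, then measures how moving feature i from the cell's lower
    to upper edge changes the prediction at the lower vs. upper edge of
    feature j. The double difference is the local interaction effect.
    """
    cell = X[X[i].between(i_lo, i_hi) & X[j].between(j_lo, j_hi)].copy()

    def mean_prediction(i_val, j_val):
        Xp = cell.copy()
        Xp[i], Xp[j] = i_val, j_val
        return model.predict_proba(Xp)[:, 1].mean()

    # Response to perturbing i, measured at the low and high edges of j
    delta_at_j_lo = mean_prediction(i_hi, j_lo) - mean_prediction(i_lo, j_lo)
    delta_at_j_hi = mean_prediction(i_hi, j_hi) - mean_prediction(i_lo, j_hi)

    # How the response to i changes as j changes: the interaction signal
    return delta_at_j_hi - delta_at_j_lo
```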
SHAP Interaction Values
Shapley values represent the amount of credit or blame each feature holds for the model output. "SHAP" refers to a set of methods for calculating Shapley values for machine learning models. SHAP calculations measure the change in model response with a feature set to its original value compared to a reference value. This marginal contribution of the feature is averaged over various combinations, or "coalitions", of the other features. Shapley coalitions are formed by substituting some feature values with values drawn at random from a reference dataset (usually the training data). Unlike ALE, the Shapley calculation involves perturbations of many features, not just the pair of interest, and values are calculated for every observation.
SHAP interaction values distribute model scores among all feature main effects and pairwise interactions [3]. For one observation, the interaction for a feature pair (i and j) is calculated by measuring the Shapley value for feature i given the original value of j. Then, j is replaced by a value drawn at random from the reference, and a new Shapley value for i is calculated. These two measurements are subtracted to quantify how feature j modifies the Shapley value of i.
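As a concrete sketch of how these values might be obtained for the model in this post (assuming a fitted tree model `model` and a feature DataFrame `X`, as trained in the next section), the shap package's TreeExplainer computes the full matrix of interaction values directly:

```python
import shap

# TreeExplainer supports scikit-learn gradient boosting models.
explainer = shap.TreeExplainer(model)

# One (n_features x n_features) matrix per observation: main effects on
# the diagonal, pairwise interactions (split symmetrically) off-diagonal.
# Depending on the shap version and model, a class dimension may be present.
interaction_values = explainer.shap_interaction_values(X)

# Interaction between interest rate and term, for every observation
i = X.columns.get_loc("int_rate")
j = X.columns.get_loc("term")
int_rate_term = interaction_values[:, i, j]
```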
Data and Model
I used the Lending Club Loan dataset, available through Kaggle [1]. The model predicts loan defaults based on features such as interest rate, loan term, borrower income, borrower home ownership status, and individual or joint borrower credit scores. Leveraging analyses done by others [1, 4–5], I selected 18 predictor features, and the response variable is a binary indicator of loan default. A boosted tree model was trained using Scikit-learn's GradientBoostingClassifier, which is compatible with the Python packages used for ALE plots (PyALE), SHAP values (SHAP), and Friedman's H (sklearn_gbmi). Code is available on GitHub [6].
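A minimal training sketch is below. The file name, the response column, and all feature names other than int_rate and term are placeholders; the actual feature selection, encoding, and hyperparameters are in the repository [6]:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("lending_club_loans.csv")  # hypothetical file name

# Placeholder list; the model uses 18 (already encoded) predictors
features = ["int_rate", "term", "annual_inc", "home_ownership"]
X = df[features]
y = df["default"]  # binary default indicator

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)
```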
SHAP and ALE Disagree for Influential Features
To examine interactions in my model, I generated SHAP dependence and 2-way ALE plots for pairs of features. For most pairs, ALE and SHAP plots were at least somewhat similar. But for one key interaction, interest rate and term, results conflicted:

The 2-way ALE indicates that longer loan terms in combination with high interest rate increase risk. But SHAP tells an opposite story; longer term is protective against defaults at high interest rate!
In this dataset, loan duration (term) is categorical with just two values, 36 and 60 months, while interest rate (int_rate) is continuous. Figure 1A and B show ALE and SHAP values, respectively, plotted on the same scale, with positive values indicating an increase in modeled default risk due to interaction. Although heatmaps are often used for 2-way ALE plots, I prefer line plots; these are also easier to compare with SHAP plots.
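For reference, the SHAP panel can be drawn as line plots grouped by term with a few lines of matplotlib, reusing the interaction values from the earlier sketch; binning interest rate into 20 buckets is my choice, not necessarily what was used for the figures:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumes `X` and `int_rate_term` from the earlier SHAP sketch.
plot_df = pd.DataFrame({
    "int_rate": X["int_rate"].values,
    "term": X["term"].values,
    "interaction": int_rate_term,
})
plot_df["rate_bin"] = pd.cut(plot_df["int_rate"], bins=20)

fig, ax = plt.subplots()
for term_value, grp in plot_df.groupby("term"):
    # Average the interaction values within each interest rate bin
    binned = grp.groupby("rate_bin", observed=True)["interaction"].mean()
    centers = [interval.mid for interval in binned.index]
    ax.plot(centers, binned.values, marker="o", label=f"term = {term_value} months")

ax.axhline(0, color="grey", linewidth=0.5)
ax.set_xlabel("Interest rate")
ax.set_ylabel("SHAP interaction value (int_rate : term)")
ax.legend()
plt.show()
```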
The contradictory results in Figure 1 were especially troubling to me because interest rate and term are the two most important features in the model by several measures (aggregated Shapley values, impurity, and permutation importance; see [6]). In addition, the term:interest rate interaction is large according to SHAP, ALE, and Friedman's H.
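The importance and interaction measures mentioned above can be reproduced roughly as follows. The sklearn_gbmi call is written from its documented h() helper and the column choices are assumptions, so treat this as a sketch rather than the exact code in [6]:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn_gbmi import h  # Friedman's H statistic for boosted models

# Aggregated Shapley values: mean |SHAP| per feature
shap_values = explainer.shap_values(X_test)
mean_abs_shap = np.abs(shap_values).mean(axis=0)

# Impurity-based importance, built into the fitted model
impurity_importance = model.feature_importances_

# Permutation importance on held-out data
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Friedman's H for the term / interest rate pair
h_term_int_rate = h(model, X_train, ["term", "int_rate"])
```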
So, I have two influential features with an important interaction, but SHAP and ALE show different effect directions. Does common sense help resolve the conflict? Here are some possible interpretations of the curves:
(ALE) A long-term, high-interest rate loan is especially risky.
(SHAP) A high interest rate is strongly predictive of default; term isn't very important when rates are high.
(The SHAP story also draws from the one-way ALE responses [6]. The negative interaction cancels out the one-way term response, and so the interaction can be interpreted as a loss of influence of the term feature.)
As a non-expert in lending, I find both accounts plausible. Understanding why these plots differed meant gaining a deeper understanding of these techniques, as well as examining simplified models.

SHAP vs. ALE – What Differences Are Important?
Both SHAP interactions and two-way ALE values measure differences in a model response when a feature, j, is modified, for a subset of data points with similar values for feature i.
Starting from the above statement, let's list some ways SHAP and ALE may differ:
1. The data points selected to perform the measurement.
2. The response being measured by the test.
3. How feature values are modified.
Item #1 seems an unlikely culprit. For Shapley, a measurement is done for each data point, and we use the original feature value. ALE considers a window around a value. The window size is based on data density, so higher-interest-rate points reflect a relatively large range of values, but for the "high interest rate" sections of both plots in Figure 1, we probably have similar enough observations.
Differences in items #2 and #3 could be important in explaining the discrepancies in the plots. For #2, SHAP and ALE test different model responses. ALE uses the raw model output, whereas SHAP distributes the model prediction across multiple features and examines the portion attributed to i.
For item #3, ALE perturbations involve substitutions of values for feature j only. But SHAP aggregates responses from many coalitions of model features; any variable may be perturbed. Replacement values are drawn at random from the training data, generally reflecting more typical values, which may be very different from the initial observation's features.
Rare Cases Generate the ALE Signal
After performing model simplification and other analyses (see code in [6]), I realized that the ALE test is responding to the model's predictions for cases with several risk factors. Below, I re-calculate the ALE and SHAP plots only for customers with incomes over $45,000 who are not renters:

When low-income and renter cases (about 50% of total) are excluded, the ALE signal almost completely disappears, while the SHAP curve is qualitatively unchanged.
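A sketch of the filtering is below; the column names and renter encoding are assumptions (see [6] for the exact criteria), and the plots are then regenerated on the subset exactly as before:

```python
import shap

# Keep higher-income, non-renter customers (roughly half of the data)
subset = df[(df["annual_inc"] > 45_000) & (df["home_ownership"] != "RENT")]
X_sub = subset[features]

# Re-run the SHAP interaction calculation on the subset; the 2-way ALE
# plot is regenerated on X_sub in the same way as for the full data.
interaction_sub = shap.TreeExplainer(model).shap_interaction_values(X_sub)
```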
The original ALE curve from Figure 1A can be reproduced with a simplified, single-tree model that involves only three features (interest rate, term, and annual income), as shown below:

Figure 3A contains low-population nodes with very high or low values (e.g., nodes 2 and 7). Node 7 is visited by rare customers who have low incomes, high interest rates, and long terms; these customers have very high default risk.
The ALE plots are dominated by effects arising from rare feature combinations. Node 7 represents a tiny number of loans, but when term is swapped during the ALE calculation, the model response changes dramatically. The 60-month customers move out of this node, decreasing risk, while (a much larger number of) 36-month customers move into this node, leading to a large signal.
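The simplified model can be built and inspected along these lines; the depth and other hyperparameters here are illustrative, not necessarily those used to produce Figure 3A:

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import plot_tree

# Single-tree, three-feature model that reproduces the ALE signal
simple_features = ["int_rate", "term", "annual_inc"]
simple_model = GradientBoostingClassifier(n_estimators=1, max_depth=3, random_state=0)
simple_model.fit(X_train[simple_features], y_train)

# Inspect the underlying regression tree and its node populations
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(simple_model.estimators_[0, 0], feature_names=simple_features,
          filled=True, ax=ax)
plt.show()
```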
SHAP Detects Systematic Effects in Complex Models
The SHAP signal in Figure 1B disappears in Figure 3B. Model complexity is key to the SHAP result. To reliably reproduce the original SHAP curve, I found that I needed ≥4 features, 20 trees, and depths >5 (see code in [6]).
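A sketch of the kind of complexity sweep I mean is below; `extra_features` is a hypothetical placeholder for the fourth (and further) encoded predictors, and the thresholds above come from the experiments in [6]:

```python
import shap
from sklearn.ensemble import GradientBoostingClassifier

extra_features = ["home_ownership_encoded"]  # hypothetical placeholder column
cols = simple_features + extra_features

for n_trees in (1, 5, 20):
    for depth in (3, 6, 8):
        m = GradientBoostingClassifier(
            n_estimators=n_trees, max_depth=depth, random_state=0
        )
        m.fit(X_train[cols], y_train)
        iv = shap.TreeExplainer(m).shap_interaction_values(X_test[cols])
        # ... compare the int_rate/term slice of `iv` against Figure 1B
```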
The SHAP curve in Figure 3B contains outliers above 30% interest rates (some of these are cropped in the figure; outlier values as high as ~0.4 occur). When the values are averaged, outliers included, the 36- and 60-month results are very similar and near zero (~0.001). The outliers are due to cases with coalitions that visit the extreme nodes 4 and 7. Model complexity reduces outliers: as feature count increases, drawing multiple unusual values from the reference data becomes less likely. Moreover, more coalitions are averaged in the calculation, diluting signals.
SHAP measurements de-emphasize rare feature combinations. SHAP coalitions can involve feature values very different from the original, whereas ALE calculations usually involve perturbations in a more restricted range. SHAP coalitions provide more coverage of the model, reflecting values generated by more nodes, especially higher-population nodes.
The extent of feature changes in SHAP calculations depends on whether an observation is unusual relative to the reference data. The flatness of the 36-month term curve in Figures 1B and 2B reflects the fact that most customers (75%) have 36-month loans. Therefore, random values pulled to generate SHAP coalitions for term are likely to leave term unchanged. Subtracting two similar curves results in a small SHAP interaction value.
In contrast, the 60-month term is farther from typical, and so generates SHAP signals. The negative value at high interest rates and a 60-month term indicates that the interest rate feature is more influential for the shorter, 36-month term. Most loans are 36 months, and most loans have moderate risk, so a high interest rate in that context is more of a surprise. For the 60-month term, a high interest rate is less surprising (interest rate and term have a Pearson correlation of ~0.4), so it may be expected that SHAP assigns less weight to the interest rate feature for long-term loans.
So, Which is Correct?
Earlier, I described two distinct stories suggested by the curves in Figure 1:
(ALE) A long-term, high-interest rate loan is especially risky.
(SHAP) A high interest rate is strongly predictive of default; term isn't very important when rates are high.
It seems that both these stories are true, but for different customers. For infrequent cases with multiple risk factors, the first interpretation is correct; interest rate and term combine to produce very large ALE responses. But for more typical, higher-interest-rate customers, interest rates capture most of the risk. Therefore, SHAP and ALE tests draw attention to different customers.
Why Does This Matter?
After applying explainability tools, we expect that we will increase our understanding of how a model works. We believe we will have a general sense of the model's decision process, or even that some patterns in the data may be revealed. These tests are used for quality control and trust building. When explanations align with expectations, stakeholders are reassured.
Explainability tools can provide many benefits, but they also have the potential to mislead or provide false reassurance.
Explainability tools are especially important in model fairness testing to avoid bias and discrimination. Interaction measures are crucial when feature bias is present or suspected [7]. SHAP's lack of response to rare feature combinations may be a concern because combinations of characteristics like sex, race, or age might be linked to adverse outcomes. Conversely, ALE may miss systematic effects because it perturbs a smaller number of features in a more limited range.
Final Thoughts
Model explainability packages are often described as "explainers" that output "explanations". I think it's more useful to use words like test or measurement. For example, "SHAP value" is better than "SHAP explanation" because there is some distance between package outputs and actual understanding of a complex model. I am trying to change how I use these terms as a reminder of this!
In medicine, diagnostic tests are ordered in specific circumstances and results are interpreted by experts. Often, more than one test is used to establish a diagnosis. Similarly, a deeper understanding of model explainability tools is needed to draw meaningful conclusions.
References
[1] N. George, All Lending Club loan data, https://www.kaggle.com/datasets/wordsforthewise/lending-club.
[2] C. Molnar, Interpretable Machine Learning: A Guide for Making Black Box Models Explainable (2023).
[3] S. M. Lundberg, G. G. Erion, and S-I. Lee, Consistent Individualized Feature Attribution for Tree Ensembles (2019), arXiv:1802.03888v3 [cs.LG].
[4] M. Gusarova, Feature Selection Techniques (2023), Kaggle.
[5] N. George, EDA With Python (2019), Kaggle.
[6] V. Carey, GitHub Repository, https://github.com/vla6/Blog_interactions.
[7] V. Carey, No Free Lunch with Feature Bias (2021), Towards Data Science.