How to safely discard features based on aggregate SHAP values
- URL: http://arxiv.org/abs/2503.23111v1
- Date: Sat, 29 Mar 2025 15:07:30 GMT
- Title: How to safely discard features based on aggregate SHAP values
- Authors: Robi Bhattacharjee, Karolin Frohnapfel, Ulrike von Luxburg
- Abstract summary: Recently, SHAP has been increasingly used for global insights. We ask whether small aggregate SHAP values necessarily imply that the corresponding feature does not affect the function. We show that a small aggregate SHAP value, computed over the extended support, implies that we can safely discard the corresponding feature.
- Score: 12.610250597173437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: SHAP is one of the most popular local feature-attribution methods. Given a function f and an input x, it quantifies each feature's contribution to f(x). Recently, SHAP has been increasingly used for global insights: practitioners average the absolute SHAP values over many data points to compute global feature importance scores, which are then used to discard unimportant features. In this work, we investigate the soundness of this practice by asking whether small aggregate SHAP values necessarily imply that the corresponding feature does not affect the function. Unfortunately, the answer is no: even if the i-th SHAP value is 0 on the entire data support, there exist functions that clearly depend on Feature i. The issue is that computing SHAP values involves evaluating f on points outside of the data support, where f can be strategically designed to mask its dependence on Feature i. To address this, we propose to aggregate SHAP values over the extended support, which is the product of the marginals of the underlying distribution. With this modification, we show that a small aggregate SHAP value implies that we can safely discard the corresponding feature. We then extend our results to KernelSHAP, the most popular method to approximate SHAP values in practice. We show that if KernelSHAP is computed over the extended distribution, a small aggregate value justifies feature removal. This result holds independently of whether KernelSHAP accurately approximates true SHAP values, making it one of the first theoretical results to characterize the KernelSHAP algorithm itself. Our findings have both theoretical and practical implications. We introduce the Shapley Lie algebra, which offers algebraic insights that may enable a deeper investigation of SHAP and we show that randomly permuting each column of the data matrix enables safely discarding features based on aggregate SHAP and KernelSHAP values.
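The key construction in the abstract is the extended support: the product of the marginals of the data distribution, which the authors note can be sampled by randomly permuting each column of the data matrix independently. A minimal NumPy sketch of that step is below; `extended_support_sample` and `aggregate_shap` are hypothetical helper names (not from the paper), and the aggregate score is simply the mean absolute SHAP value per feature, as described in the abstract.

```python
import numpy as np

def extended_support_sample(X, rng=None):
    """Permute each column of X independently.

    The rows of the result are draws from the empirical product of the
    marginals, i.e. the "extended support" over which the paper proposes
    to aggregate SHAP values before discarding features.
    """
    rng = np.random.default_rng(rng)
    X_ext = X.copy()
    for j in range(X.shape[1]):
        X_ext[:, j] = rng.permutation(X_ext[:, j])
    return X_ext

def aggregate_shap(shap_values):
    """Mean absolute SHAP value per feature (global importance score)."""
    return np.abs(shap_values).mean(axis=0)

# Example: two strongly correlated features. Independent column
# permutation preserves each marginal but destroys the dependence.
rng = np.random.default_rng(0)
x0 = rng.normal(size=1000)
X = np.column_stack([x0, x0 + 0.01 * rng.normal(size=1000)])
X_ext = extended_support_sample(X, rng=1)

# Each column's empirical marginal is unchanged...
assert np.allclose(np.sort(X_ext[:, 0]), np.sort(X[:, 0]))
# ...but the cross-column correlation is (approximately) removed.
print(np.corrcoef(X[:, 0], X[:, 1])[0, 1])      # close to 1
print(np.corrcoef(X_ext[:, 0], X_ext[:, 1])[0, 1])  # close to 0
```

In this sketch, SHAP values would be computed (e.g. with KernelSHAP) on `X_ext` rather than `X`, and features with small `aggregate_shap` scores could then be discarded, per the paper's safety result.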
Related papers
- Statistical Inference for Temporal Difference Learning with Linear Function Approximation [62.69448336714418]
We study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation. First, we derive a novel high-dimensional probability convergence guarantee that depends explicitly on the variance and holds under weak conditions. We further establish refined high-dimensional Berry-Esseen bounds over the class of convex sets that guarantee faster rates than those in the literature.
arXiv Detail & Related papers (2024-10-21T15:34:44Z) - Provably Accurate Shapley Value Estimation via Leverage Score Sampling [12.201705893125775]
We introduce Leverage SHAP, a lightweight modification of Kernel SHAP that provides provably accurate Shapley value estimates with just $O(n \log n)$ model evaluations.
Our approach takes advantage of a connection between Shapley value estimation and active learning by employing leverage score sampling, a powerful regression tool.
arXiv Detail & Related papers (2024-10-02T18:15:48Z) - Statistical Significance of Feature Importance Rankings [3.8642937395065124]
We devise techniques that ensure the most important features are correct with high-probability guarantees. These techniques assess the set of $K$ top-ranked features, as well as the order of its elements. We then introduce two efficient sampling algorithms that identify the $K$ most important features, perhaps in order, with probability exceeding $1-\alpha$.
arXiv Detail & Related papers (2024-01-28T23:14:51Z) - The Distributional Uncertainty of the SHAP score in Explainable Machine Learning [2.655371341356892]
We propose a principled framework for reasoning on SHAP scores under unknown entity population distributions.
We study the basic problems of finding maxima and minima of this function, which allows us to determine tight ranges for the SHAP scores of all features.
arXiv Detail & Related papers (2024-01-23T13:04:02Z) - Fast Shapley Value Estimation: A Unified Approach [71.92014859992263]
We propose a straightforward and efficient Shapley estimator, SimSHAP, by eliminating redundant techniques.
In our analysis of existing approaches, we observe that estimators can be unified as a linear transformation of randomly summed values from feature subsets.
Our experiments validate the effectiveness of our SimSHAP, which significantly accelerates the computation of accurate Shapley values.
arXiv Detail & Related papers (2023-11-02T06:09:24Z) - A $k$-additive Choquet integral-based approach to approximate the SHAP values for local interpretability in machine learning [8.637110868126546]
This paper aims at providing some interpretability for machine learning models based on Shapley values.
A SHAP-based method called Kernel SHAP adopts an efficient strategy that approximates such values with less computational effort.
The obtained results attest that our proposal needs less computations on coalitions of attributes to approximate the SHAP values.
arXiv Detail & Related papers (2022-11-03T22:34:50Z) - On the Eigenvalues of Global Covariance Pooling for Fine-grained Visual Recognition [65.67315418971688]
We show that truncating small eigenvalues of the Global Covariance Pooling (GCP) can attain smoother gradient.
However, on fine-grained datasets, truncating the small eigenvalues makes the model fail to converge.
Inspired by this observation, we propose a network branch dedicated to magnifying the importance of small eigenvalues.
arXiv Detail & Related papers (2022-05-26T11:41:36Z) - Threading the Needle of On and Off-Manifold Value Functions for Shapley Explanations [40.95261379462059]
We formalize the desiderata of value functions that respect both the model and the data manifold in a set of axioms.
We show that there exists a unique value function that satisfies these axioms.
arXiv Detail & Related papers (2022-02-24T06:22:34Z) - Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z) - Piecewise Linear Regression via a Difference of Convex Functions [50.89452535187813]
We present a new piecewise linear regression methodology that utilizes fitting a difference of convex functions (DC functions) to the data.
We empirically validate the method, showing it to be practically implementable, and to have comparable performance to existing regression/classification methods on real-world datasets.
arXiv Detail & Related papers (2020-07-05T18:58:47Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z) - Interpretable feature subset selection: A Shapley value based approach [1.511944009967492]
We introduce the notion of classification game, a cooperative game with features as players and hinge loss based characteristic function.
Our major contribution is ($\star$) to show that, for any dataset, the threshold 0 on the SVEA value identifies a feature subset whose joint interactions for label prediction are significant.
arXiv Detail & Related papers (2020-01-12T16:27:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.