Impossibility Theorems for Feature Attribution
- URL: http://arxiv.org/abs/2212.11870v3
- Date: Sun, 7 Jan 2024 23:15:30 GMT
- Title: Impossibility Theorems for Feature Attribution
- Authors: Blair Bilodeau, Natasha Jaques, Pang Wei Koh, Been Kim
- Abstract summary: We show that for moderately rich model classes, any feature attribution method can provably fail to improve on random guessing for inferring model behaviour.
Our results apply to common end-tasks such as characterizing local model behaviour, identifying spurious features, and algorithmic recourse.
- Score: 21.88229793890961
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite a sea of interpretability methods that can produce plausible
explanations, the field has also empirically seen many failure cases of such
methods. In light of these results, it remains unclear for practitioners how to
use these methods and choose between them in a principled way. In this paper,
we show that for moderately rich model classes (easily satisfied by neural
networks), any feature attribution method that is complete and linear -- for
example, Integrated Gradients and SHAP -- can provably fail to improve on
random guessing for inferring model behaviour. Our results apply to common
end-tasks such as characterizing local model behaviour, identifying spurious
features, and algorithmic recourse. One takeaway from our work is the
importance of concretely defining end-tasks: once such an end-task is defined,
a simple and direct approach of repeated model evaluations can outperform many
other complex feature attribution methods.
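The abstract's takeaway contrasts complete and linear attribution methods, such as Integrated Gradients, with a direct baseline of repeated model evaluations once an end-task is fixed. As a minimal sketch (not the paper's construction), the Python snippet below probes the local behaviour of a toy model by simply re-evaluating it under small per-feature perturbations, next to a bare-bones Integrated Gradients implementation; the toy model, perturbation size, zero baseline, and step count are all illustrative assumptions.

```python
import numpy as np

# Toy scalar-valued model and input point; any callable f: R^d -> R works here.
# The weights and the input are illustrative assumptions, not from the paper.
def f(x):
    return float(np.tanh(x @ np.array([1.0, -2.0, 0.5])))

x = np.array([0.3, -0.1, 0.8])

def local_sensitivity(f, x, eps=1e-2):
    """End-task-driven probe via repeated model evaluations:
    how much does the output move when feature i is nudged by eps?"""
    base = f(x)
    sens = np.zeros_like(x)
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] += eps
        sens[i] = (f(x_pert) - base) / eps
    return sens

def integrated_gradients(f, x, baseline=None, steps=64, eps=1e-4):
    """Minimal Integrated Gradients (a complete, linear attribution), using a
    zero baseline, numerical gradients, and a Riemann-sum path integral."""
    baseline = np.zeros_like(x) if baseline is None else baseline
    d = len(x)
    total = np.zeros_like(x)
    for alpha in np.linspace(0.0, 1.0, steps):
        point = baseline + alpha * (x - baseline)
        grad = np.array([
            (f(point + eps * np.eye(d)[i]) - f(point - eps * np.eye(d)[i])) / (2 * eps)
            for i in range(d)
        ])
        total += grad
    return (x - baseline) * total / steps

print("repeated-evaluation probe:", local_sensitivity(f, x))
print("integrated gradients:     ", integrated_gradients(f, x))
```

For a concretely defined end-task such as "which feature most changes the output near x", the direct probe answers with a handful of model evaluations, which is the kind of simple baseline the abstract argues can outperform more complex attribution methods.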
Related papers
- Understanding prompt engineering may not require rethinking generalization [56.38207873589642]
We show that the discrete nature of prompts, combined with a PAC-Bayes prior given by a language model, results in generalization bounds that are remarkably tight by the standards of the literature.
This work provides a possible justification for the widespread practice of prompt engineering.
arXiv Detail & Related papers (2023-10-06T00:52:48Z)
- An Additive Instance-Wise Approach to Multi-class Model Interpretation [53.87578024052922]
Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system.
Existing methods mainly focus on selecting explanatory input features, which follow either locally additive or instance-wise approaches.
This work exploits the strengths of both methods and proposes a global framework for learning local explanations simultaneously for multiple target classes.
arXiv Detail & Related papers (2022-07-07T06:50:27Z)
- Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning [21.931580762349096]
We introduce an algorithm that computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model.
We prove an information-theoretic, Bayesian regret bound for our algorithm that holds for any finite-horizon, episodic sequential decision-making problem.
arXiv Detail & Related papers (2022-06-04T23:36:38Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
arXiv Detail & Related papers (2021-10-20T12:25:22Z)
- Contrastive Learning for Fair Representations [50.95604482330149]
Trained classification models can unintentionally lead to biased representations and predictions.
Existing debiasing methods for classification models, such as adversarial training, are often expensive to train and difficult to optimise.
We propose a method for mitigating bias by incorporating contrastive learning, in which instances sharing the same class label are encouraged to have similar representations.
arXiv Detail & Related papers (2021-09-22T10:47:51Z)
- Network Estimation by Mixing: Adaptivity and More [2.3478438171452014]
We propose a mixing strategy that leverages available arbitrary models to improve their individual performances.
The proposed method is computationally efficient and almost tuning-free.
We show that the proposed method performs as well as the oracle estimate when the true model is included among the individual candidates.
arXiv Detail & Related papers (2021-06-05T05:17:04Z)
- An Empirical Comparison of Instance Attribution Methods for NLP [62.63504976810927]
We evaluate the degree to which different instance attribution methods agree with respect to the importance of training samples.
We find that simple retrieval methods yield training instances that differ from those identified via gradient-based methods.
arXiv Detail & Related papers (2021-04-09T01:03:17Z)