Search Methods for Sufficient, Socially-Aligned Feature Importance
Explanations with In-Distribution Counterfactuals
- URL: http://arxiv.org/abs/2106.00786v1
- Date: Tue, 1 Jun 2021 20:36:48 GMT
- Title: Search Methods for Sufficient, Socially-Aligned Feature Importance
Explanations with In-Distribution Counterfactuals
- Authors: Peter Hase, Harry Xie, Mohit Bansal
- Abstract summary: Feature importance (FI) estimates are a popular form of explanation, and they are commonly created and evaluated by computing the change in model confidence caused by removing certain input features at test time.
We study several under-explored dimensions of FI-based explanations, providing conceptual and empirical improvements for this form of explanation.
- Score: 72.00815192668193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feature importance (FI) estimates are a popular form of explanation, and they
are commonly created and evaluated by computing the change in model confidence
caused by removing certain input features at test time. For example, in the
standard Sufficiency metric, only the top-k most important tokens are kept. In
this paper, we study several under-explored dimensions of FI-based
explanations, providing conceptual and empirical improvements for this form of
explanation. First, we advance a new argument for why it can be problematic to
remove features from an input when creating or evaluating explanations: the
fact that these counterfactual inputs are out-of-distribution (OOD) to models
implies that the resulting explanations are socially misaligned. The crux of
the problem is that the model prior and random weight initialization influence
the explanations (and explanation metrics) in unintended ways. To resolve this
issue, we propose a simple alteration to the model training process, which
results in more socially aligned explanations and metrics. Second, we compare
five approaches for removing features from model inputs. We find that
some methods produce more OOD counterfactuals than others, and we make
recommendations for selecting a feature-replacement function. Finally, we
introduce four search-based methods for identifying FI explanations and compare
them to strong baselines, including LIME, Integrated Gradients, and random
search. On experiments with six diverse text classification datasets, we find
that the only method that consistently outperforms random search is a Parallel
Local Search that we introduce. Improvements over the second-best method are as
large as 5.4 points for Sufficiency and 17 points for Comprehensiveness. All
supporting code is publicly available at
https://github.com/peterbhase/ExplanationSearch.
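For readers unfamiliar with the erasure-based metrics the abstract refers to, the sketch below illustrates how Sufficiency and Comprehensiveness are commonly computed in the ERASER style: keep (or remove) the top-k tokens ranked by the FI estimate, re-run the model, and measure the change in confidence for the originally predicted class. This is a minimal illustration, not the paper's implementation; the model_confidence callable and the "[MASK]" replacement token are assumptions standing in for whatever classifier and feature-replacement function are used.

    from typing import Callable, List

    def top_k_indices(importance: List[float], k: int) -> set:
        # Indices of the k tokens with the highest importance scores.
        return set(sorted(range(len(importance)),
                          key=lambda i: importance[i], reverse=True)[:k])

    def sufficiency(tokens: List[str], importance: List[float], k: int,
                    model_confidence: Callable[[List[str]], float],
                    mask_token: str = "[MASK]") -> float:
        # Drop in confidence when only the top-k most important tokens are kept;
        # smaller values mean the selected tokens are closer to sufficient.
        keep = top_k_indices(importance, k)
        reduced = [t if i in keep else mask_token for i, t in enumerate(tokens)]
        return model_confidence(tokens) - model_confidence(reduced)

    def comprehensiveness(tokens: List[str], importance: List[float], k: int,
                          model_confidence: Callable[[List[str]], float],
                          mask_token: str = "[MASK]") -> float:
        # Drop in confidence when the top-k most important tokens are removed;
        # larger values mean the selected tokens carried more of the evidence.
        remove = top_k_indices(importance, k)
        reduced = [mask_token if i in remove else t for i, t in enumerate(tokens)]
        return model_confidence(tokens) - model_confidence(reduced)

The abstract's concern about out-of-distribution counterfactuals applies directly to such code: the masked inputs passed to model_confidence may lie outside the training distribution, which is what the proposed training alteration and the choice of feature-replacement function are meant to address.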
Related papers
- ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness [67.49087159888298]
ReCEval is a framework that evaluates reasoning chains via two key properties: correctness and informativeness.
We show that ReCEval effectively identifies various error types and yields notable improvements compared to prior methods.
arXiv Detail & Related papers (2023-04-21T02:19:06Z)
- Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting [80.9896041501715]
Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance.
This paper tackles the problem of how to optimize explanation-infused prompts in a black-box fashion.
arXiv Detail & Related papers (2023-02-09T18:02:34Z)
- Don't Explain Noise: Robust Counterfactuals for Randomized Ensembles [50.81061839052459]
We formalize the generation of robust counterfactual explanations as a probabilistic problem.
We show the link between the robustness of ensemble models and the robustness of base learners.
Our method achieves high robustness with only a small increase in the distance from counterfactual explanations to their initial observations.
arXiv Detail & Related papers (2022-05-27T17:28:54Z)
- Double Trouble: How to not explain a text classifier's decisions using counterfactuals synthesized by masked language models? [34.18339528128342]
An underlying principle behind dozens of explanation methods is to take the prediction difference between before-and-after an input feature is removed as its attribution.
A recent method called Input Marginalization (IM) uses BERT to replace a token, yielding more plausible counterfactuals.
However, our rigorous evaluation using five metrics on three datasets found IM explanations to be consistently more biased, less accurate, and less plausible than those derived from simply deleting a word.
arXiv Detail & Related papers (2021-10-22T17:22:05Z)
- Contrastive Explanations for Model Interpretability [77.92370750072831]
We propose a methodology to produce contrastive explanations for classification models.
Our method is based on projecting model representation to a latent space.
Our findings shed light on the ability of label-contrastive explanations to provide a more accurate and finer-grained interpretability of a model's decision.
arXiv Detail & Related papers (2021-03-02T00:36:45Z)
- Towards Unifying Feature Attribution and Counterfactual Explanations: Different Means to the Same End [17.226134854746267]
We present a method to generate feature attribution explanations from a set of counterfactual examples.
We show how counterfactual examples can be used to evaluate the goodness of an attribution-based explanation in terms of its necessity and sufficiency.
arXiv Detail & Related papers (2020-11-10T05:41:43Z)
- Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? [97.77183117452235]
We carry out human subject tests to isolate the effect of algorithmic explanations on model interpretability.
Clear evidence of method effectiveness is found in very few cases.
Our results provide the first reliable and comprehensive estimates of how explanations influence simulatability.
arXiv Detail & Related papers (2020-05-04T20:35:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.