Towards Unifying Feature Attribution and Counterfactual Explanations:
Different Means to the Same End
- URL: http://arxiv.org/abs/2011.04917v3
- Date: Sat, 29 May 2021 17:49:39 GMT
- Title: Towards Unifying Feature Attribution and Counterfactual Explanations:
Different Means to the Same End
- Authors: Ramaravind Kommiya Mothilal and Divyat Mahajan and Chenhao Tan and
Amit Sharma
- Abstract summary: We present a method to generate feature attribution explanations from a set of counterfactual examples.
We show how counterfactual examples can be used to evaluate the goodness of an attribution-based explanation in terms of its necessity and sufficiency.
- Score: 17.226134854746267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feature attributions and counterfactual explanations are popular approaches
to explain an ML model. The former assigns an importance score to each input
feature, while the latter provides input examples with minimal changes to alter
the model's predictions. To unify these approaches, we provide an
interpretation based on the actual causality framework and present two key
results in terms of their use. First, we present a method to generate feature
attribution explanations from a set of counterfactual examples. These feature
attributions convey how important a feature is to changing the classification
outcome of a model, in particular whether a subset of features is necessary
and/or sufficient for that change, something attribution-based methods are unable
to provide. Second, we show how counterfactual examples can be used to evaluate
the goodness of an attribution-based explanation in terms of its necessity and
sufficiency. As a result, we highlight the complementarity of these two
approaches. Our evaluation on three benchmark datasets - Adult-Income,
LendingClub, and German-Credit - confirms the complementarity. Feature
attribution methods like LIME and SHAP and counterfactual explanation methods
like Wachter et al. and DiCE often do not agree on feature importance rankings.
In addition, by restricting the features that can be modified for generating
counterfactual examples, we find that the top-k features from LIME or SHAP are
often neither necessary nor sufficient explanations of a model's prediction.
Finally, we present a case study of different explanation methods on a
real-world hospital triage problem.
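To make the first result concrete, below is a minimal Python sketch of turning a set of counterfactual examples into feature attribution scores. It uses a simple frequency-of-change heuristic (features that must change often to flip the prediction score higher); the function name, the toy data, and the heuristic itself are illustrative assumptions and omit the paper's full necessity/sufficiency analysis.

```python
import pandas as pd

def cf_feature_attribution(query: pd.Series, counterfactuals: pd.DataFrame) -> pd.Series:
    """Score each feature by how often it changes across counterfactuals
    that flip the model's prediction (a simplified, illustrative heuristic)."""
    # True wherever a counterfactual differs from the original query instance.
    changed = counterfactuals.ne(query, axis="columns")
    # Importance = fraction of counterfactuals in which the feature was changed.
    return changed.mean(axis=0).sort_values(ascending=False)

# Hypothetical query instance and three counterfactuals for an income classifier.
query = pd.Series({"age": 35, "hours_per_week": 40, "education": "Bachelors"})
cfs = pd.DataFrame([
    {"age": 35, "hours_per_week": 60, "education": "Bachelors"},
    {"age": 35, "hours_per_week": 55, "education": "Masters"},
    {"age": 47, "hours_per_week": 40, "education": "Masters"},
])
print(cf_feature_attribution(query, cfs))
# hours_per_week and education change in 2/3 counterfactuals, age in 1/3.
```

In the same spirit, the necessity or sufficiency of a feature subset can be probed by checking whether valid counterfactuals still exist when that subset is frozen, or when only that subset is allowed to vary.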
Related papers
- When factorization meets argumentation: towards argumentative explanations [0.0]
We propose a novel model that combines factorization-based methods with argumentation frameworks (AFs).
Our framework seamlessly incorporates side information, such as user contexts, leading to more accurate predictions.
arXiv Detail & Related papers (2024-05-13T19:16:28Z) - Reckoning with the Disagreement Problem: Explanation Consensus as a
Training Objective [5.949779668853556]
Post hoc feature attribution is a family of methods for giving each feature in an input a score corresponding to its influence on a model's output.
A major limitation of this family of explainers is that they can disagree on which features are more important than others.
Alongside the standard accuracy term, we introduce an additional loss term that measures the difference in feature attributions between a pair of explainers.
On three datasets, we observe that training a model with this loss term improves explanation consensus on unseen data, and that consensus also improves between explainers other than those used in the loss term.
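As a rough illustration of such a consensus objective, the sketch below adds a disagreement penalty between two simple gradient-based explainers (plain input gradients and gradient-times-input) to a standard cross-entropy loss; the choice of explainers, the `consensus_loss` helper, and the `lam` weight are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def consensus_loss(model, x, y, lam=0.1):
    """Cross-entropy plus a penalty on the disagreement between two simple
    gradient-based attributions (illustrative stand-ins for the paper's explainers)."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    ce = F.cross_entropy(logits, y)

    # Explainer 1: gradient of the label-class score w.r.t. the input.
    score = logits.gather(1, y.unsqueeze(1)).sum()
    grads = torch.autograd.grad(score, x, create_graph=True)[0]
    attr_a = grads
    # Explainer 2: gradient * input.
    attr_b = grads * x

    # Disagreement = mean absolute difference between normalized attributions.
    a = F.normalize(attr_a.flatten(1), dim=1)
    b = F.normalize(attr_b.flatten(1), dim=1)
    disagreement = (a - b).abs().mean()

    return ce + lam * disagreement
```

Minimizing this combined loss during training nudges the model toward behavior on which both explainers agree, which is the spirit of the consensus objective described above.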
arXiv Detail & Related papers (2023-03-23T14:35:37Z) - Counterfactual Explanations for Support Vector Machine Models [1.933681537640272]
We show how to find counterfactual explanations with the purpose of increasing model interpretability.
We also build a support vector machine model to predict whether law students will pass the Bar exam using protected features.
arXiv Detail & Related papers (2022-12-14T17:13:22Z) - Logical Satisfiability of Counterfactuals for Faithful Explanations in
NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z) - Search Methods for Sufficient, Socially-Aligned Feature Importance
Explanations with In-Distribution Counterfactuals [72.00815192668193]
Feature importance (FI) estimates are a popular form of explanation, and they are commonly created and evaluated by computing the change in model confidence caused by removing certain input features at test time.
We study several under-explored dimensions of FI-based explanations, providing conceptual and empirical improvements for this form of explanation.
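A minimal sketch of the removal-based feature importance estimates described above, assuming that "removing" a feature means replacing it with a baseline value (e.g., a training mean); the cited work is concerned with more careful, in-distribution replacements, so treat `removal_importance` as an illustrative baseline only.

```python
import numpy as np

def removal_importance(predict_proba, x, baseline, target_class):
    """Leave-one-out importance: replace each feature with a baseline value
    and record the drop in model confidence for the target class."""
    x = np.asarray(x, dtype=float)
    base_conf = predict_proba(x[None, :])[0, target_class]
    scores = np.zeros(len(x))
    for i in range(len(x)):
        x_removed = x.copy()
        x_removed[i] = baseline[i]            # crude stand-in for "removing" feature i
        conf = predict_proba(x_removed[None, :])[0, target_class]
        scores[i] = base_conf - conf          # larger confidence drop = more important
    return scores
```

Here `predict_proba` is any scikit-learn-style probability function; swapping the baseline substitution for in-distribution counterfactual values is the kind of refinement the cited paper studies.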
arXiv Detail & Related papers (2021-06-01T20:36:48Z) - Beyond Trivial Counterfactual Explanations with Diverse Valuable
Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z) - Contrastive Explanations for Model Interpretability [77.92370750072831]
We propose a methodology to produce contrastive explanations for classification models.
Our method is based on projecting model representations to a latent space.
Our findings shed light on the ability of label-contrastive explanations to provide a more accurate and finer-grained interpretability of a model's decision.
arXiv Detail & Related papers (2021-03-02T00:36:45Z) - The Struggles of Feature-Based Explanations: Shapley Values vs. Minimal
Sufficient Subsets [61.66584140190247]
We show that feature-based explanations pose problems even for explaining trivial models.
We show that two popular classes of explainers, Shapley explainers and minimal sufficient subsets explainers, target fundamentally different types of ground-truth explanations.
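To make the contrast concrete, the brute-force sketch below searches for a minimal sufficient subset: the smallest set of features that, when kept at the instance's values while all other features are set to a baseline, still yields the target prediction. The helper and the baseline-replacement convention are illustrative assumptions, not the cited paper's exact definitions.

```python
from itertools import combinations
import numpy as np

def minimal_sufficient_subset(predict, x, baseline, target_class):
    """Smallest feature subset that preserves the target prediction when all
    other features are replaced by a baseline (brute force; only for few features)."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    n = len(x)
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            probe = baseline.copy()
            probe[list(subset)] = x[list(subset)]   # keep only this subset of features
            if predict(probe[None, :])[0] == target_class:
                return set(subset)
    return None  # no subset reproduces the target prediction
```

Shapley values, by contrast, average a feature's marginal contribution over all subsets, so the two notions can rank features very differently on the same model.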
arXiv Detail & Related papers (2020-09-23T09:45:23Z) - Deducing neighborhoods of classes from a fitted model [68.8204255655161]
This article presents a new kind of interpretable machine learning method.
It helps to understand how a classification model partitions the feature space into predicted classes, using quantile shifts.
Real data points (or specific points of interest) are used, and the change in the prediction after slightly raising or decreasing specific features is observed.
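A rough sketch of the quantile-shift probing described above, assuming a tabular setting: a single feature of a real data point is nudged up or down by a small quantile step within the training distribution and the prediction is re-checked. The helper and the `delta` step size are illustrative assumptions.

```python
import numpy as np

def quantile_shift_probe(predict, x, feature_idx, train_col, delta=0.05):
    """Shift one feature up/down by a small quantile step and report the
    resulting predictions (illustrative sketch of quantile-shift probing)."""
    x = np.asarray(x, dtype=float)
    # Empirical quantile of the current feature value within the training column.
    q = float((train_col <= x[feature_idx]).mean())
    probes = {}
    for direction, q_new in (("down", max(q - delta, 0.0)), ("up", min(q + delta, 1.0))):
        x_shift = x.copy()
        x_shift[feature_idx] = np.quantile(train_col, q_new)   # shifted feature value
        probes[direction] = predict(x_shift[None, :])[0]
    return probes  # e.g. {"down": 0, "up": 1} signals a nearby class boundary
```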
arXiv Detail & Related papers (2020-09-11T16:35:53Z) - Adversarial Infidelity Learning for Model Interpretation [43.37354056251584]
We propose a Model-agnostic Effective Efficient Direct (MEED) instance-wise feature selection (IFS) framework for model interpretation.
Our framework mitigates concerns about sanity, shortcuts, model identifiability, and information transmission.
Our AIL mechanism can help learn the desired conditional distribution between selected features and targets.
arXiv Detail & Related papers (2020-06-09T16:27:17Z)