Fairwashing Explanations with Off-Manifold Detergent
- URL: http://arxiv.org/abs/2007.09969v1
- Date: Mon, 20 Jul 2020 09:42:06 GMT
- Title: Fairwashing Explanations with Off-Manifold Detergent
- Authors: Christopher J. Anders, Plamen Pasliev, Ann-Kathrin Dombrowski,
Klaus-Robert Müller and Pan Kessel
- Abstract summary: Explanation methods promise to make black-box classifiers more transparent.
We show both theoretically and experimentally that these hopes are presently unfounded.
We propose a modification of existing explanation methods which makes them significantly more robust.
- Score: 4.934817254755008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explanation methods promise to make black-box classifiers more transparent.
As a result, it is hoped that they can act as proof for a sensible, fair and
trustworthy decision-making process of the algorithm and thereby increase its
acceptance by the end-users. In this paper, we show both theoretically and
experimentally that these hopes are presently unfounded. Specifically, we show
that, for any classifier $g$, one can always construct another classifier
$\tilde{g}$ which has the same behavior on the data (same train, validation,
and test error) but has arbitrarily manipulated explanation maps. We derive
this statement theoretically using differential geometry and demonstrate it
experimentally for various explanation methods, architectures, and datasets.
Motivated by our theoretical insights, we then propose a modification of
existing explanation methods which makes them significantly more robust.
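The construction is concrete enough to sketch. Below is a minimal, hypothetical PyTorch version of the manipulation described above: fine-tune a copy of a classifier so that its outputs on the data stay nearly fixed while its gradient (saliency) map is pushed toward an arbitrary target map. The names `fairwash`, `target_map` and the weight `gamma` are illustrative assumptions, not the authors' code.

```python
import copy
import torch

def fairwash(model, loader, target_map, gamma=100.0, lr=1e-4, epochs=1):
    """Return a copy of `model` with (nearly) unchanged outputs on the data
    but gradient explanations steered toward `target_map(x)`."""
    g_tilde = copy.deepcopy(model)  # the manipulated classifier g~
    opt = torch.optim.Adam(g_tilde.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:
            x = x.clone().requires_grad_(True)
            out_ref = model(x).detach()          # original predictions, frozen
            out = g_tilde(x)
            # gradient ("saliency") explanation of the originally predicted class
            score = out.gather(1, out_ref.argmax(1, keepdim=True)).sum()
            expl = torch.autograd.grad(score, x, create_graph=True)[0]
            # term 1: keep behavior on the data; term 2: steer the explanation
            loss = ((out - out_ref) ** 2).mean() \
                 + gamma * ((expl - target_map(x)) ** 2).mean()
            opt.zero_grad()
            loss.backward()  # double backprop; smooth activations (e.g. softplus) behave best
            opt.step()
    return g_tilde
```

Nothing constrains the fine-tuned copy away from the data, which is why the manipulation can live off-manifold without affecting train, validation, or test error.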
Related papers
- Abductive Commonsense Reasoning Exploiting Mutually Exclusive
Explanations [118.0818807474809]
Abductive reasoning aims to find plausible explanations for an event.
Existing approaches for abductive reasoning in natural language processing often rely on manually generated annotations for supervision.
This work proposes an approach for abductive commonsense reasoning that exploits the fact that only a subset of explanations is correct for a given context.
arXiv Detail & Related papers (2023-05-24T01:35:10Z)
- Evaluating the Robustness of Interpretability Methods through Explanation Invariance and Equivariance [72.50214227616728]
Interpretability methods are valuable only if their explanations faithfully describe the explained model.
We consider neural networks whose predictions are invariant under a specific symmetry group.
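As a generic, hedged illustration of such a check (not the paper's exact metric): compare the explanation of a transformed input with the explanation of the original input, over all elements of the symmetry group.

```python
import torch
import torch.nn.functional as F

def explanation_invariance(explain, model, x, group_actions):
    """`explain(model, x)` returns an attribution shaped like `x`;
    `group_actions` are callables t with model(t(x)) == model(x)."""
    e_ref = explain(model, x).flatten()
    sims = [
        F.cosine_similarity(explain(model, t(x)).flatten(), e_ref, dim=0)
        for t in group_actions
    ]
    return torch.stack(sims).mean()  # 1.0 = explanations fully invariant
```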
arXiv Detail & Related papers (2023-04-13T17:59:03Z)
- Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting [80.9896041501715]
Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance.
This paper tackles the problem of how to optimize explanation-infused prompts in a black-box fashion.
arXiv Detail & Related papers (2023-02-09T18:02:34Z)
- Probing Classifiers are Unreliable for Concept Removal and Detection [18.25734277357466]
Neural network models trained on text data have been found to encode undesirable linguistic or sensitive concepts in their representation.
Recent work has proposed post-hoc and adversarial methods to remove such unwanted concepts from a model's representation.
We show that these methods can be counter-productive, and in the worst case may end up destroying all task-relevant features.
arXiv Detail & Related papers (2022-07-08T23:15:26Z)
- The Manifold Hypothesis for Gradient-Based Explanations [55.01671263121624]
Gradient-based explanation algorithms are often assumed to provide perceptually-aligned explanations.
We show that the more a feature attribution is aligned with the tangent space of the data, the more perceptually-aligned it tends to be.
We suggest that explanation algorithms should actively strive to align their explanations with the data manifold.
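A rough sketch of how such an alignment can be measured, assuming the tangent space is estimated by local PCA over nearest neighbours (the function and parameter names are illustrative; the paper works with generative models and analytically known manifolds):

```python
import numpy as np

def tangent_alignment(attribution, x, data, k=50, dim=10):
    """Fraction of the attribution's norm lying in the estimated tangent space at x."""
    # estimate the tangent space at x from its k nearest neighbours
    neighbours = data[np.argsort(np.linalg.norm(data - x, axis=1))[:k]]
    centered = neighbours - neighbours.mean(axis=0)
    # top right-singular vectors span the local principal (tangent) directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:dim]                          # (dim, n_features)
    projected = basis.T @ (basis @ attribution)
    return np.linalg.norm(projected) / np.linalg.norm(attribution)
```

A value near 1 means the attribution lies almost entirely in the tangent space and, by the paper's hypothesis, should look more perceptually aligned.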
arXiv Detail & Related papers (2022-06-15T08:49:24Z)
- Don't Explain Noise: Robust Counterfactuals for Randomized Ensembles [50.81061839052459]
We formalize the generation of robust counterfactual explanations as a probabilistic problem.
We show the link between the robustness of ensemble models and the robustness of base learners.
Our method achieves high robustness with only a small increase in the distance from counterfactual explanations to their initial observations.
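One simplified way to see the ensemble/base-learner link in code (a sketch assuming a scikit-learn style forest, not the paper's probabilistic formulation): a candidate counterfactual is more robust the larger the fraction of base learners that already vote for the target class.

```python
import numpy as np

def counterfactual_support(forest, x_cf, target):
    """Fraction of base learners classifying the counterfactual as `target`."""
    # individual sklearn trees predict class *indices*, so map the target label
    target_idx = list(forest.classes_).index(target)
    votes = [int(tree.predict(x_cf.reshape(1, -1))[0]) == target_idx
             for tree in forest.estimators_]
    return np.mean(votes)

def is_robust(forest, x_cf, target, tau=0.9):
    # `tau` is an illustrative robustness threshold
    return counterfactual_support(forest, x_cf, target) >= tau
```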
arXiv Detail & Related papers (2022-05-27T17:28:54Z)
- Provable concept learning for interpretable predictions using variational inference [7.0349768355860895]
In safety critical applications, practitioners are reluctant to trust neural networks when no interpretable explanations are available.
We propose a probabilistic modeling framework to derive (C)oncept (L)earning and (P)rediction (CLAP).
We prove that our method is able to identify such interpretable concepts while attaining optimal classification accuracy.
arXiv Detail & Related papers (2022-04-01T14:51:38Z)
- Causality-based Counterfactual Explanation for Classification Models [11.108866104714627]
We propose a prototype-based counterfactual explanation framework (ProCE)
ProCE is capable of preserving the causal relationship underlying the features of the counterfactual data.
In addition, we design a novel gradient-free optimization based on the multi-objective genetic algorithm that generates the counterfactual explanations.
arXiv Detail & Related papers (2021-05-03T09:25:59Z)
- Explainers in the Wild: Making Surrogate Explainers Robust to Distortions through Perception [77.34726150561087]
We propose a methodology to evaluate the effect of distortions in explanations by embedding perceptual distances.
We generate explanations for images in the ImageNet-C dataset and demonstrate how using perceptual distances in the surrogate explainer creates more coherent explanations for the distorted and reference images.
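The core idea is easy to sketch: when fitting a LIME-style local surrogate, weight the perturbed samples by a perceptual distance to the reference image rather than a plain pixel-space one. Here `perceptual_distance` stands in for any such metric (e.g. an LPIPS-like model); everything else is a standard weighted linear fit, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def surrogate_weights(x_ref, perturbed, perceptual_distance, width=0.25):
    # kernel weights from *perceptual* (not pixel-space) distances
    d = np.array([perceptual_distance(x_ref, z) for z in perturbed])
    return np.exp(-(d ** 2) / width ** 2)

def fit_surrogate(features, predictions, weights):
    # weighted linear surrogate; coefficients act as local feature importances
    lin = Ridge(alpha=1.0)
    lin.fit(features, predictions, sample_weight=weights)
    return lin.coef_
```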
arXiv Detail & Related papers (2021-02-22T12:38:53Z)
- Towards the Unification and Robustness of Perturbation and Gradient Based Explanations [23.41512277145231]
We analyze two popular post hoc interpretation techniques: SmoothGrad which is a gradient based method, and a variant of LIME which is a perturbation based method.
We derive explicit closed form expressions for the explanations output by these two methods and show that they both converge to the same explanation in expectation.
We empirically validate our theory using extensive experimentation on both synthetic and real world datasets.
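SmoothGrad, one of the two methods analyzed, is compact enough to state directly; under its usual definition it averages input gradients over Gaussian perturbations of the input (a generic sketch, not the authors' implementation):

```python
import torch

def smoothgrad(model, x, target_class, n_samples=50, sigma=0.15):
    """Average saliency over Gaussian-perturbed copies of a 1-sample batch x."""
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy)[0, target_class]
        grads += torch.autograd.grad(score, noisy)[0]
    return grads / n_samples
```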
arXiv Detail & Related papers (2021-02-21T14:51:18Z)
- On Generating Plausible Counterfactual and Semi-Factual Explanations for Deep Learning [15.965337956587373]
PlausIble Exceptionality-based Contrastive Explanations (PIECE) modifies all exceptional features in a test image to be normal from the perspective of the counterfactual class.
Two controlled experiments compare PIECE to other methods in the literature, showing that PIECE not only generates the most plausible counterfactuals on several measures, but also the best semi-factuals.
arXiv Detail & Related papers (2020-09-10T14:48:12Z)