Fairwashing Explanations with Off-Manifold Detergent
- URL: http://arxiv.org/abs/2007.09969v1
- Date: Mon, 20 Jul 2020 09:42:06 GMT
- Title: Fairwashing Explanations with Off-Manifold Detergent
- Authors: Christopher J. Anders, Plamen Pasliev, Ann-Kathrin Dombrowski,
Klaus-Robert Müller and Pan Kessel
- Abstract summary: Explanation methods promise to make black-box classifiers more transparent.
We show both theoretically and experimentally that these hopes are presently unfounded.
We propose a modification of existing explanation methods which makes them significantly more robust.
- Score: 4.934817254755008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explanation methods promise to make black-box classifiers more transparent.
As a result, it is hoped that they can act as proof for a sensible, fair and
trustworthy decision-making process of the algorithm and thereby increase its
acceptance by the end-users. In this paper, we show both theoretically and
experimentally that these hopes are presently unfounded. Specifically, we show
that, for any classifier $g$, one can always construct another classifier
$\tilde{g}$ which has the same behavior on the data (same train, validation,
and test error) but has arbitrarily manipulated explanation maps. We derive
this statement theoretically using differential geometry and demonstrate it
experimentally for various explanation methods, architectures, and datasets.
Motivated by our theoretical insights, we then propose a modification of
existing explanation methods which makes them significantly more robust.
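The construction is concrete enough to sketch. Below is a minimal, hypothetical PyTorch version of the manipulation described above: fine-tune a copy of a classifier so that its outputs on the data stay nearly fixed while its gradient (saliency) map is pushed toward an arbitrary target map. The names `fairwash`, `target_map` and the weight `gamma` are illustrative assumptions, not the authors' code.

```python
import copy
import torch

def fairwash(model, loader, target_map, gamma=100.0, lr=1e-4, epochs=1):
    """Return a copy of `model` with (nearly) unchanged outputs on the data
    but gradient explanations steered toward `target_map(x)`."""
    g_tilde = copy.deepcopy(model)  # the manipulated classifier g~
    opt = torch.optim.Adam(g_tilde.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:
            x = x.clone().requires_grad_(True)
            out_ref = model(x).detach()          # original predictions, frozen
            out = g_tilde(x)
            # gradient ("saliency") explanation of the originally predicted class
            score = out.gather(1, out_ref.argmax(1, keepdim=True)).sum()
            expl = torch.autograd.grad(score, x, create_graph=True)[0]
            # term 1: keep behavior on the data; term 2: steer the explanation
            loss = ((out - out_ref) ** 2).mean() \
                 + gamma * ((expl - target_map(x)) ** 2).mean()
            opt.zero_grad()
            loss.backward()  # double backprop; smooth activations (e.g. softplus) behave best
            opt.step()
    return g_tilde
```

Nothing constrains the fine-tuned copy away from the data, which is why the manipulation can live off-manifold without affecting train, validation, or test error.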
Related papers
- Abductive Commonsense Reasoning Exploiting Mutually Exclusive
Explanations [118.0818807474809]
Abductive reasoning aims to find plausible explanations for an event.
Existing approaches for abductive reasoning in natural language processing often rely on manually generated annotations for supervision.
This work proposes an approach for abductive commonsense reasoning that exploits the fact that only a subset of explanations is correct for a given context.
arXiv Detail & Related papers (2023-05-24T01:35:10Z)
- Evaluating the Robustness of Interpretability Methods through Explanation Invariance and Equivariance [72.50214227616728]
Interpretability methods are valuable only if their explanations faithfully describe the explained model.
We consider neural networks whose predictions are invariant under a specific symmetry group.
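As a generic, hedged illustration of such a check (not the paper's exact metric): compare the explanation of a transformed input with the explanation of the original input, over all elements of the symmetry group.

```python
import torch
import torch.nn.functional as F

def explanation_invariance(explain, model, x, group_actions):
    """`explain(model, x)` returns an attribution shaped like `x`;
    `group_actions` are callables t with model(t(x)) == model(x)."""
    e_ref = explain(model, x).flatten()
    sims = [
        F.cosine_similarity(explain(model, t(x)).flatten(), e_ref, dim=0)
        for t in group_actions
    ]
    return torch.stack(sims).mean()  # 1.0 = explanations fully invariant
```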
arXiv Detail & Related papers (2023-04-13T17:59:03Z)
- Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting [80.9896041501715]
Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance.
This paper tackles the problem of how to optimize explanation-infused prompts in a black-box fashion.
arXiv Detail & Related papers (2023-02-09T18:02:34Z)
- Probing Classifiers are Unreliable for Concept Removal and Detection [18.25734277357466]
Neural network models trained on text data have been found to encode undesirable linguistic or sensitive concepts in their representation.
Recent work has proposed post-hoc and adversarial methods to remove such unwanted concepts from a model's representation.
We show that these methods can be counter-productive, and in the worst case may end up destroying all task-relevant features.
arXiv Detail & Related papers (2022-07-08T23:15:26Z)
- The Manifold Hypothesis for Gradient-Based Explanations [55.01671263121624]
Gradient-based explanation algorithms are often assumed to provide perceptually-aligned explanations.
We show that the more a feature attribution is aligned with the tangent space of the data, the more perceptually-aligned it tends to be.
We suggest that explanation algorithms should actively strive to align their explanations with the data manifold.
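A rough sketch of how such an alignment can be measured, assuming the tangent space is estimated by local PCA over nearest neighbours (the function and parameter names are illustrative; the paper works with generative models and analytically known manifolds):

```python
import numpy as np

def tangent_alignment(attribution, x, data, k=50, dim=10):
    """Fraction of the attribution's norm lying in the estimated tangent space at x."""
    # estimate the tangent space at x from its k nearest neighbours
    neighbours = data[np.argsort(np.linalg.norm(data - x, axis=1))[:k]]
    centered = neighbours - neighbours.mean(axis=0)
    # top right-singular vectors span the local principal (tangent) directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:dim]                          # (dim, n_features)
    projected = basis.T @ (basis @ attribution)
    return np.linalg.norm(projected) / np.linalg.norm(attribution)
```

A value near 1 means the attribution lies almost entirely in the tangent space and, by the paper's hypothesis, should look more perceptually aligned.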
arXiv Detail & Related papers (2022-06-15T08:49:24Z)
- Don't Explain Noise: Robust Counterfactuals for Randomized Ensembles [50.81061839052459]
We formalize the generation of robust counterfactual explanations as a probabilistic problem.
We show the link between the robustness of ensemble models and the robustness of base learners.
Our method achieves high robustness with only a small increase in the distance from counterfactual explanations to their initial observations.
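One simplified way to see the ensemble/base-learner link in code (a sketch assuming a scikit-learn style forest, not the paper's probabilistic formulation): a candidate counterfactual is more robust the larger the fraction of base learners that already vote for the target class.

```python
import numpy as np

def counterfactual_support(forest, x_cf, target):
    """Fraction of base learners classifying the counterfactual as `target`."""
    # individual sklearn trees predict class *indices*, so map the target label
    target_idx = list(forest.classes_).index(target)
    votes = [int(tree.predict(x_cf.reshape(1, -1))[0]) == target_idx
             for tree in forest.estimators_]
    return np.mean(votes)

def is_robust(forest, x_cf, target, tau=0.9):
    # `tau` is an illustrative robustness threshold
    return counterfactual_support(forest, x_cf, target) >= tau
```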
arXiv Detail & Related papers (2022-05-27T17:28:54Z)
- Provable concept learning for interpretable predictions using variational inference [7.0349768355860895]
In safety critical applications, practitioners are reluctant to trust neural networks when no interpretable explanations are available.
We propose a probabilistic modeling framework to derive (C)oncept (L)earning and (P)rediction (CLAP).
We prove that our method is able to identify such interpretable concepts while attaining optimal classification accuracy.
arXiv Detail & Related papers (2022-04-01T14:51:38Z)
- Causality-based Counterfactual Explanation for Classification Models [11.108866104714627]
We propose a prototype-based counterfactual explanation framework (ProCE)
ProCE is capable of preserving the causal relationship underlying the features of the counterfactual data.
In addition, we design a novel gradient-free optimization based on the multi-objective genetic algorithm that generates the counterfactual explanations.
arXiv Detail & Related papers (2021-05-03T09:25:59Z)
- Explainers in the Wild: Making Surrogate Explainers Robust to Distortions through Perception [77.34726150561087]
We propose a methodology to evaluate the effect of distortions in explanations by embedding perceptual distances.
We generate explanations for images in the ImageNet-C dataset and demonstrate how using perceptual distances in the surrogate explainer creates more coherent explanations for the distorted and reference images.
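The core idea is easy to sketch: when fitting a LIME-style local surrogate, weight the perturbed samples by a perceptual distance to the reference image rather than a plain pixel-space one. Here `perceptual_distance` stands in for any such metric (e.g. an LPIPS-like model); everything else is a standard weighted linear fit, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def surrogate_weights(x_ref, perturbed, perceptual_distance, width=0.25):
    # kernel weights from *perceptual* (not pixel-space) distances
    d = np.array([perceptual_distance(x_ref, z) for z in perturbed])
    return np.exp(-(d ** 2) / width ** 2)

def fit_surrogate(features, predictions, weights):
    # weighted linear surrogate; coefficients act as local feature importances
    lin = Ridge(alpha=1.0)
    lin.fit(features, predictions, sample_weight=weights)
    return lin.coef_
```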
arXiv Detail & Related papers (2021-02-22T12:38:53Z)
- Towards the Unification and Robustness of Perturbation and Gradient Based Explanations [23.41512277145231]
We analyze two popular post hoc interpretation techniques: SmoothGrad which is a gradient based method, and a variant of LIME which is a perturbation based method.
We derive explicit closed form expressions for the explanations output by these two methods and show that they both converge to the same explanation in expectation.
We empirically validate our theory using extensive experimentation on both synthetic and real world datasets.
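SmoothGrad, one of the two methods analyzed, is compact enough to state directly; under its usual definition it averages input gradients over Gaussian perturbations of the input (a generic sketch, not the authors' implementation):

```python
import torch

def smoothgrad(model, x, target_class, n_samples=50, sigma=0.15):
    """Average saliency over Gaussian-perturbed copies of a 1-sample batch x."""
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy)[0, target_class]
        grads += torch.autograd.grad(score, noisy)[0]
    return grads / n_samples
```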
arXiv Detail & Related papers (2021-02-21T14:51:18Z)
- On Generating Plausible Counterfactual and Semi-Factual Explanations for Deep Learning [15.965337956587373]
PlausIble Exceptionality-based Contrastive Explanations (PIECE) modifies all exceptional features in a test image to be normal from the perspective of the counterfactual class.
Two controlled experiments compare PIECE to other methods in the literature, showing that PIECE not only generates the most plausible counterfactuals on several measures, but also the best semi-factuals.
arXiv Detail & Related papers (2020-09-10T14:48:12Z)