Do Input Gradients Highlight Discriminative Features?
- URL: http://arxiv.org/abs/2102.12781v1
- Date: Thu, 25 Feb 2021 11:04:38 GMT
- Title: Do Input Gradients Highlight Discriminative Features?
- Authors: Harshay Shah, Prateek Jain, Praneeth Netrapalli
- Abstract summary: Interpretability methods seek to explain instance-specific model predictions.
We introduce an evaluation framework to study whether input gradients highlight discriminative features on benchmark image classification tasks.
We make two surprising observations on the CIFAR-10 and Imagenet-10 datasets.
- Score: 42.47346844105727
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interpretability methods that seek to explain instance-specific model
predictions [Simonyan et al. 2014, Smilkov et al. 2017] are often based on the
premise that the magnitude of input-gradient -- gradient of the loss with
respect to input -- highlights discriminative features that are relevant for
prediction over non-discriminative features that are irrelevant for prediction.
In this work, we introduce an evaluation framework to study this hypothesis for
benchmark image classification tasks, and make two surprising observations on
CIFAR-10 and Imagenet-10 datasets: (a) contrary to conventional wisdom, input
gradients of standard models (i.e., trained on the original data) actually
highlight irrelevant features over relevant features; (b) however, input
gradients of adversarially robust models (i.e., trained on adversarially
perturbed data) starkly highlight relevant features over irrelevant features.
To better understand input gradients, we introduce a synthetic testbed and
theoretically justify our counter-intuitive empirical findings. Our
observations motivate the need to formalize and verify common assumptions in
interpretability, while our evaluation framework and synthetic dataset serve as
a testbed to rigorously analyze instance-specific interpretability methods.
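To make the premise concrete, below is a minimal sketch of an input-gradient attribution map in PyTorch. It is an illustration under assumptions, not the authors' released code: `model`, `image`, and `label` are placeholder names for an arbitrary image classifier, a single input of shape (1, C, H, W), and its target class index.

```python
import torch
import torch.nn.functional as F

def input_gradient_map(model, image, label):
    """Return |d loss / d input| for one image, i.e. a plain input-gradient saliency map."""
    model.eval()
    image = image.clone().requires_grad_(True)   # track gradients with respect to the input
    loss = F.cross_entropy(model(image), label)  # instance-specific loss for the labeled class
    loss.backward()                              # populates image.grad
    # Aggregate the per-channel gradient magnitudes into one value per pixel.
    return image.grad.detach().abs().sum(dim=1).squeeze(0)
```

In the paper's terms, the question is whether the large entries of this map fall on discriminative (relevant) pixels; applying the same computation to a standard model and to an adversarially robust one is what separates observations (a) and (b) above.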
Related papers
- LUCID-GAN: Conditional Generative Models to Locate Unfairness [1.5257247496416746]
We present LUCID-GAN, which generates canonical inputs via a conditional generative model instead of gradient-based inverse design.
We empirically evaluate LUCID-GAN on the UCI Adult and COMPAS data sets and show that it allows for detecting unethical biases in black-box models without requiring access to the training data.
arXiv Detail & Related papers (2023-07-28T10:37:49Z)
- Generalizing Backpropagation for Gradient-Based Interpretability [103.2998254573497]
We show that the gradient of a model is a special case of a more general formulation using semirings.
This observation allows us to generalize the backpropagation algorithm to efficiently compute other interpretable statistics.
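As a rough illustration of the semiring idea only (a toy sketch, not that paper's algorithm; the helper below is hypothetical): for a chain of per-layer Jacobians, ordinary backpropagation accumulates them with the sum-product semiring, and swapping in another semiring such as max-product turns the same chained computation into a different path-wise statistic.

```python
import numpy as np

def semiring_chain(jacobians, add, mul, zero):
    """Accumulate a chain of per-layer Jacobians under a chosen (add, mul) semiring."""
    out = jacobians[0]
    for J in jacobians[1:]:
        nxt = np.full((out.shape[0], J.shape[1]), zero)
        for i in range(nxt.shape[0]):
            for j in range(nxt.shape[1]):
                nxt[i, j] = add(mul(out[i, :], J[:, j]))  # generalized dot product
        out = nxt
    return out

Js = [np.random.randn(3, 4), np.random.randn(4, 2)]  # toy per-layer Jacobians
chain_rule = semiring_chain(Js, add=np.sum, mul=np.multiply, zero=0.0)       # equals Js[0] @ Js[1]
max_product = semiring_chain(Js, add=np.max, mul=np.multiply, zero=-np.inf)  # strongest single path
```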
arXiv Detail & Related papers (2023-07-06T15:19:53Z)
- Model Debiasing via Gradient-based Explanation on Representation [14.673988027271388]
We propose a novel fairness framework that performs debiasing with regard to sensitive attributes and proxy attributes.
Our framework achieves better fairness-accuracy trade-off on unstructured and structured datasets than previous state-of-the-art approaches.
arXiv Detail & Related papers (2023-05-20T11:57:57Z)
- Measuring Implicit Bias Using SHAP Feature Importance and Fuzzy Cognitive Maps [1.9739269019020032]
In this paper, we integrate the concepts of feature importance with implicit bias in the context of pattern classification.
The amount of bias towards protected features might differ depending on whether the features are numerically or categorically encoded.
arXiv Detail & Related papers (2023-05-16T12:31:36Z)
- Rethinking interpretation: Input-agnostic saliency mapping of deep visual classifiers [28.28834523468462]
Saliency methods provide post-hoc model interpretation by attributing input features to the model outputs.
We show that input-specific saliency mapping is intrinsically susceptible to misleading feature attribution.
We introduce a new perspective of input-agnostic saliency mapping that computationally estimates the high-level features attributed by the model to its outputs.
arXiv Detail & Related papers (2023-03-31T06:58:45Z)
- Semi-FairVAE: Semi-supervised Fair Representation Learning with Adversarial Variational Autoencoder [92.67156911466397]
We propose a semi-supervised fair representation learning approach based on an adversarial variational autoencoder.
We use a bias-aware model to capture inherent bias information on the sensitive attribute.
We also use a bias-free model to learn debiased fair representations by using adversarial learning to remove bias information from them.
arXiv Detail & Related papers (2022-04-01T15:57:47Z)
- Generative Counterfactuals for Neural Networks via Attribute-Informed Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP).
By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently.
Experimental results on real-world texts and images demonstrate the effectiveness, sample quality as well as efficiency of our designed framework.
arXiv Detail & Related papers (2021-01-18T08:37:13Z)
- Achieving Equalized Odds by Resampling Sensitive Attributes [13.114114427206678]
We present a flexible framework for learning predictive models that approximately satisfy the equalized odds notion of fairness.
This differentiable functional is used as a penalty driving the model parameters towards equalized odds.
We develop a formal hypothesis test to detect whether a prediction rule violates this property, the first such test in the literature.
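For intuition only, a generic differentiable equalized-odds-style gap is sketched below; it is far simpler than the functional and hypothesis test proposed in that paper, and every name here is a placeholder.

```python
import torch

def equalized_odds_gap(scores, y, group):
    """Sum over true labels of the gap in mean predicted score between two groups.
    scores: (N,) predicted probabilities; y, group: (N,) binary tensors.
    Assumes every (label, group) cell is non-empty."""
    gap = scores.new_zeros(())
    for label in (0, 1):
        m0 = scores[(y == label) & (group == 0)].mean()
        m1 = scores[(y == label) & (group == 1)].mean()
        gap = gap + (m0 - m1).abs()
    return gap

# total_loss = task_loss + penalty_weight * equalized_odds_gap(scores, y, group)
```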
arXiv Detail & Related papers (2020-06-08T00:18:34Z)
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally-different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
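One plausible (and hypothetical) form of such a gradient-supervision term, sketched under the assumption that each example x comes with a counterfactual partner x_cf of the same shape: push the input gradient at x to point toward x_cf. This is not necessarily that paper's exact objective.

```python
import torch
import torch.nn.functional as F

def gradient_supervision_loss(model, x, y, x_cf):
    """Auxiliary term aligning the input gradient at x with the direction of its counterfactual."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (grad,) = torch.autograd.grad(loss, x, create_graph=True)  # keep graph so the term trains the model
    direction = (x_cf - x).detach()
    cos = F.cosine_similarity(grad.flatten(1), direction.flatten(1), dim=1)
    return (1.0 - cos).mean()  # small when the gradient points toward the counterfactual
```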
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
- Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.