Gradient-based Analysis of NLP Models is Manipulable
- URL: http://arxiv.org/abs/2010.05419v1
- Date: Mon, 12 Oct 2020 02:54:22 GMT
- Title: Gradient-based Analysis of NLP Models is Manipulable
- Authors: Junlin Wang, Jens Tuyls, Eric Wallace, Sameer Singh
- Abstract summary: We demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses.
In particular, we merge the layers of a target model with a Facade that overwhelms the gradients without affecting the predictions.
- Score: 44.215057692679494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient-based analysis methods, such as saliency map visualizations and
adversarial input perturbations, have found widespread use in interpreting
neural NLP models due to their simplicity, flexibility, and most importantly,
their faithfulness. In this paper, however, we demonstrate that the gradients
of a model are easily manipulable, and thus bring into question the reliability
of gradient-based analyses. In particular, we merge the layers of a target
model with a Facade that overwhelms the gradients without affecting the
predictions. This Facade can be trained to have gradients that are misleading
and irrelevant to the task, such as focusing only on the stop words in the
input. On a variety of NLP tasks (text classification, NLI, and QA), we show
that our method can manipulate numerous gradient-based analysis techniques:
saliency maps, input reduction, and adversarial perturbations all identify
unimportant or targeted tokens as being highly important. The code and a
tutorial for this paper are available at http://ucinlp.github.io/facade.
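As a rough, self-contained illustration of the abstract's claim, the sketch below merges a toy target classifier with a low-amplitude, high-frequency facade term: the extra term is far too small to flip the argmax prediction, yet its input gradients dwarf the target's, so a plain gradient saliency map reflects the facade rather than the target. The additive sine-based merge, the toy linear modules, and the hyperparameters are illustrative assumptions, not the paper's actual layer-wise merging or facade training objective (see the linked tutorial for that).

```python
# Illustrative sketch only: a tiny additive "facade" term whose output is
# negligible (amplitude eps) but whose input gradients are huge (~ eps * scale),
# so a plain gradient saliency map stops reflecting the target model.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim, n_classes, seq_len = 100, 16, 2, 8

embed = nn.Embedding(vocab, dim)
target = nn.Linear(dim, n_classes)   # stand-in for the target classifier
facade = nn.Linear(dim, n_classes)   # stand-in for the trained facade

tokens = torch.randint(0, vocab, (1, seq_len))
emb = embed(tokens).detach().requires_grad_(True)   # saliency is taken w.r.t. these

def merged_logits(e, eps=1e-3, scale=1e6):
    # The target's logits decide the prediction; the facade adds at most +/- eps
    # to each logit but contributes input gradients on the order of eps * scale.
    return target(e.mean(1)) + eps * torch.sin(scale * facade(e).mean(1))

logits = merged_logits(emb)
pred = logits.argmax(-1)
print("prediction unchanged:", torch.equal(pred, target(emb.mean(1)).argmax(-1)))

logits[0, pred].sum().backward()
saliency = emb.grad.norm(dim=-1)        # per-token gradient saliency
print("per-token saliency:", saliency)  # dominated by the facade, not the target
```

In the paper itself, the facade is instead trained so that the merged model's gradients concentrate on chosen tokens such as stop words while its predictions stay identical to the target's.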
Related papers
- Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI [59.96044730204345]
We introduce Derivative-Free Diffusion Manifold-Constrained Gradients (FreeMCG).
FreeMCG serves as an improved basis for the explainability of a given neural network.
We show that our method yields state-of-the-art results while preserving the essential properties expected of XAI tools.
arXiv Detail & Related papers (2024-11-22T11:15:14Z) - Unlearning-based Neural Interpretations [51.99182464831169]
We show that current baselines defined using static functions are biased, fragile and manipulable.
We propose UNI to compute an (un)learnable, debiased and adaptive baseline by perturbing the input towards an unlearning direction of steepest ascent.
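A minimal sketch of one plausible reading of this idea, assuming the "unlearning direction of steepest ascent" is approximated by gradient ascent on the loss for the model's own prediction; the perturbed input then serves as an input-specific baseline for an attribution method. The function name, step count, and step size below are hypothetical, and this is not the UNI algorithm itself.

```python
# Hypothetical sketch (not the UNI algorithm itself): build an input-specific
# baseline by gradient *ascent* on the loss for the model's own prediction,
# i.e. perturb the input until the prediction is "unlearned".
import torch
import torch.nn.functional as F

def unlearning_baseline(model, emb, steps=20, lr=0.5):
    """emb: (1, seq_len, dim) input embeddings; returns a perturbed baseline."""
    pred = model(emb).argmax(-1)                     # class to unlearn
    baseline = emb.clone().detach()
    for _ in range(steps):
        baseline.requires_grad_(True)
        loss = F.cross_entropy(model(baseline), pred)
        grad, = torch.autograd.grad(loss, baseline)
        baseline = (baseline + lr * grad).detach()   # steepest ascent on the loss
    return baseline                                  # reference point for attribution

# toy usage with a stand-in classifier over mean-pooled embeddings
lin = torch.nn.Linear(16, 3)
model = lambda e: lin(e.mean(1))
baseline = unlearning_baseline(model, torch.randn(1, 8, 16))
```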
arXiv Detail & Related papers (2024-10-10T16:02:39Z) - Probing the Purview of Neural Networks via Gradient Analysis [13.800680101300756]
We analyze the data-dependent capacity of neural networks and assess anomalies in inputs from the perspective of networks during inference.
To probe the purview of a network, we utilize gradients to measure the amount of change required for the model to characterize the given inputs more accurately.
We demonstrate that our gradient-based approach can effectively differentiate inputs that cannot be accurately represented with learned features.
arXiv Detail & Related papers (2023-04-06T03:02:05Z) - Tell Model Where to Attend: Improving Interpretability of Aspect-Based
Sentiment Classification via Small Explanation Annotations [23.05672636220897]
We propose an Interpretation-Enhanced Gradient-based framework for ABSC via a small number of explanation annotations, namely IEGA.
Our framework is model-agnostic and task-agnostic, so it can be integrated into existing ABSC methods or other tasks.
arXiv Detail & Related papers (2023-02-21T06:55:08Z) - Locally Aggregated Feature Attribution on Natural Language Model
Understanding [12.233103741197334]
Locally Aggregated Feature Attribution (LAFA) is a novel gradient-based feature attribution method for NLP models.
Instead of relying on obscure reference tokens, it smooths gradients by aggregating similar reference texts derived from language model embeddings.
For evaluation purposes, we design experiments on different NLP tasks, including entity recognition and sentiment analysis, on public datasets.
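The exact LAFA procedure is not spelled out here, but the flavor the summary describes (aggregating attributions over several nearby reference texts instead of one fixed baseline token) can be sketched as multi-reference integrated gradients, where the references would come from, e.g., nearest neighbours in a language model's embedding space. The code below is an assumed, simplified stand-in rather than the paper's method.

```python
# Simplified stand-in (not the LAFA method itself): average integrated-gradients
# attributions over several reference embeddings instead of one fixed baseline.
import torch

def multi_reference_attribution(model, emb, references, target_class, steps=16):
    """emb: (1, L, D) input embeddings; references: list of (1, L, D) baselines."""
    per_reference = []
    for ref in references:
        avg_grad = torch.zeros_like(emb)
        for a in torch.linspace(0.0, 1.0, steps):
            x = (ref + a * (emb - ref)).detach().requires_grad_(True)
            score = model(x)[0, target_class]
            grad, = torch.autograd.grad(score, x)
            avg_grad = avg_grad + grad / steps
        per_reference.append((avg_grad * (emb - ref)).sum(-1))  # per-token scores
    return torch.stack(per_reference).mean(0)   # smooth by aggregating references
```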
arXiv Detail & Related papers (2022-04-22T18:59:27Z) - Bayesian Graph Contrastive Learning [55.36652660268726]
We propose a novel perspective on graph contrastive learning methods, showing that random augmentations naturally lead to stochastic encoders.
Our proposed method represents each node by a distribution in the latent space, in contrast to existing techniques that embed each node as a deterministic vector.
We show a considerable improvement in performance compared to existing state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2021-12-15T01:45:32Z) - Revealing and Protecting Labels in Distributed Training [3.18475216176047]
We propose a method to discover the set of labels of training samples from only the gradient of the last layer and the ID-to-label mapping.
We demonstrate the effectiveness of our method for model training in two domains: image classification and automatic speech recognition.
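The leakage this builds on is easy to see in the simplest case: with a softmax/cross-entropy head, the gradient of the loss with respect to the final-layer bias equals softmax(z) − onehot(y), whose only negative entry sits at the true label. The sketch below shows that single-example case; recovering the full label set of a batch, as the paper does, takes more work than this.

```python
# Single-example illustration of the leakage: with cross-entropy, the gradient
# of the loss w.r.t. the final-layer bias is softmax(z) - onehot(y), so its
# only negative entry is at the true label.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_classes, dim = 10, 32
last_layer = torch.nn.Linear(dim, n_classes)

h = torch.randn(1, dim)                 # penultimate features
y = torch.tensor([7])                   # the "private" label
F.cross_entropy(last_layer(h), y).backward()

recovered = int(last_layer.bias.grad.argmin())   # most negative entry = true label
print(recovered, recovered == int(y))            # 7 True
```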
arXiv Detail & Related papers (2021-10-31T17:57:49Z) - Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z) - Interpreting Graph Neural Networks for NLP With Differentiable Edge
Masking [63.49779304362376]
Graph neural networks (GNNs) have become a popular approach to integrating structural inductive biases into NLP models.
We introduce a post-hoc method for interpreting GNN predictions by identifying unnecessary edges.
We show that we can drop a large proportion of edges without deteriorating the performance of the model.
arXiv Detail & Related papers (2020-10-01T17:51:19Z) - Gradients as a Measure of Uncertainty in Neural Networks [16.80077149399317]
We propose to utilize backpropagated gradients to quantify the uncertainty of trained models.
We show that our gradient-based method outperforms state-of-the-art methods by up to 4.8% AUROC in out-of-distribution detection.
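One simple variant of this idea (not necessarily the paper's exact formulation) backpropagates a label-free loss and uses the size of the resulting parameter gradients as the uncertainty score: inputs the model represents poorly need larger parameter updates and hence produce larger gradients.

```python
# Hedged sketch of a gradient-based uncertainty score: backpropagate a label-free
# loss (KL divergence from the softmax output to a uniform distribution) and use
# the size of the resulting parameter gradients as the score.
import torch
import torch.nn.functional as F

def gradient_uncertainty(model: torch.nn.Module, x: torch.Tensor) -> float:
    model.zero_grad()
    logits = model(x)
    uniform = torch.full_like(logits, 1.0 / logits.shape[-1])
    loss = F.kl_div(F.log_softmax(logits, dim=-1), uniform, reduction="batchmean")
    loss.backward()
    sq_norm = sum(float(p.grad.norm()) ** 2 for p in model.parameters() if p.grad is not None)
    return sq_norm ** 0.5   # larger -> more uncertain / more likely out-of-distribution
```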
arXiv Detail & Related papers (2020-08-18T16:58:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.