Robustness of Explanation Methods for NLP Models
- URL: http://arxiv.org/abs/2206.12284v1
- Date: Fri, 24 Jun 2022 13:34:07 GMT
- Title: Robustness of Explanation Methods for NLP Models
- Authors: Shriya Atmakuri, Tejas Chheda, Dinesh Kandula, Nishant Yadav, Taesung
Lee, Hessel Tuinhof
- Abstract summary: Explanation methods have emerged as an important tool to highlight the features responsible for the predictions of neural networks.
There is mounting evidence that many explanation methods are rather unreliable and susceptible to malicious manipulations.
We provide initial insights and results towards devising a successful adversarial attack against text explanations.
- Score: 5.191443390565865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explanation methods have emerged as an important tool to highlight the
features responsible for the predictions of neural networks. There is mounting
evidence that many explanation methods are rather unreliable and susceptible to
malicious manipulations. In this paper, we particularly aim to understand the
robustness of explanation methods in the context of text modality. We provide
initial insights and results towards devising a successful adversarial attack
against text explanations. To our knowledge, this is the first attempt to
evaluate the adversarial robustness of an explanation method. Our experiments
show the explanation method can be largely disturbed for up to 86% of the
tested samples with small changes in the input sentence and its semantics.
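The abstract above reports that small, semantics-preserving edits can heavily disturb a text explanation for up to 86% of tested samples. The paper's code is not reproduced here; the block below is only a minimal sketch of that kind of robustness check, comparing a gradient-times-input saliency before and after a single word swap. The classifier name, the example sentences, and the swap itself are illustrative assumptions, not the authors' attack.

```python
# Minimal sketch (not the paper's attack): quantify how much a gradient-based
# saliency explanation changes under a small, meaning-preserving word swap.
# The model, the sentences, and the swap below are assumptions for illustration.
from scipy.stats import spearmanr
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed classifier
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def token_saliency(sentence):
    """Gradient-times-input saliency per token, one common explanation method."""
    enc = tokenizer(sentence, return_tensors="pt")
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach()
    embeds.requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
    logits[0, logits.argmax()].backward()  # explain the predicted class
    return (embeds.grad * embeds).sum(-1).abs()[0].tolist()

original = "The movie was surprisingly good and well acted."
perturbed = "The film was surprisingly good and well acted."  # one-word swap

s_orig, s_pert = token_saliency(original), token_saliency(perturbed)
assert len(s_orig) == len(s_pert), "this simple check assumes equal token counts"
rho, _ = spearmanr(s_orig, s_pert)
print(f"Spearman correlation between saliency rankings: {rho:.3f}")
```

A low rank correlation despite a near-identical input is the kind of disturbance the 86% figure quantifies; an actual attack would search over many candidate edits rather than trying a single swap by hand.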
Related papers
- An AI Architecture with the Capability to Explain Recognition Results [0.0]
This research focuses on the importance of metrics to explainability and contributes two methods yielding performance gains.
The first method introduces a combination of explainable and unexplainable flows, proposing a metric to characterize explainability of a decision.
The second method compares classic metrics for estimating the effectiveness of neural networks in the system, positing a new metric as the leading performer.
arXiv Detail & Related papers (2024-06-13T02:00:13Z) - Counterfactuals of Counterfactuals: a back-translation-inspired approach
to analyse counterfactual editors [3.4253416336476246]
We focus on the analysis of counterfactual, contrastive explanations.
We propose a new back-translation-inspired evaluation methodology.
We show that by iteratively feeding the counterfactual to the explainer we can obtain valuable insights into the behaviour of both the predictor and the explainer models.
arXiv Detail & Related papers (2023-05-26T16:04:28Z) - Abductive Commonsense Reasoning Exploiting Mutually Exclusive
Explanations [118.0818807474809]
Abductive reasoning aims to find plausible explanations for an event.
Existing approaches for abductive reasoning in natural language processing often rely on manually generated annotations for supervision.
This work proposes an approach for abductive commonsense reasoning that exploits the fact that only a subset of explanations is correct for a given context.
arXiv Detail & Related papers (2023-05-24T01:35:10Z) - Explanation Selection Using Unlabeled Data for Chain-of-Thought
Prompting [80.9896041501715]
Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance.
This paper tackles the problem of how to optimize explanation-infused prompts in a blackbox fashion.
arXiv Detail & Related papers (2023-02-09T18:02:34Z) - Robust Explanation Constraints for Neural Networks [33.14373978947437]
Post-hoc explanation methods are used with the intent of providing insights about neural networks and are sometimes said to help engender trust in their outputs.
Our training method is the only one able to learn neural networks with robustness insights that hold across all six tested networks.
arXiv Detail & Related papers (2022-12-16T14:40:25Z) - Testing the effectiveness of saliency-based explainability in NLP using
randomized survey-based experiments [0.6091702876917281]
A lot of work in Explainable AI has aimed to devise explanation methods that give humans insights into the workings and predictions of NLP models.
Innate human tendencies and biases can hinder how well humans understand these explanations.
We designed a randomized survey-based experiment to assess the effectiveness of saliency-based post-hoc explainability methods in Natural Language Processing.
arXiv Detail & Related papers (2022-11-25T08:49:01Z) - Don't Explain Noise: Robust Counterfactuals for Randomized Ensembles [50.81061839052459]
We formalize the generation of robust counterfactual explanations as a probabilistic problem.
We show the link between the robustness of ensemble models and the robustness of base learners.
Our method achieves high robustness with only a small increase in the distance from counterfactual explanations to their initial observations.
arXiv Detail & Related papers (2022-05-27T17:28:54Z) - Logical Satisfiability of Counterfactuals for Faithful Explanations in
NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z) - Human Interpretation of Saliency-based Explanation Over Text [65.29015910991261]
We study saliency-based explanations over textual data.
We find that people often mis-interpret the explanations.
We propose a method to adjust saliencies based on model estimates of over- and under-perception.
arXiv Detail & Related papers (2022-01-27T15:20:32Z) - Unsupervised Detection of Adversarial Examples with Model Explanations [0.6091702876917279]
We propose a simple yet effective method to detect adversarial examples using methods developed to explain the model's behavior.
Our evaluations with the MNIST handwritten digits dataset show that our method is capable of detecting adversarial examples with high confidence.
arXiv Detail & Related papers (2021-07-22T06:54:18Z) - Evaluations and Methods for Explanation through Robustness Analysis [117.7235152610957]
We establish a novel set of evaluation criteria for such feature-based explanations via robustness analysis.
We obtain new explanations that are loosely necessary and sufficient for a prediction.
We extend the explanation to extract the set of features that would move the current prediction to a target class.
arXiv Detail & Related papers (2020-05-31T05:52:05Z)
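The last entry above describes extracting the set of features that would move the current prediction to a target class. As a rough illustration only, the sketch below greedily masks features in the order that most raises the target-class probability; the scikit-learn pipeline, the tabular dataset, and the zero-masking perturbation are assumptions for the example, not the authors' method.

```python
# Minimal sketch (not the authors' method): greedily find a set of features whose
# removal pushes a classifier's prediction toward a chosen target class.
# The dataset, the model, and the zero-masking perturbation are assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

def features_toward_target(x, target):
    """Mask features one by one, always picking the one that most raises P(target)."""
    x = x.copy()
    chosen, remaining = [], list(range(len(x)))

    def prob_if_masked(j):
        x_try = x.copy()
        x_try[j] = 0.0  # crude stand-in for "feature removed"
        return model.predict_proba([x_try])[0, target]

    while remaining and model.predict([x])[0] != target:
        best = max(remaining, key=prob_if_masked)
        chosen.append(best)
        remaining.remove(best)
        x[best] = 0.0
    return chosen

sample = X[0]
target = 1 - model.predict([sample])[0]  # aim for the opposite class
print("Features masked to move the prediction to the target class:",
      features_toward_target(sample, target))
```

Running it prints the indices of the masked features, a crude "what would need to change" set; the paper pursues the same idea through robustness analysis rather than this naive zero-masking.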