Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing
- URL: http://arxiv.org/abs/2108.04990v1
- Date: Wed, 11 Aug 2021 02:07:21 GMT
- Title: Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing
- Authors: Sanchit Sinha, Hanjie Chen, Arshdeep Sekhon, Yangfeng Ji, Yanjun Qi
- Abstract summary: Interpretability methods need to be robust for trustworthy NLP applications in high-stakes areas like medicine or finance.
Our paper demonstrates how interpretations can be manipulated by making simple word perturbations on an input text.
- Score: 18.91129968022831
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interpretability methods like Integrated Gradient and LIME are popular
choices for explaining natural language model predictions with relative word
importance scores. These interpretations need to be robust for trustworthy NLP
applications in high-stakes areas like medicine or finance. Our paper
demonstrates how interpretations can be manipulated by making simple word
perturbations on an input text. Via a small fraction of word-level swaps, these
adversarial perturbations aim to make the resulting text semantically and
spatially similar to its seed input (therefore sharing similar
interpretations). Simultaneously, the generated examples achieve the same
prediction label as the seed yet are given a substantially different
explanation by the interpretation methods. Our experiments generate fragile
interpretations to attack two SOTA interpretation methods, across three popular
Transformer models and on two different NLP datasets. We observe that the
rank-order correlation drops by over 20% when, on average, fewer than 10% of
the words are perturbed, and it keeps decreasing as more words are perturbed.
Furthermore, we demonstrate that the candidates generated by our method score
well on quality metrics.
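To make the fragility measure concrete, here is a minimal sketch (not the paper's code) of computing the rank-order correlation between the word-importance vectors of a seed input and a perturbed variant; the scores below are hypothetical stand-ins for Integrated Gradient or LIME outputs.

```python
# Minimal sketch: measure interpretation fragility as the Spearman
# rank-order correlation between word-importance scores for a seed
# sentence and for a label-preserving perturbation of it.
from scipy.stats import spearmanr

def interpretation_shift(seed_scores, perturbed_scores):
    """Spearman correlation between two aligned word-importance vectors.

    Low correlation on a same-prediction pair indicates a fragile
    interpretation.
    """
    rho, _ = spearmanr(seed_scores, perturbed_scores)
    return rho

# Hypothetical attribution scores for a 6-token input and its variant.
seed_scores = [0.42, 0.05, 0.31, 0.02, 0.15, 0.05]
perturbed_scores = [0.08, 0.40, 0.06, 0.30, 0.11, 0.05]
print(interpretation_shift(seed_scores, perturbed_scores))  # low rho => fragile
```

A correlation near 1 means the perturbation left the explanation intact; the attacks described here drive this value down while the predicted label stays fixed.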
Related papers
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- Naturalistic Causal Probing for Morpho-Syntax [76.83735391276547]
We suggest a naturalistic strategy for input-level intervention on real-world data in Spanish.
Using our approach, we isolate morpho-syntactic features from confounders in sentences.
We apply this methodology to analyze the causal effects of gender and number on contextualized representations extracted from pre-trained models.
arXiv Detail & Related papers (2022-05-14T11:47:58Z)
- Block-Sparse Adversarial Attack to Fool Transformer-Based Text Classifiers [49.50163349643615]
In this paper, we propose a gradient-based adversarial attack against transformer-based text classifiers.
Experimental results demonstrate that, while our adversarial attack maintains the semantics of the sentence, it can reduce the accuracy of GPT-2 to less than 5%.
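For intuition, below is a hedged sketch of a generic first-order (HotFlip-style) token substitution, not the paper's exact block-sparse formulation: the gradient of the loss with respect to the input embeddings gives a linear score for every (position, replacement word) pair. All model components are toy stand-ins.

```python
# Generic first-order (HotFlip-style) substitution sketch with toy
# components; NOT the paper's block-sparse attack. The gradient of the
# loss w.r.t. the input embeddings linearly estimates how much each
# (position, replacement word) swap would increase the loss.
import torch
import torch.nn as nn

vocab, dim, seq_len, n_cls = 100, 16, 5, 2
emb = nn.Embedding(vocab, dim)                       # toy embedding table
clf = nn.Sequential(nn.Flatten(), nn.Linear(seq_len * dim, n_cls))  # toy classifier

ids = torch.randint(0, vocab, (1, seq_len))          # one 5-token input
x = emb(ids).detach().requires_grad_(True)           # embedded input
loss = nn.functional.cross_entropy(clf(x), torch.tensor([1]))
loss.backward()

# Estimated loss change for swapping position i to word w:
# (e_w - e_{ids[i]}) . grad_i ; pick the maximizing (word, position).
W = emb.weight.detach()
gain = W @ x.grad[0].T                               # (vocab, seq): e_w . grad_i
gain -= (x.grad[0] * x[0].detach()).sum(-1)          # minus e_orig . grad_i
word, pos = divmod(gain.argmax().item(), seq_len)
print(f"swap token at position {pos} to vocab id {word}")
```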
arXiv Detail & Related papers (2022-03-11T14:37:41Z)
- Data-Driven Mitigation of Adversarial Text Perturbation [1.3649494534428743]
We propose a deobfuscation pipeline to make NLP models robust to adversarial text perturbations.
We show CW2V embeddings are generally more robust to text perturbations than embeddings based on character ngrams.
With our pipeline, engagement bait classification drops only from 0.70 to 0.67 AUC under adversarial text perturbation.
arXiv Detail & Related papers (2022-02-19T00:49:12Z)
- More Than Words: Towards Better Quality Interpretations of Text Classifiers [16.66535643383862]
We show that token-based interpretability, while being a convenient first choice given the input interfaces of the ML models, is not the most effective one in all situations.
We show that higher-level feature attributions offer several advantages: 1) they are more robust as measured by the randomization tests, 2) they lead to lower variability when using approximation-based methods like SHAP, and 3) they are more intelligible to humans in situations where the linguistic coherence resides at a higher level.
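As an illustration of the higher-level attributions referred to above, here is a minimal sketch of lifting token-level scores to phrase level by summing over spans; the spans and scores are hypothetical, and the paper's exact grouping may differ.

```python
# Minimal sketch: lift token-level attributions to phrase level by
# summing scores within hypothetical phrase spans (end index exclusive).
def aggregate_attributions(token_scores, spans):
    return [sum(token_scores[s:e]) for s, e in spans]

token_scores = [0.05, 0.30, 0.25, 0.02, 0.18, 0.20]   # per-token scores
phrase_spans = [(0, 3), (3, 6)]                        # e.g. NP and VP spans
print(aggregate_attributions(token_scores, phrase_spans))  # ~[0.60, 0.40]
```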
arXiv Detail & Related papers (2021-12-23T10:18:50Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs between paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper addresses the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions, yielding a metric called Neighboring Distribution Divergence (NDD).
Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
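A hedged sketch of the mask-and-predict idea: obtain the MLM's predicted vocabulary distribution at each aligned masked position in both texts and accumulate a divergence across positions. The distributions below are toy stand-ins for real BERT-style MLM outputs, and the exact divergence used by NDD may differ from the KL shown here.

```python
# Sketch of mask-and-predict: compare the MLM's predicted vocabulary
# distributions at aligned masked positions of the two texts.
# `dists_a` / `dists_b` are toy stand-ins for real MLM outputs.
import torch

def kl(p, q, eps=1e-9):
    """KL(p || q) between two vocabulary distributions."""
    return (p * ((p + eps) / (q + eps)).log()).sum()

def ndd_like_distance(dists_a, dists_b):
    """Accumulate divergence over aligned masked positions."""
    return sum(kl(p, q) for p, q in zip(dists_a, dists_b))

dists_a = [torch.tensor([0.7, 0.1, 0.1, 0.1]), torch.tensor([0.25] * 4)]
dists_b = [torch.tensor([0.4, 0.3, 0.2, 0.1]), torch.tensor([0.25] * 4)]
print(ndd_like_distance(dists_a, dists_b).item())  # larger => more divergent
```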
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- Statistically significant detection of semantic shifts using contextual word embeddings [7.439525715543974]
We propose an approach to estimate semantic shifts by combining contextual word embeddings with permutation-based statistical tests.
We demonstrate the performance of this approach in simulation, achieving consistently high precision by suppressing false positives.
We additionally analyze real-world data from SemEval-2020 Task 1 and the Liverpool FC subreddit corpus.
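A minimal sketch of such a permutation test, assuming mean-embedding distance as the test statistic (the paper's statistic may differ); the embeddings are random stand-ins for real contextual vectors.

```python
# Sketch of a permutation test for semantic shift: is the distance
# between the mean contextual embeddings of a word's usages in two
# corpora larger than expected under random relabeling of the usages?
import numpy as np

rng = np.random.default_rng(0)

def mean_distance(a, b):
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

def permutation_pvalue(emb_a, emb_b, n_perm=1000):
    observed = mean_distance(emb_a, emb_b)
    pooled = np.vstack([emb_a, emb_b])
    n_a = len(emb_a)
    hits = sum(
        mean_distance(pooled[p[:n_a]], pooled[p[n_a:]]) >= observed
        for p in (rng.permutation(len(pooled)) for _ in range(n_perm))
    )
    return (hits + 1) / (n_perm + 1)

emb_old = rng.normal(0.0, 1.0, size=(50, 8))  # usages in corpus A
emb_new = rng.normal(0.5, 1.0, size=(50, 8))  # shifted usages in corpus B
print(permutation_pvalue(emb_old, emb_new))   # small p => significant shift
```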
arXiv Detail & Related papers (2021-04-08T13:58:54Z)
- Are Interpretations Fairly Evaluated? A Definition Driven Pipeline for Post-Hoc Interpretability [54.85658598523915]
We propose that a concrete definition of interpretation is needed before the faithfulness of an interpretation can be evaluated.
We find that although interpretation methods perform differently under a given evaluation metric, such a difference may not result from interpretation quality or faithfulness.
arXiv Detail & Related papers (2020-09-16T06:38:03Z)
- SLAM-Inspired Simultaneous Contextualization and Interpreting for Incremental Conversation Sentences [0.0]
We propose a method to dynamically estimate the context and interpretations of polysemous words in sequential sentences.
By using the SCAIN algorithm, we can sequentially optimize the interdependence between context and word interpretation while obtaining new interpretations online.
arXiv Detail & Related papers (2020-05-29T16:40:27Z)
- Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from a syntactic language model.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
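A minimal sketch of the distillation objective, assuming a standard KL term between the teacher's marginal and the student's masked-word distribution; both models are toy stand-ins.

```python
# Sketch of the distillation term: push the student's masked-word
# distribution toward a (fabricated) teacher marginal from a syntactic
# LM via KL divergence; both models are toy stand-ins.
import torch
import torch.nn.functional as F

vocab = 10
teacher_probs = torch.softmax(torch.randn(vocab), dim=-1)  # syntactic LM marginal
student_logits = torch.randn(vocab, requires_grad=True)    # BERT-style student

# F.kl_div expects log-probs for the student and probs for the teacher.
loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                teacher_probs, reduction="sum")
loss.backward()  # gradient flows into the student as in pretraining
print(loss.item())
```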
arXiv Detail & Related papers (2020-05-27T16:44:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.