Estimating the Adversarial Robustness of Attributions in Text with Transformers
- URL: http://arxiv.org/abs/2212.09155v1
- Date: Sun, 18 Dec 2022 20:18:59 GMT
- Authors: Adam Ivankay, Mattia Rigotti, Ivan Girardi, Chiara Marchiori, Pascal Frossard
- Abstract summary: We establish a novel definition of attribution robustness (AR) in text classification, based on Lipschitz continuity.
We then propose our novel TransformerExplanationAttack (TEA), a strong adversary that provides a tight estimation for attribution robustness in text classification.
- Score: 44.745873282080346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explanations are crucial parts of deep neural network (DNN) classifiers. In
high stakes applications, faithful and robust explanations are important to
understand and gain trust in DNN classifiers. However, recent work has shown
that state-of-the-art attribution methods in text classifiers are susceptible
to imperceptible adversarial perturbations that alter explanations
significantly while maintaining the correct prediction outcome. If undetected,
this can critically mislead the users of DNNs. Thus, it is crucial to
understand the influence of such adversarial perturbations on the networks'
explanations and their perceptibility. In this work, we establish a novel
definition of attribution robustness (AR) in text classification, based on
Lipschitz continuity. Crucially, it reflects both attribution change induced by
adversarial input alterations and perceptibility of such alterations. Moreover,
we introduce a wide set of text similarity measures to effectively capture
locality between two text samples and imperceptibility of adversarial
perturbations in text. We then propose our novel TransformerExplanationAttack
(TEA), a strong adversary that provides a tight estimation for attribution
robustness in text classification. TEA uses state-of-the-art language models to
extract word substitutions that result in fluent, contextual adversarial
samples. Finally, with experiments on several text classification
architectures, we show that TEA consistently outperforms current
state-of-the-art AR estimators, yielding perturbations that alter explanations
to a greater extent while being more fluent and less perceptible.
Related papers
- Lost In Translation: Generating Adversarial Examples Robust to Round-Trip Translation [66.33340583035374]
  We present a comprehensive study on the robustness of current text adversarial attacks to round-trip translation.
  We demonstrate that 6 state-of-the-art text-based adversarial attacks do not maintain their efficacy after round-trip translation.
  We introduce an intervention-based solution to this problem, by integrating Machine Translation into the process of adversarial example generation.
  (2023-07-24T04:29:43Z)
- Interpretability and Transparency-Driven Detection and Transformation of Textual Adversarial Examples (IT-DT) [0.5729426778193399]
  We propose the Interpretability and Transparency-Driven Detection and Transformation (IT-DT) framework.
  It focuses on interpretability and transparency in detecting and transforming textual adversarial examples.
  IT-DT significantly improves the resilience and trustworthiness of transformer-based text classifiers against adversarial attacks.
  (2023-07-03T03:17:20Z)
- In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
  We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
  Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT, against various popular attacks.
  We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
  (2022-12-20T14:06:50Z)
- Disentangled Text Representation Learning with Information-Theoretic Perspective for Adversarial Robustness [17.5771010094384]
  Adversarial vulnerability remains a major obstacle to constructing reliable NLP systems.
  Recent work argues the adversarial vulnerability of the model is caused by the non-robust features in supervised training.
  In this paper, we tackle the adversarial challenge from the view of disentangled representation learning.
  (2022-10-26T18:14:39Z)
- Beyond Model Interpretability: On the Faithfulness and Adversarial Robustness of Contrastive Textual Explanations [2.543865489517869]
  This work motivates textual counterfactuals by laying the ground for a novel evaluation scheme inspired by the faithfulness of explanations.
  Experiments on sentiment analysis data show that the connectedness of counterfactuals to their original counterparts is not obvious in both models.
  (2022-10-17T09:50:02Z)
- Fooling Explanations in Text Classifiers [42.49606659285249]
  We introduce TextExplanationFooler (TEF), a novel explanation attack algorithm that alters text input samples imperceptibly.
  TEF can significantly decrease the correlation between unchanged and perturbed input attributions.
  We show that TEF perturbations are also effective in scenarios where the target model and explanation method are unknown.
  (2022-06-07T10:58:08Z)
- Block-Sparse Adversarial Attack to Fool Transformer-Based Text Classifiers [49.50163349643615]
  In this paper, we propose a gradient-based adversarial attack against transformer-based text classifiers.
  Experimental results demonstrate that, while our adversarial attack maintains the semantics of the sentence, it can reduce the accuracy of GPT-2 to less than 5%.
  (2022-03-11T14:37:41Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
  Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
  This paper aims to address the issue with a mask-and-predict strategy.
  We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
  Experiments on Semantic Textual Similarity show NDD to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
  (2021-10-04T03:59:15Z)
- Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [52.86058031919856]
  We propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition.
  GSRM is introduced to capture global semantic context through multi-way parallel transmission.
  Results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method.
  (2020-03-27T09:19:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.