Fooling Explanations in Text Classifiers
- URL: http://arxiv.org/abs/2206.03178v1
- Date: Tue, 7 Jun 2022 10:58:08 GMT
- Title: Fooling Explanations in Text Classifiers
- Authors: Adam Ivankay, Ivan Girardi, Chiara Marchiori, Pascal Frossard
- Abstract summary: We introduce TextExplanationFooler (TEF), a novel explanation attack algorithm that alters text input samples imperceptibly.
TEF can significantly decrease the correlation between unchanged and perturbed input attributions.
We show that TEF perturbations are also effective in scenarios where the target model and explanation method are unknown.
- Score: 42.49606659285249
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art text classification models are becoming increasingly reliant
on deep neural networks (DNNs). Due to their black-box nature, faithful and
robust explanation methods need to accompany classifiers for deployment in
real-life scenarios. However, it has been shown in vision applications that
explanation methods are susceptible to local, imperceptible perturbations that
can significantly alter the explanations without changing the predicted
classes. We show here that the existence of such perturbations extends to text
classifiers as well. Specifically, we introduce TextExplanationFooler (TEF), a
novel explanation attack algorithm that alters text input samples imperceptibly
so that the outcome of widely-used explanation methods changes considerably
while leaving classifier predictions unchanged. We evaluate the attribution
robustness estimation performance of TEF on five sequence
classification datasets, utilizing three DNN architectures and three
transformer architectures for each dataset. TEF can significantly decrease the
correlation between unchanged and perturbed input attributions, which shows
that all models and explanation methods are susceptible to TEF perturbations.
Moreover, we evaluate how the perturbations transfer to other model
architectures and attribution methods, and show that TEF perturbations are also
effective in scenarios where the target model and explanation method are
unknown. Finally, we introduce a semi-universal attack that computes fast,
computationally light perturbations with no knowledge of either the attacked
classifier or the explanation method. Overall, our work shows that explanations
in text classifiers are highly fragile, and users must carefully assess their
robustness before relying on them in critical applications.
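The correlation drop reported in the abstract can be illustrated with a small, self-contained sketch. The toy bag-of-words classifier, vocabulary, gradient-times-input attribution, and synonym-style edit below are illustrative assumptions, not the TEF algorithm from the paper; the sketch only shows how agreement between attributions of an original and a perturbed input can be quantified while checking that the prediction stays essentially the same.

```python
# Illustrative sketch only -- not the TEF attack. A toy bag-of-words logistic
# classifier with gradient-times-input attributions; we apply a small,
# synonym-style edit and measure how much the attribution ranking shifts
# (TEF-style attacks search for edits that drive this correlation down
# while keeping the predicted class fixed).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
vocab = ["the", "movie", "film", "was", "truly", "really", "great", "awful"]
W = rng.normal(size=len(vocab))  # hypothetical classifier weights

def encode(tokens):
    x = np.zeros(len(vocab))
    for t in tokens:
        x[vocab.index(t)] += 1.0
    return x

def predict(tokens):
    # Sigmoid output of a linear bag-of-words model: P(class = positive).
    return 1.0 / (1.0 + np.exp(-W @ encode(tokens)))

def attribute(tokens):
    # Gradient x input; for a linear model this is weight * token count.
    return W * encode(tokens)

original  = ["the", "movie", "was", "truly", "great"]
perturbed = ["the", "film", "was", "really", "great"]  # imperceptible-style edit

print("prediction (original) :", round(float(predict(original)), 3))
print("prediction (perturbed):", round(float(predict(perturbed)), 3))

# Rank correlation between the two attribution maps over the shared vocabulary;
# low values indicate the explanation changed even if the label did not.
rho, _ = spearmanr(attribute(original), attribute(perturbed))
print("Spearman correlation of attributions:", round(float(rho), 3))
```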
Related papers
- A Comparative Analysis of Counterfactual Explanation Methods for Text Classifiers [0.0]
We evaluate five methods for generating counterfactual explanations for a BERT text classifier.
Established white-box substitution-based methods are effective at generating valid counterfactuals that change the classifier's output.
Newer methods based on large language models (LLMs) excel at producing natural and linguistically plausible text counterfactuals.
arXiv Detail & Related papers (2024-11-04T22:01:52Z) - Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate limitations by leveraging improved explanations for Transformers.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z) - Introducing User Feedback-based Counterfactual Explanations (UFCE) [49.1574468325115]
Counterfactual explanations (CEs) have emerged as a viable solution for generating comprehensible explanations in XAI.
UFCE allows for the inclusion of user constraints to determine the smallest modifications in the subset of actionable features.
UFCE outperforms two well-known CE methods in terms of proximity, sparsity, and feasibility.
arXiv Detail & Related papers (2024-02-26T20:09:44Z) - Counterfactuals of Counterfactuals: a back-translation-inspired approach
to analyse counterfactual editors [3.4253416336476246]
We focus on the analysis of counterfactual, contrastive explanations.
We propose a new back-translation-inspired evaluation methodology.
We show that by iteratively feeding the counterfactual to the explainer we can obtain valuable insights into the behaviour of both the predictor and the explainer models.
arXiv Detail & Related papers (2023-05-26T16:04:28Z) - Adversarial Counterfactual Visual Explanations [0.7366405857677227]
This paper proposes an elegant method to turn adversarial attacks into semantically meaningful perturbations.
The proposed approach hypothesizes that Denoising Diffusion Probabilistic Models are excellent regularizers for avoiding high-frequency and out-of-distribution perturbations.
arXiv Detail & Related papers (2023-03-17T13:34:38Z) - Feature Perturbation Augmentation for Reliable Evaluation of Importance
Estimators in Neural Networks [5.439020425819001]
Post-hoc interpretability methods attempt to make the inner workings of deep neural networks more interpretable.
One of the most popular evaluation frameworks is to perturb features deemed important by an interpretability method.
We propose feature perturbation augmentation (FPA), which creates and adds perturbed images during model training.
arXiv Detail & Related papers (2023-03-02T19:05:46Z) - Estimating the Adversarial Robustness of Attributions in Text with
Transformers [44.745873282080346]
We establish a novel definition of attribution robustness (AR) in text classification, based on Lipschitz continuity.
We then propose our novel TransformerExplanationAttack (TEA), a strong adversary that provides a tight estimation of attribution robustness in text classification.
arXiv Detail & Related papers (2022-12-18T20:18:59Z) - Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inference heuristics based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z) - Understanding and Diagnosing Vulnerability under Adversarial Attacks [62.661498155101654]
Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks.
We propose a novel interpretability method, InterpretGAN, to generate explanations for features used for classification in latent variables.
We also design the first diagnostic method to quantify the vulnerability contributed by each layer.
arXiv Detail & Related papers (2020-07-17T01:56:28Z)