Fooling Explanations in Text Classifiers
- URL: http://arxiv.org/abs/2206.03178v1
- Date: Tue, 7 Jun 2022 10:58:08 GMT
- Title: Fooling Explanations in Text Classifiers
- Authors: Adam Ivankay, Ivan Girardi, Chiara Marchiori, Pascal Frossard
- Abstract summary: We introduce TextExplanationFooler (TEF), a novel explanation attack algorithm that alters text input samples imperceptibly.
TEF can significantly decrease the correlation between unchanged and perturbed input attributions.
We show that TEF perturbations are also effective in scenarios where the target model and explanation method are unknown.
- Score: 42.49606659285249
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art text classification models are becoming increasingly reliant
on deep neural networks (DNNs). Due to their black-box nature, faithful and
robust explanation methods need to accompany classifiers for deployment in
real-life scenarios. However, it has been shown in vision applications that
explanation methods are susceptible to local, imperceptible perturbations that
can significantly alter the explanations without changing the predicted
classes. We show here that the existence of such perturbations extends to text
classifiers as well. Specifically, we introduce TextExplanationFooler (TEF), a
novel explanation attack algorithm that alters text input samples imperceptibly
so that the outcome of widely-used explanation methods changes considerably
while leaving classifier predictions unchanged. We evaluate the attribution
robustness estimation performance of TEF on five sequence
classification datasets, utilizing three DNN architectures and three
transformer architectures for each dataset. TEF can significantly decrease the
correlation between unchanged and perturbed input attributions, which shows
that all models and explanation methods are susceptible to TEF perturbations.
Moreover, we evaluate how the perturbations transfer to other model
architectures and attribution methods, and show that TEF perturbations are also
effective in scenarios where the target model and explanation method are
unknown. Finally, we introduce a semi-universal attack that computes fast,
computationally light perturbations with no knowledge of either the attacked
classifier or the explanation method. Overall, our work shows that explanations
in text classifiers are highly fragile, and users must carefully assess their
robustness before relying on them in critical applications.
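The correlation drop reported in the abstract can be illustrated with a small, self-contained sketch. The toy bag-of-words classifier, vocabulary, gradient-times-input attribution, and synonym-style edit below are illustrative assumptions, not the TEF algorithm from the paper; the sketch only shows how agreement between attributions of an original and a perturbed input can be quantified while checking that the prediction stays essentially the same.

```python
# Illustrative sketch only -- not the TEF attack. A toy bag-of-words logistic
# classifier with gradient-times-input attributions; we apply a small,
# synonym-style edit and measure how much the attribution ranking shifts
# (TEF-style attacks search for edits that drive this correlation down
# while keeping the predicted class fixed).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
vocab = ["the", "movie", "film", "was", "truly", "really", "great", "awful"]
W = rng.normal(size=len(vocab))  # hypothetical classifier weights

def encode(tokens):
    x = np.zeros(len(vocab))
    for t in tokens:
        x[vocab.index(t)] += 1.0
    return x

def predict(tokens):
    # Sigmoid output of a linear bag-of-words model: P(class = positive).
    return 1.0 / (1.0 + np.exp(-W @ encode(tokens)))

def attribute(tokens):
    # Gradient x input; for a linear model this is weight * token count.
    return W * encode(tokens)

original  = ["the", "movie", "was", "truly", "great"]
perturbed = ["the", "film", "was", "really", "great"]  # imperceptible-style edit

print("prediction (original) :", round(float(predict(original)), 3))
print("prediction (perturbed):", round(float(predict(perturbed)), 3))

# Rank correlation between the two attribution maps over the shared vocabulary;
# low values indicate the explanation changed even if the label did not.
rho, _ = spearmanr(attribute(original), attribute(perturbed))
print("Spearman correlation of attributions:", round(float(rho), 3))
```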
Related papers
- A Comparative Analysis of Counterfactual Explanation Methods for Text Classifiers [0.0]
We evaluate five methods for generating counterfactual explanations for a BERT text classifier.
Established white-box substitution-based methods are effective at generating valid counterfactuals that change the classifier's output.
Newer methods based on large language models (LLMs) excel at producing natural and linguistically plausible text counterfactuals.
arXiv Detail & Related papers (2024-11-04T22:01:52Z) - Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate limitations by leveraging improved explanations for Transformers.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z) - Introducing User Feedback-based Counterfactual Explanations (UFCE) [49.1574468325115]
Counterfactual explanations (CEs) have emerged as a viable solution for generating comprehensible explanations in XAI.
UFCE allows for the inclusion of user constraints to determine the smallest modifications in the subset of actionable features.
UFCE outperforms two well-known CE methods in terms of proximity, sparsity, and feasibility.
arXiv Detail & Related papers (2024-02-26T20:09:44Z) - Counterfactuals of Counterfactuals: a back-translation-inspired approach
to analyse counterfactual editors [3.4253416336476246]
We focus on the analysis of counterfactual, contrastive explanations.
We propose a new back-translation-inspired evaluation methodology.
We show that by iteratively feeding the counterfactual to the explainer we can obtain valuable insights into the behaviour of both the predictor and the explainer models.
arXiv Detail & Related papers (2023-05-26T16:04:28Z) - Adversarial Counterfactual Visual Explanations [0.7366405857677227]
This paper proposes an elegant method to turn adversarial attacks into semantically meaningful perturbations.
The proposed approach hypothesizes that Denoising Diffusion Probabilistic Models are excellent regularizers for avoiding high-frequency and out-of-distribution perturbations.
arXiv Detail & Related papers (2023-03-17T13:34:38Z) - Feature Perturbation Augmentation for Reliable Evaluation of Importance
Estimators in Neural Networks [5.439020425819001]
Post-hoc interpretability methods attempt to make the inner workings of deep neural networks more interpretable.
One of the most popular evaluation frameworks is to perturb features deemed important by an interpretability method.
We propose feature perturbation augmentation (FPA), which creates and adds perturbed images during model training.
arXiv Detail & Related papers (2023-03-02T19:05:46Z) - Estimating the Adversarial Robustness of Attributions in Text with
Transformers [44.745873282080346]
We establish a novel definition of attribution robustness (AR) in text classification, based on Lipschitz continuity.
We then propose our novel TransformerExplanationAttack (TEA), a strong adversary that provides a tight estimation of attribution robustness in text classification.
arXiv Detail & Related papers (2022-12-18T20:18:59Z) - Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inference heuristics based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z) - Understanding and Diagnosing Vulnerability under Adversarial Attacks [62.661498155101654]
Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks.
We propose a novel interpretability method, InterpretGAN, to generate explanations for features used for classification in latent variables.
We also design the first diagnostic method to quantify the vulnerability contributed by each layer.
arXiv Detail & Related papers (2020-07-17T01:56:28Z)