Interpretability and Transparency-Driven Detection and Transformation of
Textual Adversarial Examples (IT-DT)
- URL: http://arxiv.org/abs/2307.01225v1
- Date: Mon, 3 Jul 2023 03:17:20 GMT
- Title: Interpretability and Transparency-Driven Detection and Transformation of
Textual Adversarial Examples (IT-DT)
- Authors: Bushra Sabir, M. Ali Babar, Sharif Abuadbba
- Abstract summary: We propose the Interpretability and Transparency-Driven Detection and Transformation (IT-DT) framework.
It focuses on interpretability and transparency in detecting and transforming textual adversarial examples.
IT-DT significantly improves the resilience and trustworthiness of transformer-based text classifiers against adversarial attacks.
- Score: 0.5729426778193399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based text classifiers like BERT, RoBERTa, T5, and GPT-3 have
shown impressive performance in NLP. However, their vulnerability to
adversarial examples poses a security risk. Existing defense methods lack
interpretability, making it hard to understand adversarial classifications and
identify model vulnerabilities. To address this, we propose the
Interpretability and Transparency-Driven Detection and Transformation (IT-DT)
framework. It focuses on interpretability and transparency in detecting and
transforming textual adversarial examples. IT-DT utilizes techniques like
attention maps, integrated gradients, and model feedback for interpretability
during detection. This helps identify salient features and perturbed words
contributing to adversarial classifications. In the transformation phase, IT-DT
uses pre-trained embeddings and model feedback to generate optimal replacements
for perturbed words. By finding suitable substitutions, we aim to convert
adversarial examples into non-adversarial counterparts that align with the
model's intended behavior while preserving the text's meaning. Transparency is
emphasized through human expert involvement. Experts review and provide
feedback on detection and transformation results, enhancing decision-making,
especially in complex scenarios. The framework generates insights and threat
intelligence, empowering analysts to identify vulnerabilities and improve model
robustness. Comprehensive experiments demonstrate the effectiveness of IT-DT in
detecting and transforming adversarial examples. The approach enhances
interpretability, provides transparency, and enables accurate identification
and successful transformation of adversarial inputs. By combining technical
analysis and human expertise, IT-DT significantly improves the resilience and
trustworthiness of transformer-based text classifiers against adversarial
attacks.
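As a rough illustration of the detection phase, the sketch below computes integrated-gradients saliency scores per token. The classifier, vocabulary, and mean-pooling here are toy stand-ins (IT-DT targets transformers such as BERT); only the attribution recipe itself should be read as the standard integrated-gradients formulation.
```python
import torch
import torch.nn as nn

class ToyClassifier(nn.Module):
    """Stand-in for a transformer text classifier (e.g. BERT)."""
    def __init__(self, vocab_size=100, dim=16, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, n_classes)

    def from_embeddings(self, embs):
        # Mean-pool token embeddings, then classify.
        return self.head(embs.mean(dim=1))

def integrated_gradients(model, token_ids, target, steps=32):
    """Per-token saliency: attribute the target logit to each embedding."""
    embs = model.emb(token_ids).detach()   # (1, seq_len, dim)
    baseline = torch.zeros_like(embs)      # all-zeros reference input
    grad_sum = torch.zeros_like(embs)
    for k in range(1, steps + 1):
        point = (baseline + (k / steps) * (embs - baseline)).requires_grad_(True)
        logit = model.from_embeddings(point)[0, target]
        grad_sum += torch.autograd.grad(logit, point)[0]
    # Riemann approximation of the path integral; summing over the
    # embedding dimension yields one score per token.
    return ((embs - baseline) * grad_sum / steps).sum(dim=-1)

model = ToyClassifier()
ids = torch.tensor([[5, 17, 42, 3]])       # a "tokenized" input
pred = model.from_embeddings(model.emb(ids)).argmax(dim=-1).item()
print(integrated_gradients(model, ids, target=pred))
```
Tokens with outsized attribution magnitude are the natural candidates for the perturbed words the abstract describes.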
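The transformation phase can be sketched in the same hedged spirit: look up neighbors of a flagged word in a pre-trained embedding space and accept the first substitution the model's feedback validates. The tiny embedding table and the `classify` stub below are hypothetical stand-ins, not the authors' implementation.
```python
import numpy as np

EMB = {  # hypothetical stand-in for pre-trained embeddings (e.g. GloVe)
    "good":  np.array([0.90, 0.10]),
    "great": np.array([0.85, 0.20]),
    "fine":  np.array([0.70, 0.30]),
    "g00d":  np.array([0.20, 0.80]),  # a character-level perturbation
}

def neighbors(word, k=3):
    """Nearest words by cosine similarity, excluding the word itself."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return sorted((w for w in EMB if w != word),
                  key=lambda w: cos(EMB[word], EMB[w]), reverse=True)[:k]

def transform(tokens, flagged_idx, classify, expected_label):
    """Try embedding neighbors of the flagged token until the model's
    prediction (the abstract's 'model feedback') matches expectations."""
    for cand in neighbors(tokens[flagged_idx]):
        trial = tokens[:flagged_idx] + [cand] + tokens[flagged_idx + 1:]
        if classify(trial) == expected_label:
            return trial
    return None  # no safe substitution found

# Toy usage: a "classifier" that is fooled only by the perturbed token.
classify = lambda toks: "negative" if "g00d" in toks else "positive"
print(transform(["a", "g00d", "movie"], 1, classify, "positive"))
# -> ['a', 'fine', 'movie']: the nearest neighbor that restores the label
```
When no neighbor passes the check, the sketch returns None; that kind of unresolved input is plausibly what the abstract's human-expert review is meant to catch.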
Related papers
- Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack [51.16384207202798]
Vision-language pre-training models are vulnerable to multimodal adversarial examples (AEs).
Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process.
We propose sampling from adversarial evolution triangles composed of clean, historical, and current adversarial examples to enhance adversarial diversity.
arXiv Detail & Related papers (2024-11-04T23:07:51Z)
- Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding [118.75567341513897]
Existing methods typically analyze target text in isolation or solely with non-member contexts.
We propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts.
arXiv Detail & Related papers (2024-09-05T09:10:38Z)
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- Lost In Translation: Generating Adversarial Examples Robust to Round-Trip Translation [66.33340583035374]
We present a comprehensive study on the robustness of current text adversarial attacks to round-trip translation.
We demonstrate that 6 state-of-the-art text-based adversarial attacks do not maintain their efficacy after round-trip translation.
We introduce an intervention-based solution to this problem, by integrating Machine Translation into the process of adversarial example generation.
arXiv Detail & Related papers (2023-07-24T04:29:43Z)
- In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT against various popular attacks.
We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
arXiv Detail & Related papers (2022-12-20T14:06:50Z)
- Estimating the Adversarial Robustness of Attributions in Text with Transformers [44.745873282080346]
We establish a novel definition of attribution robustness (AR) in text classification, based on Lipschitz continuity.
We then propose our novel TransformerExplanationAttack (TEA), a strong adversary that provides a tight estimate of attribution robustness in text classification.
arXiv Detail & Related papers (2022-12-18T20:18:59Z)
- Disentangled Text Representation Learning with Information-Theoretic Perspective for Adversarial Robustness [17.5771010094384]
Adversarial vulnerability remains a major obstacle to constructing reliable NLP systems.
Recent work argues the adversarial vulnerability of the model is caused by the non-robust features in supervised training.
In this paper, we tackle the adversarial challenge from the view of disentangled representation learning.
arXiv Detail & Related papers (2022-10-26T18:14:39Z)
- Beyond Model Interpretability: On the Faithfulness and Adversarial Robustness of Contrastive Textual Explanations [2.543865489517869]
This work motivates textual counterfactuals by laying the groundwork for a novel evaluation scheme inspired by the faithfulness of explanations.
Experiments on sentiment analysis data show that the connectedness of counterfactuals to their original counterparts is not obvious in either model.
arXiv Detail & Related papers (2022-10-17T09:50:02Z)
- Semantically Distributed Robust Optimization for Vision-and-Language Inference [34.83271008148651]
We present SDRO, a model-agnostic method that utilizes a set of linguistic transformations in a distributed robust optimization setting.
Experiments on benchmark datasets with images and video demonstrate performance improvements as well as robustness to adversarial attacks.
arXiv Detail & Related papers (2021-10-14T06:02:46Z)
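The SDRO entry above reduces, at its core, to optimizing against the worst case over a set of semantics-preserving transformations. A minimal sketch of one such worst-case training step follows; the transforms and the model are illustrative stand-ins, not the SDRO implementation.
```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Stand-in for a vision-and-language inference model."""
    def __init__(self, vocab_size=50, dim=8, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, ids):
        return self.head(self.emb(ids).mean(dim=1))

# Hypothetical "linguistic transformations" over token-id tensors;
# real SDRO applies semantics-preserving transforms to the text itself.
transforms = [
    lambda x: x,                                        # identity
    lambda x: torch.where(x == 5, torch.tensor(6), x),  # synonym-swap stub
    lambda x: x[:, 1:],                                 # drop first token
]

model = ToyModel()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.tensor([[5, 12, 7, 9]]), torch.tensor([1])

# One robust-optimization step: score every variant, train on the worst.
losses = torch.stack([loss_fn(model(t(x)), y) for t in transforms])
worst = losses.max()
opt.zero_grad()
worst.backward()
opt.step()
print([round(l, 3) for l in losses.tolist()], "worst:", round(worst.item(), 3))
```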