Interpretability and Transparency-Driven Detection and Transformation of
Textual Adversarial Examples (IT-DT)
- URL: http://arxiv.org/abs/2307.01225v1
- Date: Mon, 3 Jul 2023 03:17:20 GMT
- Title: Interpretability and Transparency-Driven Detection and Transformation of
Textual Adversarial Examples (IT-DT)
- Authors: Bushra Sabir, M. Ali Babar, Sharif Abuadbba
- Abstract summary: We propose the Interpretability and Transparency-Driven Detection and Transformation (IT-DT) framework.
It focuses on interpretability and transparency in detecting and transforming textual adversarial examples.
IT-DT significantly improves the resilience and trustworthiness of transformer-based text classifiers against adversarial attacks.
- Score: 0.5729426778193399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based text classifiers like BERT, RoBERTa, T5, and GPT-3 have
shown impressive performance in NLP. However, their vulnerability to
adversarial examples poses a security risk. Existing defense methods lack
interpretability, making it hard to understand adversarial classifications and
identify model vulnerabilities. To address this, we propose the
Interpretability and Transparency-Driven Detection and Transformation (IT-DT)
framework. It focuses on interpretability and transparency in detecting and
transforming textual adversarial examples. IT-DT utilizes techniques like
attention maps, integrated gradients, and model feedback for interpretability
during detection. This helps identify salient features and perturbed words
contributing to adversarial classifications. In the transformation phase, IT-DT
uses pre-trained embeddings and model feedback to generate optimal replacements
for perturbed words. By finding suitable substitutions, we aim to convert
adversarial examples into non-adversarial counterparts that align with the
model's intended behavior while preserving the text's meaning. Transparency is
emphasized through human expert involvement. Experts review and provide
feedback on detection and transformation results, enhancing decision-making,
especially in complex scenarios. The framework generates insights and threat
intelligence, empowering analysts to identify vulnerabilities and improve model
robustness. Comprehensive experiments demonstrate the effectiveness of IT-DT in
detecting and transforming adversarial examples. The approach enhances
interpretability, provides transparency, and enables accurate identification
and successful transformation of adversarial inputs. By combining technical
analysis and human expertise, IT-DT significantly improves the resilience and
trustworthiness of transformer-based text classifiers against adversarial
attacks.
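As a rough illustration of the detection phase, the sketch below computes integrated-gradients saliency scores per token. The classifier, vocabulary, and mean-pooling here are toy stand-ins (IT-DT targets transformers such as BERT); only the attribution recipe itself should be read as the standard integrated-gradients formulation.
```python
import torch
import torch.nn as nn

class ToyClassifier(nn.Module):
    """Stand-in for a transformer text classifier (e.g. BERT)."""
    def __init__(self, vocab_size=100, dim=16, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, n_classes)

    def from_embeddings(self, embs):
        # Mean-pool token embeddings, then classify.
        return self.head(embs.mean(dim=1))

def integrated_gradients(model, token_ids, target, steps=32):
    """Per-token saliency: attribute the target logit to each embedding."""
    embs = model.emb(token_ids).detach()   # (1, seq_len, dim)
    baseline = torch.zeros_like(embs)      # all-zeros reference input
    grad_sum = torch.zeros_like(embs)
    for k in range(1, steps + 1):
        point = (baseline + (k / steps) * (embs - baseline)).requires_grad_(True)
        logit = model.from_embeddings(point)[0, target]
        grad_sum += torch.autograd.grad(logit, point)[0]
    # Riemann approximation of the path integral; summing over the
    # embedding dimension yields one score per token.
    return ((embs - baseline) * grad_sum / steps).sum(dim=-1)

model = ToyClassifier()
ids = torch.tensor([[5, 17, 42, 3]])       # a "tokenized" input
pred = model.from_embeddings(model.emb(ids)).argmax(dim=-1).item()
print(integrated_gradients(model, ids, target=pred))
```
Tokens with outsized attribution magnitude are the natural candidates for the perturbed words the abstract describes.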
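The transformation phase can be sketched in the same hedged spirit: look up neighbors of a flagged word in a pre-trained embedding space and accept the first substitution the model's feedback validates. The tiny embedding table and the `classify` stub below are hypothetical stand-ins, not the authors' implementation.
```python
import numpy as np

EMB = {  # hypothetical stand-in for pre-trained embeddings (e.g. GloVe)
    "good":  np.array([0.90, 0.10]),
    "great": np.array([0.85, 0.20]),
    "fine":  np.array([0.70, 0.30]),
    "g00d":  np.array([0.20, 0.80]),  # a character-level perturbation
}

def neighbors(word, k=3):
    """Nearest words by cosine similarity, excluding the word itself."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return sorted((w for w in EMB if w != word),
                  key=lambda w: cos(EMB[word], EMB[w]), reverse=True)[:k]

def transform(tokens, flagged_idx, classify, expected_label):
    """Try embedding neighbors of the flagged token until the model's
    prediction (the abstract's 'model feedback') matches expectations."""
    for cand in neighbors(tokens[flagged_idx]):
        trial = tokens[:flagged_idx] + [cand] + tokens[flagged_idx + 1:]
        if classify(trial) == expected_label:
            return trial
    return None  # no safe substitution found

# Toy usage: a "classifier" that is fooled only by the perturbed token.
classify = lambda toks: "negative" if "g00d" in toks else "positive"
print(transform(["a", "g00d", "movie"], 1, classify, "positive"))
# -> ['a', 'fine', 'movie']: the nearest neighbor that restores the label
```
When no neighbor passes the check, the sketch returns None; that kind of unresolved input is plausibly what the abstract's human-expert review is meant to catch.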
Related papers
- Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack [51.16384207202798]
Vision-language pre-training models are vulnerable to multimodal adversarial examples (AEs).
Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process.
We propose sampling from adversarial evolution triangles composed of clean, historical, and current adversarial examples to enhance adversarial diversity.
arXiv Detail & Related papers (2024-11-04T23:07:51Z)
- Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding [118.75567341513897]
Existing methods typically analyze target text in isolation or solely with non-member contexts.
We propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts.
arXiv Detail & Related papers (2024-09-05T09:10:38Z)
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- Lost In Translation: Generating Adversarial Examples Robust to Round-Trip Translation [66.33340583035374]
We present a comprehensive study on the robustness of current text adversarial attacks to round-trip translation.
We demonstrate that 6 state-of-the-art text-based adversarial attacks do not maintain their efficacy after round-trip translation.
We introduce an intervention-based solution to this problem, by integrating Machine Translation into the process of adversarial example generation.
arXiv Detail & Related papers (2023-07-24T04:29:43Z)
- In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT against various popular attacks.
We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
arXiv Detail & Related papers (2022-12-20T14:06:50Z)
- Estimating the Adversarial Robustness of Attributions in Text with Transformers [44.745873282080346]
We establish a novel definition of attribution robustness (AR) in text classification, based on Lipschitz continuity.
We then propose our novel TransformerExplanationAttack (TEA), a strong adversary that provides a tight estimate of attribution robustness in text classification.
arXiv Detail & Related papers (2022-12-18T20:18:59Z)
- Disentangled Text Representation Learning with Information-Theoretic Perspective for Adversarial Robustness [17.5771010094384]
Adversarial vulnerability remains a major obstacle to constructing reliable NLP systems.
Recent work argues the adversarial vulnerability of the model is caused by the non-robust features in supervised training.
In this paper, we tackle the adversarial challenge from the view of disentangled representation learning.
arXiv Detail & Related papers (2022-10-26T18:14:39Z)
- Beyond Model Interpretability: On the Faithfulness and Adversarial Robustness of Contrastive Textual Explanations [2.543865489517869]
This work motivates textual counterfactuals by laying the groundwork for a novel evaluation scheme inspired by the faithfulness of explanations.
Experiments on sentiment analysis data show that the connectedness of counterfactuals to their original counterparts is not obvious in either model.
arXiv Detail & Related papers (2022-10-17T09:50:02Z)
- Semantically Distributed Robust Optimization for Vision-and-Language Inference [34.83271008148651]
We present SDRO, a model-agnostic method that utilizes a set of linguistic transformations in a distributed robust optimization setting.
Experiments on benchmark datasets with images and video demonstrate performance improvements as well as robustness to adversarial attacks.
arXiv Detail & Related papers (2021-10-14T06:02:46Z)
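The SDRO entry above reduces, at its core, to optimizing against the worst case over a set of semantics-preserving transformations. A minimal sketch of one such worst-case training step follows; the transforms and the model are illustrative stand-ins, not the SDRO implementation.
```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Stand-in for a vision-and-language inference model."""
    def __init__(self, vocab_size=50, dim=8, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, ids):
        return self.head(self.emb(ids).mean(dim=1))

# Hypothetical "linguistic transformations" over token-id tensors;
# real SDRO applies semantics-preserving transforms to the text itself.
transforms = [
    lambda x: x,                                        # identity
    lambda x: torch.where(x == 5, torch.tensor(6), x),  # synonym-swap stub
    lambda x: x[:, 1:],                                 # drop first token
]

model = ToyModel()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.tensor([[5, 12, 7, 9]]), torch.tensor([1])

# One robust-optimization step: score every variant, train on the worst.
losses = torch.stack([loss_fn(model(t(x)), y) for t in transforms])
worst = losses.max()
opt.zero_grad()
worst.backward()
opt.step()
print([round(l, 3) for l in losses.tolist()], "worst:", round(worst.item(), 3))
```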