Textual Manifold-based Defense Against Natural Language Adversarial
Examples
- URL: http://arxiv.org/abs/2211.02878v1
- Date: Sat, 5 Nov 2022 11:19:47 GMT
- Title: Textual Manifold-based Defense Against Natural Language Adversarial
Examples
- Authors: Dang Minh Nguyen, Luu Anh Tuan
- Abstract summary: We find that adversarial texts tend to have their embeddings diverge from the manifold of natural ones.
We propose Textual Manifold-based Defense (TMD), a defense mechanism that projects text embeddings onto an approximated embedding manifold.
Our method consistently and significantly outperforms previous defenses without trading off clean accuracy.
- Score: 10.140147080535222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies on adversarial images have shown that they tend to leave the
underlying low-dimensional data manifold, making them significantly more
challenging for current models to make correct predictions. This so-called
off-manifold conjecture has inspired a novel line of defenses against
adversarial attacks on images. In this study, we find a similar phenomenon
occurs in the contextualized embedding space induced by pretrained language
models, in which adversarial texts tend to have their embeddings diverge from
the manifold of natural ones. Based on this finding, we propose Textual
Manifold-based Defense (TMD), a defense mechanism that projects text embeddings
onto an approximated embedding manifold before classification. It reduces the
complexity of potential adversarial examples, which ultimately enhances the
robustness of the protected model. Through extensive experiments, our method
consistently and significantly outperforms previous defenses under various
attack settings without trading off clean accuracy. To the best of our
knowledge, this is the first NLP defense that leverages the manifold structure
against adversarial attacks. Our code is available at
https://github.com/dangne/tmd.
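The projection step can be pictured with a minimal sketch, assuming the manifold is approximated by a pretrained generative model (a hypothetical `generator` mapping a low-dimensional latent code into the embedding space); the optimizer, step count, and all names are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal sketch of manifold projection before classification (illustrative only).
# `generator` approximates the natural-embedding manifold and `classifier` is the
# protected model's classification head; both names are hypothetical placeholders.
import torch

def project_onto_manifold(embedding, generator, latent_dim=16, steps=100, lr=0.1):
    """Search the latent space for the manifold point closest to `embedding`."""
    z = torch.zeros(embedding.size(0), latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = generator(z)                      # a point on the learned manifold
        loss = ((recon - embedding) ** 2).sum()   # distance to the input embedding
        loss.backward()
        opt.step()
    return generator(z).detach()

def robust_predict(text_embedding, generator, classifier):
    # Replace a (possibly adversarial) embedding with its manifold projection,
    # then classify the projected embedding.
    projected = project_onto_manifold(text_embedding, generator)
    return classifier(projected)
```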
Related papers
- DiffuseDef: Improved Robustness to Adversarial Attacks [38.34642687239535]
Adversarial attacks pose a critical challenge to systems built using pretrained language models.
We propose DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier.
During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation.
arXiv Detail & Related papers (2024-06-28T22:36:17Z)
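A rough sketch of the denoise-and-ensemble inference summarized above, assuming a hypothetical `denoiser` (the diffusion layer) and `classifier`; the noise scale, step count, ensemble size, and the denoiser's signature are guesses rather than DiffuseDef's actual design.

```python
# Illustrative only: noise the hidden state, denoise it iteratively, and
# ensemble the resulting predictions. All names and hyperparameters are assumed.
import torch

def diffused_inference(hidden, denoiser, classifier,
                       n_samples=5, n_steps=10, noise_scale=0.1):
    logits = []
    for _ in range(n_samples):
        h = hidden + noise_scale * torch.randn_like(hidden)  # combine with sampled noise
        for t in reversed(range(n_steps)):
            h = denoiser(h, t)                               # iterative denoising
        logits.append(classifier(h))
    return torch.stack(logits).mean(dim=0)                   # ensemble the outputs
```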
- Attacking Byzantine Robust Aggregation in High Dimensions [13.932039723114299]
Training modern neural networks or models typically requires averaging over a sample of high-dimensional vectors.
Poisoning attacks can skew or bias the average vectors used to train the model, forcing the model to learn specific patterns or avoid learning anything useful.
We present a new attack called HIDRA on practical realizations of strong defenses, which subverts their claim of dimension-independent bias.
arXiv Detail & Related papers (2023-12-22T06:25:46Z)
- Fooling the Textual Fooler via Randomizing Latent Representations [13.77424820701913]
Adversarial word-level perturbations are well-studied and effective attack strategies.
We propose a lightweight and attack-agnostic defense whose main goal is to perplex the process of generating an adversarial example.
We empirically demonstrate near state-of-the-art robustness of AdvFooler against representative adversarial word-level attacks.
arXiv Detail & Related papers (2023-10-02T06:57:25Z)
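As a sketch of randomizing latent representations (the mechanism suggested by the title; the summary above does not spell it out), inference-time noise on the hidden states could look like the following; `encoder`, `classifier`, and the noise scale are hypothetical.

```python
# Illustrative sketch: random noise on the latent representation makes the
# feedback an attacker gets from repeated queries unreliable, perplexing the
# search for adversarial word substitutions. Names are placeholders.
import torch

def randomized_predict(inputs, encoder, classifier, noise_scale=0.05):
    hidden = encoder(inputs)
    hidden = hidden + noise_scale * torch.randn_like(hidden)
    return classifier(hidden)
```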
- TextDefense: Adversarial Text Detection based on Word Importance Entropy [38.632552667871295]
We propose TextDefense, a new adversarial example detection framework for NLP models.
Our experiments show that TextDefense can be applied to different architectures, datasets, and attack methods.
We provide our insights into the adversarial attacks in NLP and the principles of our defense method.
arXiv Detail & Related papers (2023-02-12T11:12:44Z)
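A loose sketch of a word-importance-entropy score, inferred from the title rather than from the paper's specification; the leave-one-out importance definition, the `predict_prob` helper, and how the score would be thresholded for detection are all assumptions.

```python
# Hypothetical word-importance-entropy score: estimate each word's importance by
# the leave-one-out probability drop, normalize, and take the Shannon entropy.
import math

def importance_entropy(words, predict_prob):
    base = predict_prob(words)  # P(predicted label | full text)
    scores = [max(base - predict_prob(words[:i] + words[i + 1:]), 1e-12)
              for i in range(len(words))]
    total = sum(scores)
    probs = [s / total for s in scores]
    return -sum(p * math.log(p) for p in probs)
```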
- In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
Our experiments show that label smoothing significantly improves the adversarial robustness of pre-trained models such as BERT against various popular attacks.
We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
arXiv Detail & Related papers (2022-12-20T14:06:50Z)
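Label smoothing itself is standard; a minimal PyTorch sketch (not the paper's code) is below.

```python
# Standard label smoothing: soften the one-hot targets so the model is less
# over-confident, which the paper above links to adversarial robustness.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(4, 3)                 # dummy outputs: batch of 4, 3 classes
targets = torch.tensor([0, 2, 1, 0])
loss = criterion(logits, targets)
```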
- ADDMU: Detection of Far-Boundary Adversarial Examples with Data and Model Uncertainty Estimation [125.52743832477404]
Adversarial Examples Detection (AED) is a crucial defense technique against adversarial attacks.
We propose a new technique, ADDMU, which combines two types of uncertainty estimation for both regular and far-boundary (FB) adversarial example detection.
Our new method outperforms previous methods by 3.6 and 6.0 AUC points under each scenario.
arXiv Detail & Related papers (2022-10-22T09:11:12Z)
- Rethinking Textual Adversarial Defense for Pre-trained Language Models [79.18455635071817]
A literature review shows that pre-trained language models (PrLMs) are vulnerable to adversarial attacks.
We propose a novel metric (Degree of Anomaly) to enable current adversarial attack approaches to generate more natural and imperceptible adversarial examples.
We show that our universal defense framework achieves after-attack accuracy comparable to, or even higher than, other attack-specific defenses.
arXiv Detail & Related papers (2022-07-21T07:51:45Z)
- Searching for an Effective Defender: Benchmarking Defense against Adversarial Word Substitution [83.84968082791444]
Deep neural networks are vulnerable to intentionally crafted adversarial examples.
Various methods have been proposed to defend against adversarial word-substitution attacks for neural NLP models.
arXiv Detail & Related papers (2021-08-29T08:11:36Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method achieves a 33.18 BLEU score on IWSLT14 German-English translation, an improvement of 1.47 over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)
- Adversarial Example Games [51.92698856933169]
Adversarial Example Games (AEG) is a framework that models the crafting of adversarial examples.
AEG provides a new way to design adversarial examples by adversarially training a generator and an adversary from a given hypothesis class.
We demonstrate the efficacy of AEG on the MNIST and CIFAR-10 datasets.
arXiv Detail & Related papers (2020-07-01T19:47:23Z)
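A very loose sketch of the generator-versus-adversary game described above; the architectures, perturbation bound, losses, and alternating schedule are assumptions rather than the AEG formulation.

```python
# Illustrative alternating step: a generator crafts bounded perturbations to
# maximize the classifier's loss, while the classifier (the adversary drawn
# from the hypothesis class) trains against them. Names are placeholders.
import torch
import torch.nn.functional as F

def aeg_step(x, y, generator, classifier, g_opt, c_opt, eps=0.3):
    # Generator update: propose a perturbation that fools the classifier.
    delta = eps * torch.tanh(generator(x))
    g_loss = -F.cross_entropy(classifier(x + delta), y)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    # Adversary update: train the classifier on the generated attack.
    with torch.no_grad():
        delta = eps * torch.tanh(generator(x))
    c_loss = F.cross_entropy(classifier(x + delta), y)
    c_opt.zero_grad()
    c_loss.backward()
    c_opt.step()
```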