Detecting Textual Adversarial Examples Based on Distributional
Characteristics of Data Representations
- URL: http://arxiv.org/abs/2204.13853v1
- Date: Fri, 29 Apr 2022 02:32:02 GMT
- Title: Detecting Textual Adversarial Examples Based on Distributional
Characteristics of Data Representations
- Authors: Na Liu, Mark Dras, Wei Emma Zhang
- Abstract summary: adversarial examples are constructed by adding small non-random perturbations to correctly classified inputs.
Approaches to adversarial attacks in natural language tasks have boomed in the last five years using character-level, word-level, or phrase-level perturbations.
We propose two new reactive methods for NLP to fill this gap.
Adapted LID and MDRE obtain state-of-the-art results on character-level, word-level, and phrase-level attacks on the IMDB dataset.
- Score: 11.93653349589025
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Although deep neural networks have achieved state-of-the-art performance in
various machine learning tasks, adversarial examples, constructed by adding
small non-random perturbations to correctly classified inputs, successfully
fool highly expressive deep classifiers into incorrect predictions. Approaches
to adversarial attacks in natural language tasks have boomed in the last five
years using character-level, word-level, phrase-level, or sentence-level
textual perturbations. While there is some work in NLP on defending against
such attacks through proactive methods, like adversarial training, there is to
our knowledge no effective general reactive approach to defence via detection
of textual adversarial examples of the kind found in the image processing
literature. In this paper, we propose two new reactive methods for NLP to fill
this gap, which, unlike the few limited-application baselines from NLP, are based
entirely on distributional characteristics of learned representations: we adapt
one from the image processing literature (Local Intrinsic Dimensionality
(LID)), and propose a novel one (MultiDistance Representation Ensemble Method
(MDRE)). Adapted LID and MDRE obtain state-of-the-art results on
character-level, word-level, and phrase-level attacks on the IMDB dataset as
well as on the latter two with respect to the MultiNLI dataset. For future
research, we publish our code.
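As a rough illustration of the representation-based detection idea, the sketch below shows the standard maximum-likelihood estimator of Local Intrinsic Dimensionality computed from nearest-neighbour distances against a batch of reference representations. It is a minimal sketch, not the paper's released code; the neighbourhood size, the choice of layer, and the downstream detector are assumptions.

```python
import numpy as np

def lid_mle(queries, reference, k=20):
    """Maximum-likelihood estimate of Local Intrinsic Dimensionality (LID)
    for each query representation, measured against a batch of reference
    representations (e.g. hidden states of clean training sentences).

    queries:   (n, d) array of representations to score
    reference: (m, d) array of representations from normal data
    k:         number of nearest neighbours (hypothetical default)
    """
    # pairwise Euclidean distances between queries and the reference batch
    dists = np.linalg.norm(queries[:, None, :] - reference[None, :, :], axis=-1)
    # distances to the k nearest reference points (assumes queries are not
    # themselves members of the reference batch)
    knn = np.sort(dists, axis=1)[:, :k]
    r_k = knn[:, -1:]  # distance to the k-th neighbour
    # LID_hat(x) = -( (1/k) * sum_i log(r_i / r_k) )^(-1)
    return -1.0 / np.mean(np.log(knn / r_k + 1e-12), axis=1)
```

In a detector of this kind, LID scores computed per layer for clean and adversarial examples would typically serve as features for a simple classifier (e.g. logistic regression) that flags adversarial inputs; MDRE, described in the abstract as an ensemble over representation distances, is not sketched here.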
Related papers
- SemRoDe: Macro Adversarial Training to Learn Representations That are Robust to Word-Level Attacks [29.942001958562567]
We propose a novel approach called Semantic Robust Defence (SemRoDe) to enhance the robustness of language models.
Our method learns a robust representation that bridges these two domains.
The results demonstrate promising state-of-the-art robustness.
arXiv Detail & Related papers (2024-03-27T10:24:25Z)
- TextDefense: Adversarial Text Detection based on Word Importance Entropy [38.632552667871295]
We propose TextDefense, a new adversarial example detection framework for NLP models.
Our experiments show that TextDefense can be applied to different architectures, datasets, and attack methods.
We provide our insights into the adversarial attacks in NLP and the principles of our defense method.
arXiv Detail & Related papers (2023-02-12T11:12:44Z)
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can only access the predicted label.
We propose a novel hard-label attack, the Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks in both attack performance and adversarial example quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- Phrase-level Adversarial Example Generation for Neural Machine Translation [75.01476479100569]
We propose a phrase-level adversarial example generation (PAEG) method to enhance the robustness of neural machine translation models.
We verify our method on three benchmarks, including LDC Chinese-English, IWSLT14 German-English, and WMT14 English-German tasks.
arXiv Detail & Related papers (2022-01-06T11:00:49Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- Bridge the Gap Between CV and NLP! A Gradient-based Textual Adversarial Attack Framework [17.17479625646699]
We propose a unified framework to craft textual adversarial samples.
In this paper, we instantiate our framework with an attack algorithm named Textual Projected Gradient Descent (T-PGD).
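The name suggests projected gradient descent carried out in the continuous embedding space. The minimal sketch below shows one generic PGD step on token embeddings under an L-infinity budget, purely as an assumption about the flavour of the method; the step size and budget are hypothetical, and the decoding of perturbed embeddings back to discrete tokens, which a textual attack also needs, is not shown.

```python
import torch

def pgd_step(embeddings, grad, orig_embeddings, step_size=0.01, epsilon=0.1):
    """One projected-gradient-descent step on continuous token embeddings.

    embeddings:      (seq_len, dim) current perturbed embeddings
    grad:            gradient of the adversarial loss w.r.t. the embeddings
    orig_embeddings: embeddings of the original (unperturbed) tokens
    step_size, epsilon: hypothetical step size and L-infinity budget
    """
    # ascend the loss along the sign of the gradient
    perturbed = embeddings + step_size * grad.sign()
    # project back into the epsilon-ball around the original embeddings
    delta = torch.clamp(perturbed - orig_embeddings, -epsilon, epsilon)
    return orig_embeddings + delta
```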
arXiv Detail & Related papers (2021-10-28T17:31:51Z)
- Searching for an Effective Defender: Benchmarking Defense against Adversarial Word Substitution [83.84968082791444]
Deep neural networks are vulnerable to intentionally crafted adversarial examples.
Various methods have been proposed to defend against adversarial word-substitution attacks for neural NLP models.
arXiv Detail & Related papers (2021-08-29T08:11:36Z)
- Defense against Adversarial Attacks in NLP via Dirichlet Neighborhood Ensemble [163.3333439344695]
Dirichlet Neighborhood Ensemble (DNE) is a randomized smoothing method for training a robust model to defend against substitution-based attacks.
DNE forms virtual sentences by sampling embedding vectors for each word in an input sentence from a convex hull spanned by the word and its synonyms, and augments the training data with these virtual sentences.
We demonstrate through extensive experimentation that our method consistently outperforms recently proposed defense methods by a significant margin across different network architectures and multiple data sets.
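A minimal sketch of the convex-hull sampling step described above, assuming Dirichlet-distributed mixture weights over a word and its synonyms; the concentration parameter and the surrounding training loop are assumptions, not the paper's released implementation.

```python
import numpy as np

def sample_virtual_embedding(word_vec, synonym_vecs, alpha=1.0, rng=None):
    """Sample one point from the convex hull spanned by a word's embedding
    and its synonyms' embeddings, using Dirichlet-distributed weights.

    word_vec:     (d,) embedding of the original word
    synonym_vecs: (s, d) embeddings of the word's synonyms
    alpha:        Dirichlet concentration (hypothetical default)
    """
    rng = rng or np.random.default_rng()
    vertices = np.vstack([word_vec[None, :], synonym_vecs])  # (s + 1, d)
    weights = rng.dirichlet(alpha * np.ones(len(vertices)))  # sums to 1
    return weights @ vertices  # convex combination = point inside the hull
```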
arXiv Detail & Related papers (2020-06-20T18:01:16Z)
- Defense of Word-level Adversarial Attacks via Random Substitution Encoding [0.5964792400314836]
Adversarial attacks against deep neural networks on computer vision tasks have spawned many new technologies that help protect models from making false predictions.
Recently, word-level adversarial attacks on deep models of Natural Language Processing (NLP) tasks have also demonstrated strong power, e.g., fooling a sentiment classification neural network to make wrong decisions.
We propose a novel framework called Random Substitution Encoding (RSE), which introduces a random substitution into the training process of original neural networks.
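As a loose sketch of substitution-based training augmentation of this kind: the synonym source, the substitution probability, and how RSE actually encodes the substitutions are assumptions rather than the paper's method.

```python
import random

def random_substitution(tokens, synonyms, sub_prob=0.25, rng=random):
    """Randomly replace tokens with one of their synonyms, a generic form of
    substitution-based augmentation applied to training sentences.

    tokens:   list of words in a training sentence
    synonyms: dict mapping a word to a list of candidate substitutes
    sub_prob: per-token substitution probability (hypothetical value)
    """
    out = []
    for tok in tokens:
        candidates = synonyms.get(tok, [])
        if candidates and rng.random() < sub_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out
```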
arXiv Detail & Related papers (2020-05-01T15:28:43Z)
- BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adversarial attacks on discrete data (such as text) are more challenging than those on continuous data (such as images).
We propose BERT-Attack, a high-quality and effective method to generate adversarial samples.
Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
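The summary implies that BERT's masked-language-model head proposes context-aware substitutes for important tokens; the sketch below shows only that candidate-generation step, assuming the Hugging Face transformers API, and leaves out sub-word handling and the word-importance ranking the full attack would need.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def mlm_candidates(words, position, top_k=8):
    """Propose context-aware substitutes for the word at `position` by
    masking it and reading the masked-LM's top predictions."""
    masked = list(words)
    masked[position] = tokenizer.mask_token
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits  # (1, seq_len, vocab_size)
    # locate the [MASK] token in the encoded sequence
    mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    top_ids = logits[0, mask_idx].topk(top_k).indices
    return tokenizer.convert_ids_to_tokens(top_ids.tolist())
```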
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.