A Geometry-Inspired Attack for Generating Natural Language Adversarial
Examples
- URL: http://arxiv.org/abs/2010.01345v1
- Date: Sat, 3 Oct 2020 12:58:47 GMT
- Title: A Geometry-Inspired Attack for Generating Natural Language Adversarial
Examples
- Authors: Zhao Meng, Roger Wattenhofer
- Abstract summary: We propose a geometry-inspired attack for generating natural language adversarial examples.
Our attack fools natural language models with high success rates, while only replacing a few words.
Further experiments show that adversarial training can improve model robustness against our attack.
- Score: 13.427128424538505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating adversarial examples for natural language is hard, as natural
language consists of discrete symbols, and examples are often of variable
lengths. In this paper, we propose a geometry-inspired attack for generating
natural language adversarial examples. Our attack generates adversarial
examples by iteratively approximating the decision boundary of Deep Neural
Networks (DNNs). Experiments on two datasets with two different models show
that our attack fools natural language models with high success rates, while
only replacing a few words. Human evaluation shows that adversarial examples
generated by our attack are hard for humans to recognize. Further experiments
show that adversarial training can improve model robustness against our attack.
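The abstract describes the attack only at a high level: linearize the classifier's decision boundary, step toward it, and realize that step through a small number of word replacements. Below is a minimal, hedged sketch of how such a DeepFool-style geometric step could be combined with greedy word substitution. The HuggingFace-style model interface (`inputs_embeds`, `.logits`), `embed_matrix`, and `candidates_fn` are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a geometry-inspired (DeepFool-style) word-replacement attack.
# `model`, `embed_matrix`, and `candidates_fn` are hypothetical stand-ins; the
# paper's exact update rule and candidate selection are not reproduced here.
import torch


def boundary_direction(model, embeds, label):
    """Linearize the decision boundary at `embeds` and return the direction and
    distance to the closest boundary point, as in DeepFool."""
    embeds = embeds.detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds.unsqueeze(0)).logits.squeeze(0)
    ranked = logits.argsort(descending=True)
    runner_up = ranked[1] if ranked[0] == label else ranked[0]
    margin = logits[label] - logits[runner_up]          # > 0 while still correct
    grad = torch.autograd.grad(margin, embeds)[0]        # d(margin)/d(embeds)
    direction = -grad / (grad.norm() + 1e-9)             # points toward the boundary
    distance = (margin / (grad.norm() + 1e-9)).item()
    return direction, distance


def attack(model, embed_matrix, input_ids, label, candidates_fn, max_iter=20):
    """Iteratively push the sentence embedding toward the boundary, projecting
    each continuous step back onto the vocabulary by replacing one word."""
    ids = input_ids.clone()
    for _ in range(max_iter):
        embeds = embed_matrix[ids]                        # (seq_len, dim)
        logits = model(inputs_embeds=embeds.unsqueeze(0)).logits.squeeze(0)
        if logits.argmax().item() != label:               # prediction flipped
            return ids
        direction, distance = boundary_direction(model, embeds, label)
        target = embeds + 1.05 * distance * direction     # overshoot the boundary slightly
        # Greedily replace the single word whose candidate substitution (e.g. a
        # synonym) brings the sentence embedding closest to `target`.
        best = None
        for pos in range(ids.size(0)):
            for cand in candidates_fn(ids, pos):          # hypothetical helper
                if cand == ids[pos].item():
                    continue
                delta = (embed_matrix[cand] - target[pos]).norm().item()
                if best is None or delta < best[0]:
                    best = (delta, pos, cand)
        if best is None:
            break
        ids[best[1]] = best[2]
    return ids
```

The greedy projection step is one plausible way to keep the number of replaced words small, in line with the abstract's claim of "only replacing a few words"; the published method may differ in how it selects positions and candidates.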
Related papers
- Generating Valid and Natural Adversarial Examples with Large Language
Models [18.944937459278197]
Adversarial examples generated by existing attack models are often neither valid nor natural, sacrificing semantic preservation, grammaticality, and human imperceptibility.
We propose LLM-Attack, which aims at generating both valid and natural adversarial examples with large language models.
Experimental results on the Movie Review (MR), IMDB, and Review Polarity datasets against the baseline adversarial attack models illustrate the effectiveness of LLM-Attack.
arXiv Detail & Related papers (2023-11-20T15:57:04Z) - Context-aware Adversarial Attack on Named Entity Recognition [15.049160192547909]
We study context-aware adversarial attack methods to examine the model's robustness.
Specifically, we propose perturbing the most informative words for recognizing entities to create adversarial examples.
Experiments and analyses show that our methods are more effective in deceiving the model into making wrong predictions than strong baselines.
arXiv Detail & Related papers (2023-09-16T14:04:23Z) - NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as
Artificial Adversaries? [61.58261351116679]
We introduce a two-stage adversarial example generation framework (NaturalAdversaries) for natural language understanding tasks.
It is adaptable to both black-box and white-box adversarial attacks based on the level of access to the model parameters.
Our results indicate these adversaries generalize across domains, and offer insights for future research on improving robustness of neural text classification models.
arXiv Detail & Related papers (2022-11-08T16:37:34Z) - The Enemy of My Enemy is My Friend: Exploring Inverse Adversaries for
Improving Adversarial Training [72.39526433794707]
Adversarial training and its variants have been shown to be the most effective approaches to defend against adversarial examples.
We propose a novel adversarial training scheme that encourages the model to produce similar outputs for an adversarial example and its "inverse adversarial" counterpart.
Our training method achieves state-of-the-art robustness as well as natural accuracy.
arXiv Detail & Related papers (2022-11-01T15:24:26Z) - Identifying Human Strategies for Generating Word-Level Adversarial
Examples [7.504832901086077]
Previous work found that human- and machine-generated adversarial examples are comparable in their naturalness and grammatical correctness.
This paper provides a detailed analysis of exactly how humans create these adversarial examples.
arXiv Detail & Related papers (2022-10-20T21:16:44Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - Contrasting Human- and Machine-Generated Word-Level Adversarial Examples
for Text Classification [12.750016480098262]
We report on crowdsourcing studies in which we task humans with iteratively modifying words in an input text.
We analyze how human-generated adversarial examples compare to the recently proposed TextFooler, Genetic, BAE and SememePSO attack algorithms.
arXiv Detail & Related papers (2021-09-09T16:16:04Z) - A Differentiable Language Model Adversarial Attack on Text Classifiers [10.658675415759697]
We propose a new black-box sentence-level attack for natural language processing.
Our method fine-tunes a pre-trained language model to generate adversarial examples.
We show that the proposed attack outperforms competitors on a diverse set of NLP problems for both computed metrics and human evaluation.
arXiv Detail & Related papers (2021-07-23T14:43:13Z) - Towards Defending against Adversarial Examples via Attack-Invariant
Features [147.85346057241605]
Deep neural networks (DNNs) are vulnerable to adversarial noise.
Adversarial robustness can be improved by exploiting adversarial examples.
Models trained on seen types of adversarial examples generally cannot generalize well to unseen types of adversarial examples.
arXiv Detail & Related papers (2021-06-09T12:49:54Z) - On the Transferability of Adversarial Attacks against Neural Text
Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z) - Learning to Attack: Towards Textual Adversarial Attacking in Real-world
Situations [81.82518920087175]
Adversarial attacking aims to fool deep neural networks with adversarial examples.
We propose a reinforcement learning based attack model, which can learn from attack history and launch attacks more efficiently.
arXiv Detail & Related papers (2020-09-19T09:12:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.