Contrasting Human- and Machine-Generated Word-Level Adversarial Examples
for Text Classification
- URL: http://arxiv.org/abs/2109.04385v1
- Date: Thu, 9 Sep 2021 16:16:04 GMT
- Title: Contrasting Human- and Machine-Generated Word-Level Adversarial Examples
for Text Classification
- Authors: Maximilian Mozes, Max Bartolo, Pontus Stenetorp, Bennett Kleinberg,
Lewis D. Griffin
- Abstract summary: We report on crowdsourcing studies in which we task humans with iteratively modifying words in an input text.
We analyze how human-generated adversarial examples compare to the recently proposed TextFooler, Genetic, BAE and SememePSO attack algorithms.
- Score: 12.750016480098262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research shows that natural language processing models are generally
considered to be vulnerable to adversarial attacks; but recent work has drawn
attention to the issue of validating these adversarial inputs against certain
criteria (e.g., the preservation of semantics and grammaticality). Enforcing
constraints to uphold such criteria may render attacks unsuccessful, raising
the question of whether valid attacks are actually feasible. In this work, we
investigate this through the lens of human language ability. We report on
crowdsourcing studies in which we task humans with iteratively modifying words
in an input text, while receiving immediate model feedback, with the aim of
causing a sentiment classification model to misclassify the example. Our
findings suggest that humans are capable of generating a substantial amount of
adversarial examples using semantics-preserving word substitutions. We analyze
how human-generated adversarial examples compare to the recently proposed
TextFooler, Genetic, BAE and SememePSO attack algorithms on the dimensions
naturalness, preservation of sentiment, grammaticality and substitution rate.
Our findings suggest that humans are not more able than the best algorithms
to generate natural-reading, sentiment-preserving examples, though they do so
in a much more computationally efficient manner.
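The abstract describes iteratively substituting words while receiving model feedback until the classifier's prediction flips, and it compares attacks on substitution rate. The following is a minimal Python sketch of that general idea, not the procedure followed by the crowdworkers and not an implementation of TextFooler, Genetic, BAE or SememePSO. The toy classifier, the `WEIGHTS` and `SYNONYMS` tables, and the function names are all hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy sentiment weights (assumption, not taken from the paper).
WEIGHTS: Dict[str, int] = {"great": 4, "good": 2, "fine": 1, "decent": 1, "slow": -2}

# Hypothetical synonym table standing in for a semantics-preserving candidate set.
SYNONYMS: Dict[str, List[str]] = {
    "great": ["good", "fine"],
    "good": ["decent", "fine"],
}


def score(words: List[str]) -> int:
    """Sum the sentiment weights of all words; unknown words count as 0."""
    return sum(WEIGHTS.get(w, 0) for w in words)


def predict(words: List[str]) -> str:
    """Toy classifier: positive if the total score is above zero."""
    return "positive" if score(words) > 0 else "negative"


def greedy_attack(words: List[str], max_steps: int = 5) -> Tuple[List[str], bool]:
    """Swap one word per step, keeping the swap that moves the score
    furthest towards the opposite label, until the prediction flips."""
    original_label = predict(words)
    # Multiply by sign so that "lower is better" for either starting label.
    sign = 1 if original_label == "positive" else -1
    current = list(words)
    for _ in range(max_steps):
        if predict(current) != original_label:
            break
        best, best_score = None, sign * score(current)
        for i, word in enumerate(current):
            for candidate in SYNONYMS.get(word, []):
                trial = current[:i] + [candidate] + current[i + 1:]
                trial_score = sign * score(trial)  # one model query per trial
                if trial_score < best_score:
                    best, best_score = trial, trial_score
        if best is None:  # no substitution helps any further
            break
        current = best
    return current, predict(current) != original_label


def substitution_rate(original: List[str], modified: List[str]) -> float:
    """Fraction of word positions that differ from the original text."""
    changed = sum(a != b for a, b in zip(original, modified))
    return changed / len(original)


if __name__ == "__main__":
    text = "a great film with good acting but slow pacing".split()
    adversarial, flipped = greedy_attack(text)
    print(" ".join(adversarial), flipped, substitution_rate(text, adversarial))
```

Each trial substitution costs one model query, so the number of queries grows with text length and candidate-set size. This is one simple way to think about the computational cost that the abstract contrasts between humans and algorithms.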
Related papers
- How do humans perceive adversarial text? A reality check on the validity
and naturalness of word-based adversarial attacks [4.297786261992324]
Adversarial attacks are malicious algorithms that imperceptibly modify input text to force models into making incorrect predictions.
We surveyed 378 human participants about the perceptibility of text adversarial examples produced by state-of-the-art methods.
Our results underline that existing text attacks are impractical in real-world scenarios where humans are involved.
arXiv Detail & Related papers (2023-05-24T21:52:13Z)
- NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as
Artificial Adversaries? [61.58261351116679]
We introduce a two-stage adversarial example generation framework (NaturalAdversaries) for natural language understanding tasks.
It is adaptable to both black-box and white-box adversarial attacks based on the level of access to the model parameters.
Our results indicate these adversaries generalize across domains, and offer insights for future research on improving robustness of neural text classification models.
arXiv Detail & Related papers (2022-11-08T16:37:34Z)
- Identifying Human Strategies for Generating Word-Level Adversarial
Examples [7.504832901086077]
Previous work found that human- and machine-generated adversarial examples are comparable in their naturalness and grammatical correctness.
This paper provides a detailed analysis of exactly how humans create these adversarial examples.
arXiv Detail & Related papers (2022-10-20T21:16:44Z)
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely hard-label attack, in which the attacker could only access the prediction label.
Based on this observation, we propose a novel hard-label attack, called Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks regarding the attack performance as well as adversary quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- Towards Defending against Adversarial Examples via Attack-Invariant
Features [147.85346057241605]
Deep neural networks (DNNs) are vulnerable to adversarial noise.
Adversarial robustness can be improved by exploiting adversarial examples.
Models trained on seen types of adversarial examples generally cannot generalize well to unseen types of adversarial examples.
arXiv Detail & Related papers (2021-06-09T12:49:54Z)
- On the Transferability of Adversarial Attacks against Neural Text
Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- Contextualized Perturbation for Textual Adversarial Attack [56.370304308573274]
Adversarial examples expose the vulnerabilities of natural language processing (NLP) models.
This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs.
arXiv Detail & Related papers (2020-09-16T06:53:15Z)
- Universal Adversarial Attacks with Natural Triggers for Text
Classification [30.74579821832117]
We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems.
Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models.
arXiv Detail & Related papers (2020-05-01T01:58:24Z)
- BAE: BERT-based Adversarial Examples for Text Classification [9.188318506016898]
We present BAE, a black box attack for generating adversarial examples using contextual perturbations from a BERT masked language model.
We show that BAE performs a stronger attack, in addition to generating adversarial examples with improved grammaticality and semantic coherence as compared to prior work.
arXiv Detail & Related papers (2020-04-04T16:25:48Z)
- Generating Natural Language Adversarial Examples on a Large Scale with
Generative Models [41.85006993382117]
We propose an end to end solution to efficiently generate adversarial texts from scratch using generative models.
Specifically, we train a conditional variational autoencoder with an additional adversarial loss to guide the generation of adversarial examples.
To improve the validity of adversarial texts, we utilize discriminators and the training framework of generative adversarial networks.
arXiv Detail & Related papers (2020-03-10T03:21:35Z)