Universal Adversarial Attacks with Natural Triggers for Text
Classification
- URL: http://arxiv.org/abs/2005.00174v2
- Date: Wed, 7 Apr 2021 21:03:23 GMT
- Title: Universal Adversarial Attacks with Natural Triggers for Text
Classification
- Authors: Liwei Song, Xinwei Yu, Hsuan-Tung Peng, Karthik Narasimhan
- Abstract summary: We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems.
Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models.
- Score: 30.74579821832117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has demonstrated the vulnerability of modern text classifiers to
universal adversarial attacks, which are input-agnostic sequences of words
added to text processed by classifiers. Despite being successful, the word
sequences produced in such attacks are often ungrammatical and can be easily
distinguished from natural text. We develop adversarial attacks that appear
closer to natural English phrases and yet confuse classification systems when
added to benign inputs. We leverage an adversarially regularized autoencoder
(ARAE) to generate triggers and propose a gradient-based search that aims to
maximize the downstream classifier's prediction loss. Our attacks effectively
reduce model accuracy on classification tasks while being less identifiable
than prior models as per automatic detection metrics and human-subject studies.
Our aim is to demonstrate that adversarial attacks can be made harder to detect
than previously thought and to enable the development of appropriate defenses.
Related papers
- Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack [24.954755569786396]
We propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection.
We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model's robustness.
The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content.
arXiv Detail & Related papers (2024-04-02T12:49:22Z) - Adversarial Text Purification: A Large Language Model Approach for
Defense [25.041109219049442]
Adversarial purification is a defense mechanism for safeguarding classifiers against adversarial attacks.
We propose a novel adversarial text purification that harnesses the generative capabilities of Large Language Models.
Our proposed method demonstrates remarkable performance over various classifiers, improving their accuracy under the attack by over 65% on average.
arXiv Detail & Related papers (2024-02-05T02:36:41Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - Order-Disorder: Imitation Adversarial Attacks for Black-box Neural
Ranking Models [48.93128542994217]
We propose an imitation adversarial attack on black-box neural passage ranking models.
We show that the target passage ranking model can be transparentized and imitated by enumerating critical queries/candidates.
We also propose an innovative gradient-based attack method, empowered by the pairwise objective function, to generate adversarial triggers.
arXiv Detail & Related papers (2022-09-14T09:10:07Z) - Rethinking Textual Adversarial Defense for Pre-trained Language Models [79.18455635071817]
A literature review shows that pre-trained language models (PrLMs) are vulnerable to adversarial attacks.
We propose a novel metric (Degree of Anomaly) to enable current adversarial attack approaches to generate more natural and imperceptible adversarial examples.
We show that our universal defense framework achieves comparable or even higher after-attack accuracy with other specific defenses.
arXiv Detail & Related papers (2022-07-21T07:51:45Z) - "That Is a Suspicious Reaction!": Interpreting Logits Variation to
Detect NLP Adversarial Attacks [0.2999888908665659]
Adversarial attacks are a major challenge faced by current machine learning research.
Our work presents a model-agnostic detector of adversarial text examples.
arXiv Detail & Related papers (2022-04-10T09:24:41Z) - Zero-Query Transfer Attacks on Context-Aware Object Detectors [95.18656036716972]
Adversarial attacks perturb images such that a deep neural network produces incorrect classification results.
A promising approach to defend against adversarial attacks on natural multi-object scenes is to impose a context-consistency check.
We present the first approach for generating context-consistent adversarial attacks that can evade the context-consistency check.
arXiv Detail & Related papers (2022-03-29T04:33:06Z) - Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely hard-label attack, in which the attacker could only access the prediction label.
Based on this observation, we propose a novel hard-label attack, called Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks regarding the attack performance as well as adversary quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z) - Generating Label Cohesive and Well-Formed Adversarial Claims [44.29895319592488]
Adversarial attacks reveal important vulnerabilities and flaws of trained models.
We investigate how to generate adversarial attacks against fact checking systems that preserve the ground truth meaning.
We find that the generated attacks maintain the directionality and semantic validity of the claim better than previous work.
arXiv Detail & Related papers (2020-09-17T10:50:42Z) - Temporal Sparse Adversarial Attack on Sequence-based Gait Recognition [56.844587127848854]
We demonstrate that the state-of-the-art gait recognition model is vulnerable to such attacks.
We employ a generative adversarial network based architecture to semantically generate adversarial high-quality gait silhouettes or video frames.
The experimental results show that if only one-fortieth of the frames are attacked, the accuracy of the target model drops dramatically.
arXiv Detail & Related papers (2020-02-22T10:08:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.