A Constraint-Enforcing Reward for Adversarial Attacks on Text Classifiers
- URL: http://arxiv.org/abs/2405.11904v1
- Date: Mon, 20 May 2024 09:33:43 GMT
- Title: A Constraint-Enforcing Reward for Adversarial Attacks on Text Classifiers
- Authors: Tom Roth, Inigo Jauregi Unanue, Alsharif Abuadbba, Massimo Piccardi
- Abstract summary: We train an encoder-decoder paraphrase model to generate adversarial examples.
We adopt a reinforcement learning algorithm and propose a constraint-enforcing reward.
We show how key design choices impact the generated examples and discuss the strengths and weaknesses of the proposed approach.
- Score: 10.063169009242682
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text classifiers are vulnerable to adversarial examples -- correctly classified examples that are deliberately transformed to be misclassified while satisfying acceptability constraints. The conventional approach to finding adversarial examples is to define and solve a combinatorial optimisation problem over a space of allowable transformations. While effective, this approach is slow and limited by the choice of transformations. An alternative approach is to directly generate adversarial examples by fine-tuning a pre-trained language model, as is commonly done for other text-to-text tasks. This approach promises to be much quicker and more expressive, but is relatively unexplored. For this reason, in this work we train an encoder-decoder paraphrase model to generate a diverse range of adversarial examples. For training, we adopt a reinforcement learning algorithm and propose a constraint-enforcing reward that promotes the generation of valid adversarial examples. Experimental results over two text classification datasets show that our model achieves a higher success rate than the original paraphrase model and overall proves more effective than other competitive attacks. Finally, we show how key design choices impact the generated examples and discuss the strengths and weaknesses of the proposed approach.
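The abstract does not spell out the reward, but the idea of a constraint-enforcing reward can be sketched in a few lines: give the attacker credit only when the paraphrase both satisfies the acceptability constraint and hurts the victim classifier. The similarity threshold, the choice of semantic similarity as the constraint, and the function names below are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a constraint-enforcing reward for RL training of a
# paraphrase attacker. All names and thresholds here are illustrative
# assumptions; the paper defines its own reward in the full text.

def constraint_enforcing_reward(
    victim_prob_true_label: float,  # victim's P(true label | paraphrase)
    similarity: float,              # semantic similarity(original, paraphrase)
    sim_threshold: float = 0.8,     # acceptability constraint (assumed value)
) -> float:
    """Reward paraphrases only when they satisfy the constraint AND hurt the victim."""
    if similarity < sim_threshold:
        return 0.0  # constraint violated: no reward, however damaging the paraphrase
    # Constraint satisfied: reward grows as the victim's confidence in the
    # true label drops.
    return 1.0 - victim_prob_true_label

# A valid paraphrase that degrades the victim's confidence earns a high reward,
print(constraint_enforcing_reward(victim_prob_true_label=0.2, similarity=0.9))  # 0.8
# while an unfaithful one earns nothing.
print(constraint_enforcing_reward(victim_prob_true_label=0.1, similarity=0.5))  # 0.0
```

Gating the reward at zero when the constraint fails, rather than merely discounting it, removes any incentive to trade validity for attack success, which is exactly the failure mode a constraint-enforcing reward is meant to prevent.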
Related papers
- Reversible Jump Attack to Textual Classifiers with Modification Reduction [8.247761405798874]
Reversible Jump Attack (RJA) and Metropolis-Hastings Modification Reduction (MMR) are proposed.
RJA-MMR outperforms current state-of-the-art methods in attack performance, imperceptibility, fluency, and grammatical correctness.
arXiv Detail & Related papers (2024-03-21T04:54:31Z)
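The summary above names a Metropolis-Hastings procedure for reducing modifications; a generic MH acceptance step of that flavour might look like the sketch below, where the proposal and the target density (favouring adversarial candidates with fewer edits) are assumptions for illustration.

```python
import random

# Sketch of a Metropolis-Hastings step for modification reduction: propose
# reverting an edit and accept with the usual MH ratio. The target density
# and the symmetric proposal below are illustrative assumptions.

def mh_step(candidate, propose, target_density):
    proposal = propose(candidate)  # symmetric proposal assumed
    ratio = target_density(proposal) / target_density(candidate)
    if random.random() < min(1.0, ratio):
        return proposal            # accept: typically fewer modifications
    return candidate               # reject: keep the current candidate

# Toy usage: a state is its number of edits; fewer edits = higher density.
density = lambda n_edits: 0.5 ** n_edits
revert_one = lambda n_edits: max(0, n_edits - 1)
state = 5
for _ in range(10):
    state = mh_step(state, revert_one, density)
print(state)  # drifts toward fewer modifications
```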
- In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT, against various popular attacks.
We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
arXiv Detail & Related papers (2022-12-20T14:06:50Z)
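Label smoothing itself is standard; one common variant is sketched below. The value of eps and the way the mass is spread over the other classes are conventional choices, not necessarily the strategies compared in the paper.

```python
import math

# One common label-smoothing variant: the true class keeps weight 1 - eps and
# the remaining eps is spread uniformly over the other classes. eps = 0.1 and
# this particular spreading rule are conventional choices, assumed here.

def smoothed_cross_entropy(probs, true_idx, eps=0.1):
    """Cross-entropy of predicted probabilities against a smoothed target."""
    k = len(probs)
    target = [eps / (k - 1)] * k
    target[true_idx] = 1.0 - eps
    return -sum(t * math.log(p) for t, p in zip(target, probs))

# Smoothing penalises over-confident predictions relative to calibrated ones:
print(smoothed_cross_entropy([0.98, 0.01, 0.01], true_idx=0))  # ~0.479
print(smoothed_cross_entropy([0.90, 0.05, 0.05], true_idx=0))  # ~0.394
```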
- The Enemy of My Enemy is My Friend: Exploring Inverse Adversaries for Improving Adversarial Training [72.39526433794707]
Adversarial training and its variants have been shown to be the most effective approaches to defend against adversarial examples.
We propose a novel adversarial training scheme that encourages the model to produce similar outputs for an adversarial example and its "inverse adversarial" counterpart.
Our training method achieves state-of-the-art robustness as well as natural accuracy.
arXiv Detail & Related papers (2022-11-01T15:24:26Z)
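A minimal sketch of the consistency objective described above, assuming a KL term between the model's outputs on an adversarial example and its "inverse adversarial" counterpart; the weight lambda_ and the direction of the KL are assumptions, and the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

# Sketch: a cross-entropy term on the adversarial example plus a KL term
# pulling its output distribution toward that of its inverse adversarial
# counterpart. lambda_ and the KL direction are illustrative assumptions.

def inverse_adversary_consistency_loss(logits_adv, logits_inv, labels, lambda_=1.0):
    ce = F.cross_entropy(logits_adv, labels)  # standard adversarial training term
    consistency = F.kl_div(                   # output-similarity term
        F.log_softmax(logits_adv, dim=-1),
        F.softmax(logits_inv, dim=-1),
        reduction="batchmean",
    )
    return ce + lambda_ * consistency

logits_adv = torch.randn(4, 3, requires_grad=True)  # outputs on adversarial inputs
logits_inv = torch.randn(4, 3)                      # outputs on inverse adversarial inputs
labels = torch.tensor([0, 2, 1, 0])
print(inverse_adversary_consistency_loss(logits_adv, logits_inv, labels))
```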
- Block-Sparse Adversarial Attack to Fool Transformer-Based Text Classifiers [49.50163349643615]
In this paper, we propose a gradient-based adversarial attack against transformer-based text classifiers.
Experimental results demonstrate that, while our adversarial attack maintains the semantics of the sentence, it can reduce the accuracy of GPT-2 to less than 5%.
arXiv Detail & Related papers (2022-03-11T14:37:41Z)
- BOSS: Bidirectional One-Shot Synthesis of Adversarial Examples [8.359029046999233]
A one-shot synthesis of adversarial examples is proposed in this paper.
The inputs are synthesized from scratch to induce arbitrary soft predictions at the output of pre-trained models.
We demonstrate the generality and versatility of the framework and approach proposed through applications to the design of targeted adversarial attacks.
arXiv Detail & Related papers (2021-08-05T17:43:36Z)
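The entry above describes synthesising inputs from scratch to induce arbitrary soft predictions from a pre-trained model; in sketch form, that is gradient descent on the input against a divergence to the target distribution. The tiny linear stand-in model, the target distribution, and the optimiser settings below are all assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch of one-shot input synthesis: optimise an input from scratch so a
# frozen model emits a chosen soft prediction.

torch.manual_seed(0)
model = torch.nn.Linear(8, 3)           # stand-in for a frozen pre-trained model
for p in model.parameters():
    p.requires_grad_(False)

target = torch.tensor([0.1, 0.8, 0.1])  # arbitrary desired soft prediction
x = torch.zeros(8, requires_grad=True)  # input synthesised from scratch
opt = torch.optim.Adam([x], lr=0.1)

for _ in range(300):
    opt.zero_grad()
    loss = F.kl_div(F.log_softmax(model(x), dim=-1), target, reduction="sum")
    loss.backward()
    opt.step()

print(F.softmax(model(x), dim=-1))      # close to the requested target
```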
- A Differentiable Language Model Adversarial Attack on Text Classifiers [10.658675415759697]
We propose a new black-box sentence-level attack for natural language processing.
Our method fine-tunes a pre-trained language model to generate adversarial examples.
We show that the proposed attack outperforms competitors on a diverse set of NLP problems for both computed metrics and human evaluation.
arXiv Detail & Related papers (2021-07-23T14:43:13Z)
- Detecting Adversarial Examples by Input Transformations, Defense Perturbations, and Voting [71.57324258813674]
Convolutional neural networks (CNNs) have been shown to reach super-human performance in visual recognition tasks.
However, CNNs can easily be fooled by adversarial examples, i.e., maliciously crafted images that force the networks to predict an incorrect output.
This paper extensively explores the detection of adversarial examples via image transformations and proposes a novel methodology.
arXiv Detail & Related papers (2021-01-27T14:50:41Z)
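A detection-by-voting scheme of the kind described above can be sketched as follows: classify several transformed variants of the input and flag it when the votes disagree. The transforms, the agreement threshold, and the toy scalar classifier are illustrative assumptions, not the paper's pipeline.

```python
from collections import Counter

# Sketch of detection by input transformations and voting: classify several
# transformed variants and flag the input as adversarial when votes disagree.

def detect_by_voting(x, transforms, classify, agreement=0.75):
    preds = [classify(t(x)) for t in transforms]
    top_count = Counter(preds).most_common(1)[0][1]
    # Clean inputs tend to be stable under mild transforms; adversarial
    # perturbations are brittle, so disagreement signals an attack.
    return (top_count / len(preds)) < agreement  # True = flagged as adversarial

classify = lambda v: int(v > 0.5)  # toy classifier on a scalar "input"
transforms = [lambda v: v, lambda v: v + 0.02, lambda v: v - 0.02, lambda v: v * 0.98]
print(detect_by_voting(0.51, transforms, classify))  # near the boundary: flagged
print(detect_by_voting(0.90, transforms, classify))  # stable: not flagged
```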
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can generate adversarial examples capable of fooling almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- ATRO: Adversarial Training with a Rejection Option [10.36668157679368]
This paper proposes a classification framework with a rejection option to mitigate the performance deterioration caused by adversarial examples.
Applying the adversarial training objective to both a classifier and a rejection function simultaneously, we can choose to abstain from classification when the model has insufficient confidence to classify a test data point.
arXiv Detail & Related papers (2020-10-24T14:05:03Z)
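The rejection mechanism can be sketched with a fixed confidence threshold; note that ATRO learns the rejection function jointly with the classifier under the adversarial training objective, so the hard-coded threshold below is a simplifying assumption.

```python
# Sketch of a classifier with a rejection option: abstain whenever confidence
# falls below a threshold. The fixed tau is a simplifying assumption; ATRO
# learns the rejection function jointly with the classifier.

def classify_or_reject(probs, tau=0.7):
    """Return the predicted class index, or None to abstain."""
    conf = max(probs)
    if conf < tau:
        return None  # insufficient confidence: reject rather than guess
    return probs.index(conf)

print(classify_or_reject([0.90, 0.05, 0.05]))  # 0: confident enough to classify
print(classify_or_reject([0.40, 0.35, 0.25]))  # None: abstain
```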
- Contextualized Perturbation for Textual Adversarial Attack [56.370304308573274]
Adversarial examples expose the vulnerabilities of natural language processing (NLP) models.
This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs.
arXiv Detail & Related papers (2020-09-16T06:53:15Z)
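In the spirit of CLARE's contextualized "replace" action, a masked language model can propose fluent, in-context substitutions. The sketch below uses an off-the-shelf fill-mask pipeline; it omits CLARE's insert and merge actions and its scoring against a victim classifier, and the model choice is an assumption.

```python
from transformers import pipeline

# Contextualized "replace" in sketch form: mask one position and let a masked
# language model propose fluent, in-context substitutes. CLARE additionally
# has insert and merge actions and scores candidates against a victim
# classifier; both are omitted here.

fill = pipeline("fill-mask", model="distilroberta-base")

sentence = "The movie was absolutely <mask> from start to finish."
for candidate in fill(sentence, top_k=3):
    print(candidate["token_str"].strip(), round(candidate["score"], 3))
```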