Don't Retrain, Just Rewrite: Countering Adversarial Perturbations by
Rewriting Text
- URL: http://arxiv.org/abs/2305.16444v1
- Date: Thu, 25 May 2023 19:42:51 GMT
- Title: Don't Retrain, Just Rewrite: Countering Adversarial Perturbations by
Rewriting Text
- Authors: Ashim Gupta, Carter Wood Blum, Temma Choji, Yingjie Fei, Shalin Shah,
Alakananda Vempala, Vivek Srikumar
- Abstract summary: We present ATINTER, a model that intercepts and learns to rewrite adversarial inputs to make them non-adversarial.
Our experiments reveal that ATINTER is effective at providing better adversarial robustness than existing defense approaches.
- Score: 40.491180210205556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can language models transform inputs to protect text classifiers against
adversarial attacks? In this work, we present ATINTER, a model that intercepts
and learns to rewrite adversarial inputs to make them non-adversarial for a
downstream text classifier. Our experiments on four datasets and five attack
mechanisms reveal that ATINTER is effective at providing better adversarial
robustness than existing defense approaches, without compromising task
accuracy. For example, on sentiment classification using the SST-2 dataset, our
method improves the adversarial accuracy over the best existing defense
approach by more than 4% with a smaller decrease in task accuracy (0.5% vs
2.5%). Moreover, we show that ATINTER generalizes across multiple downstream
tasks and classifiers without having to explicitly retrain it for those
settings. Specifically, we find that when ATINTER is trained to remove
adversarial perturbations for the sentiment classification task on the SST-2
dataset, it even transfers to a semantically different task of news
classification (on AGNews) and improves the adversarial robustness by more than
10%.
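  At inference time the defense simply sits between the raw input and the unchanged downstream classifier: every incoming text is first rewritten, and only the rewritten text is classified. Below is a minimal sketch of that wiring in Python; the Hugging Face pipelines, the T5-style rewriter checkpoint, and the SST-2 classifier are illustrative assumptions, not the paper's released components.

```python
# Minimal sketch of an intercept-and-rewrite defense pipeline.
# Assumptions (not from the paper): a seq2seq rewriter fine-tuned to undo
# adversarial perturbations is available as a text2text model; the downstream
# classifier is a frozen SST-2 sentiment model that is never retrained.
from transformers import pipeline

# Placeholder rewriter checkpoint; a real deployment would load a rewriter
# trained on (adversarial text -> clean text) pairs.
rewriter = pipeline("text2text-generation", model="t5-base")

# Frozen downstream classifier, left untouched by the defense.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def defended_predict(text: str) -> dict:
    """Rewrite a possibly adversarial input, then classify the rewrite."""
    rewritten = rewriter(text, max_length=128)[0]["generated_text"]
    return classifier(rewritten)[0]

# Example: a character-level perturbed review reaches the classifier only
# after being rewritten.
print(defended_predict("the flim was surprsingly grate"))
```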
Related papers
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
- Single Word Change is All You Need: Designing Attacks and Defenses for Text Classifiers [12.167426402230229]
A significant portion of adversarial examples generated by existing methods change only one word.
This single-word perturbation vulnerability represents a significant weakness in classifiers.
We present the SP-Attack, designed to exploit the single-word perturbation vulnerability, achieving a higher attack success rate.
We also propose SP-Defense, which aims to improve a classifier's robustness to single-word perturbations by applying data augmentation during learning.
arXiv Detail & Related papers (2024-01-30T17:30:44Z)
- Adversarial Attacks Neutralization via Data Set Randomization [3.655021726150369]
Adversarial attacks on deep learning models pose a serious threat to their reliability and security.
We propose a new defense mechanism rooted in hyperspace projection.
We show that our solution increases the robustness of deep learning models against adversarial attacks.
arXiv Detail & Related papers (2023-06-21T10:17:55Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Detection and Mitigation of Byzantine Attacks in Distributed Training [24.951227624475443]
Abnormal Byzantine behavior of worker nodes can derail training and compromise the quality of inference.
Recent work considers a wide range of attack models and has explored robust aggregation and/or computational redundancy to correct the distorted gradients.
In this work, we consider attack models ranging from strong ones ($q$ omniscient adversaries with full knowledge of the defense protocol, which can change from iteration to iteration) to weak ones ($q$ randomly chosen adversaries with limited collusion abilities).
arXiv Detail & Related papers (2022-08-17T05:49:52Z)
- Don't sweat the small stuff, classify the rest: Sample Shielding to protect text classifiers against adversarial attacks [2.512827436728378]
Deep learning (DL) is being used extensively for text classification.
Attackers modify the text in a way that misleads the classifier while keeping the original meaning largely intact.
We propose a novel and intuitive defense strategy called Sample Shielding.
arXiv Detail & Related papers (2022-05-03T18:24:20Z)
- Block-Sparse Adversarial Attack to Fool Transformer-Based Text Classifiers [49.50163349643615]
In this paper, we propose a gradient-based adversarial attack against transformer-based text classifiers.
Experimental results demonstrate that, while our adversarial attack maintains the semantics of the sentence, it can reduce the accuracy of GPT-2 to less than 5%.
arXiv Detail & Related papers (2022-03-11T14:37:41Z)
- Modelling Adversarial Noise for Adversarial Defense [96.56200586800219]
Adversarial defenses typically focus on exploiting adversarial examples to remove adversarial noise or to train an adversarially robust target model.
Motivated by the observation that the relationship between adversarial data and natural data can help infer clean data from adversarial data and thus obtain the final correct prediction,
we study modeling adversarial noise to learn the transition relationship in the label space, using adversarial labels to improve adversarial accuracy.
arXiv Detail & Related papers (2021-09-21T01:13:26Z)
- Adversarial Self-Supervised Contrastive Learning [62.17538130778111]
Existing adversarial learning approaches mostly use class labels to generate adversarial samples that lead to incorrect predictions.
We propose a novel adversarial attack for unlabeled data, which makes the model confuse the instance-level identities of the perturbed data samples.
We present a self-supervised contrastive learning framework to adversarially train a robust neural network without labeled data.
arXiv Detail & Related papers (2020-06-13T08:24:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.