Balanced Adversarial Training: Balancing Tradeoffs between Fickleness
and Obstinacy in NLP Models
- URL: http://arxiv.org/abs/2210.11498v1
- Date: Thu, 20 Oct 2022 18:02:07 GMT
- Title: Balanced Adversarial Training: Balancing Tradeoffs between Fickleness
and Obstinacy in NLP Models
- Authors: Hannah Chen, Yangfeng Ji, David Evans
- Abstract summary: We show that standard adversarial training methods may make a model more vulnerable to fickle adversarial examples.
We introduce Balanced Adversarial Training, which incorporates contrastive learning to increase robustness against both fickle and obstinate adversarial examples.
- Score: 21.06607915149245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional (fickle) adversarial examples involve finding a small
perturbation that does not change an input's true label but confuses the
classifier into outputting a different prediction. Conversely, obstinate
adversarial examples occur when an adversary finds a small perturbation that
preserves the classifier's prediction but changes the true label of an input.
Adversarial training and certified robust training have shown some
effectiveness in improving the robustness of machine learnt models to fickle
adversarial examples. We show that standard adversarial training methods
focused on reducing vulnerability to fickle adversarial examples may make a
model more vulnerable to obstinate adversarial examples, with experiments for
both natural language inference and paraphrase identification tasks. To counter
this phenomenon, we introduce Balanced Adversarial Training, which incorporates
contrastive learning to increase robustness against both fickle and obstinate
adversarial examples.
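To make the training objective concrete, the sketch below (PyTorch, not the authors' released code) shows one way a contrastive term can be added to standard adversarial training: a label-preserving (fickle-style) perturbation is treated as a positive that should stay close to the original input in representation space, while a label-changing (obstinate-style) perturbation is treated as a negative that should be pushed away. All names (`balanced_contrastive_loss`, the batch keys, the `alpha` weight) are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of a contrastive term for balanced adversarial training.
# Assumes the perturbed inputs (fickle-style and obstinate-style) are produced
# elsewhere, e.g. by synonym vs. antonym substitutions; names are illustrative.

import torch
import torch.nn.functional as F


def balanced_contrastive_loss(anchor, positive, negative, temperature=0.1):
    """InfoNCE-style loss over sentence embeddings.

    anchor:   embeddings of the original inputs,            shape (B, D)
    positive: embeddings of label-preserving perturbations, shape (B, D)
    negative: embeddings of label-changing perturbations,   shape (B, D)
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negative = F.normalize(negative, dim=-1)

    pos_sim = (anchor * positive).sum(dim=-1) / temperature   # (B,)
    neg_sim = (anchor * negative).sum(dim=-1) / temperature   # (B,)

    # Each anchor should score its positive higher than its negative.
    logits = torch.stack([pos_sim, neg_sim], dim=1)            # (B, 2)
    targets = torch.zeros(anchor.size(0), dtype=torch.long,
                          device=anchor.device)                # positive = index 0
    return F.cross_entropy(logits, targets)


def training_step(model, batch, clf_loss_fn, alpha=1.0):
    """Combine the usual classification loss with the contrastive term.

    `model` is assumed to return (logits, sentence_embedding); `batch` is
    assumed to already contain the precomputed perturbed inputs.
    """
    logits, emb_orig = model(batch["original"])
    _, emb_fickle = model(batch["fickle_perturbed"])
    _, emb_obstinate = model(batch["obstinate_perturbed"])

    loss_clf = clf_loss_fn(logits, batch["labels"])
    loss_con = balanced_contrastive_loss(emb_orig, emb_fickle, emb_obstinate)
    return loss_clf + alpha * loss_con
```

Under this reading, the classification term plays the usual role against fickle examples, while the contrastive term penalizes representations that leave obstinate (label-changing) perturbations indistinguishable from the original input.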
Related papers
- On the Effect of Adversarial Training Against Invariance-based
Adversarial Examples [0.23624125155742057]
This work addresses the impact of adversarial training with invariance-based adversarial examples on a convolutional neural network (CNN).
We show that adversarial training with invariance-based and perturbation-based adversarial examples should be conducted simultaneously rather than consecutively.
arXiv Detail & Related papers (2023-02-16T12:35:37Z)
- The Enemy of My Enemy is My Friend: Exploring Inverse Adversaries for
Improving Adversarial Training [72.39526433794707]
Adversarial training and its variants have been shown to be the most effective approaches to defend against adversarial examples.
We propose a novel adversarial training scheme that encourages the model to produce similar outputs for an adversarial example and its "inverse adversarial" counterpart.
Our training method achieves state-of-the-art robustness as well as natural accuracy.
arXiv Detail & Related papers (2022-11-01T15:24:26Z)
- Robust Transferable Feature Extractors: Learning to Defend Pre-Trained
Networks Against White Box Adversaries [69.53730499849023]
We show that adversarial examples can be successfully transferred to another independently trained model to induce prediction errors.
We propose a deep learning-based pre-processing mechanism, which we refer to as a robust transferable feature extractor (RTFE).
arXiv Detail & Related papers (2022-09-14T21:09:34Z)
- Collaborative Adversarial Training [82.25340762659991]
We show that some collaborative examples, nearly perceptually indistinguishable from both adversarial and benign examples, can be utilized to enhance adversarial training.
A novel method called collaborative adversarial training (CoAT) is thus proposed to achieve a new state of the art.
arXiv Detail & Related papers (2022-05-23T09:41:41Z)
- On the Impact of Hard Adversarial Instances on Overfitting in
Adversarial Training [72.95029777394186]
Adversarial training is a popular method to robustify models against adversarial attacks.
We investigate this phenomenon from the perspective of training instances.
We show that the decay in generalization performance of adversarial training is a result of the model's attempt to fit hard adversarial instances.
arXiv Detail & Related papers (2021-12-14T12:19:24Z)
- Calibrated Adversarial Training [8.608288231153304]
We present Calibrated Adversarial Training, a method that reduces the adverse effects of semantic perturbations in adversarial training.
The method produces pixel-level adaptations to the perturbations based on a novel calibrated robust error.
arXiv Detail & Related papers (2021-10-01T19:17:28Z)
- CLINE: Contrastive Learning with Semantic Negative Examples for Natural
Language Understanding [35.003401250150034]
We propose Contrastive Learning with semantIc Negative Examples (CLINE) to improve robustness of pre-trained language models.
CLINE constructs semantic negative examples in an unsupervised manner to improve robustness under semantic adversarial attacks.
Empirical results show that our approach yields substantial improvements on a range of sentiment analysis, reasoning, and reading comprehension tasks.
arXiv Detail & Related papers (2021-07-01T13:34:12Z)
- Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness.
We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
arXiv Detail & Related papers (2020-10-26T04:44:43Z)
- Semantics-Preserving Adversarial Training [12.242659601882147]
Adversarial training is a technique that improves adversarial robustness of a deep neural network (DNN) by including adversarial examples in the training data.
We propose semantics-preserving adversarial training (SPAT) which encourages perturbation on the pixels that are shared among all classes.
Experimental results show that SPAT improves adversarial robustness and achieves state-of-the-art results on CIFAR-10 and CIFAR-100.
arXiv Detail & Related papers (2020-09-23T07:42:14Z)
- Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial
Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)
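The fickle/obstinate distinction that runs through this list can also be stated operationally. The helper below is purely illustrative (its name and signature come from no particular paper) and simply buckets a perturbed input given the model's predictions and the gold labels before and after perturbation.

```python
# Illustrative only: classify a perturbation as "fickle" (prediction flips,
# true label does not) or "obstinate" (prediction is preserved, true label
# changes), following the definitions in the abstract above.

def categorize_perturbation(pred_orig, pred_pert, label_orig, label_pert):
    """Return 'fickle', 'obstinate', or 'neither' for one perturbed input."""
    if label_pert == label_orig and pred_pert != pred_orig:
        return "fickle"      # model changed its mind; meaning did not change
    if label_pert != label_orig and pred_pert == pred_orig:
        return "obstinate"   # meaning changed; model failed to notice
    return "neither"
```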
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.