Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial
Perturbations
- URL: http://arxiv.org/abs/2002.04599v2
- Date: Tue, 4 Aug 2020 16:53:43 GMT
- Title: Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial
Perturbations
- Authors: Florian Tramèr and Jens Behrmann and Nicholas Carlini and Nicolas
Papernot and Jörn-Henrik Jacobsen
- Abstract summary: Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
- Score: 65.05561023880351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adversarial examples are malicious inputs crafted to induce
misclassification. Commonly studied sensitivity-based adversarial examples
introduce semantically-small changes to an input that result in a different
model prediction. This paper studies a complementary failure mode,
invariance-based adversarial examples, that introduce minimal semantic changes
that modify an input's true label yet preserve the model's prediction. We
demonstrate fundamental tradeoffs between these two types of adversarial
examples.
We show that defenses against sensitivity-based attacks actively harm a
model's accuracy on invariance-based attacks, and that new approaches are
needed to resist both attack types. In particular, we break state-of-the-art
adversarially-trained and certifiably-robust models by generating small
perturbations that the models are (provably) robust to, yet that change an
input's class according to human labelers. Finally, we formally show that the
existence of excessively invariant classifiers arises from the presence of
overly-robust predictive features in standard datasets.
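The distinction between the two failure modes can be illustrated with a toy linear classifier. This is a minimal sketch for intuition only (the two-feature model, step sizes, and directions are illustrative assumptions, not the paper's attack construction):

```python
import numpy as np

# Toy linear classifier: predict class 1 iff w . x + b > 0.
w = np.array([1.0, 0.0])
b = 0.0

def predict(x):
    return int(w @ x + b > 0)

x = np.array([0.1, 0.0])  # predicted class 1

# Sensitivity-based adversarial example: a small step along the
# direction the model is sensitive to (here, w) flips the model's
# prediction, while a human would keep the original label.
x_sens = x - 0.2 * w / np.linalg.norm(w)
assert predict(x_sens) != predict(x)

# Invariance-based adversarial example: a large move along a
# direction the model ignores (orthogonal to w) can change the
# input's true label, yet the model's prediction is unchanged.
x_inv = x + 10.0 * np.array([0.0, 1.0])
assert predict(x_inv) == predict(x)
```

A defense that shrinks the model's sensitive directions (robustness to the first case) can enlarge the ignored directions, which is the tradeoff the paper formalizes.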
Related papers
- In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks.
Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT, against various popular attacks.
We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
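The label smoothing studied above can be sketched in a few lines; the smoothing factor and three-class setup below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

# Label smoothing: replace a one-hot target y over K classes with
# (1 - eps) * y + eps / K, pulling probability mass off the argmax
# and discouraging over-confident predictions.
def smooth_labels(y_onehot, eps=0.1):
    k = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / k

y = np.array([0.0, 1.0, 0.0])
print(smooth_labels(y))  # → [0.03333333 0.93333333 0.03333333]
```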
arXiv Detail & Related papers (2022-12-20T14:06:50Z)
- Improving Adversarial Robustness to Sensitivity and Invariance Attacks with Deep Metric Learning [80.21709045433096]
A standard framework in adversarial robustness defends against samples crafted by minimally perturbing a clean input.
We use metric learning to frame adversarial regularization as an optimal transport problem.
Our preliminary results indicate that regularizing over invariant perturbations in our framework improves both invariant and sensitivity defense.
arXiv Detail & Related papers (2022-11-04T13:54:02Z)
- Balanced Adversarial Training: Balancing Tradeoffs between Fickleness and Obstinacy in NLP Models [21.06607915149245]
We show that standard adversarial training methods may make a model more vulnerable to fickle adversarial examples.
We introduce Balanced Adversarial Training, which incorporates contrastive learning to increase robustness against both fickle and obstinate adversarial examples.
arXiv Detail & Related papers (2022-10-20T18:02:07Z)
- Localized Uncertainty Attacks [9.36341602283533]
We present localized uncertainty attacks against deep learning models.
We create adversarial examples by perturbing only regions in the inputs where a classifier is uncertain.
Unlike $\ell_p$-ball or functional attacks, which perturb inputs indiscriminately, our targeted changes can be less perceptible.
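A hedged sketch of the idea above, assuming predictive entropy as the per-region uncertainty measure (the threshold, step size, and entropy choice are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

# Predictive entropy of a categorical distribution: high when the
# classifier is uncertain, near zero when it is confident.
def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

# Perturb only coordinates whose region has high-entropy predictions;
# all other coordinates are left untouched, unlike an l_p-ball attack
# that perturbs every coordinate.
def localized_perturb(x, probs_per_region, grad, step=0.05, thresh=0.5):
    mask = (entropy(probs_per_region) > thresh).astype(float)
    return x + step * mask * np.sign(grad)

x = np.zeros(3)
probs = np.array([[0.99, 0.01], [0.5, 0.5], [0.95, 0.05]])
grad = np.array([1.0, 1.0, -1.0])
print(localized_perturb(x, probs, grad))  # only the uncertain middle region moves
```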
arXiv Detail & Related papers (2021-06-17T03:07:22Z)
- Towards Defending against Adversarial Examples via Attack-Invariant Features [147.85346057241605]
Deep neural networks (DNNs) are vulnerable to adversarial noise.
Adversarial robustness can be improved by exploiting adversarial examples.
Models trained on seen types of adversarial examples generally cannot generalize well to unseen types of adversarial examples.
arXiv Detail & Related papers (2021-06-09T12:49:54Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)
- Asymptotic Behavior of Adversarial Training in Binary Classification [41.7567932118769]
Adversarial training is considered to be the state-of-the-art method for defense against adversarial attacks.
Despite its practical success, several open problems remain in understanding the performance of adversarial training.
We derive precise theoretical predictions for the minimizers of adversarial training in binary classification.
arXiv Detail & Related papers (2020-10-26T01:44:20Z)
- Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set, which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.