Break it, Imitate it, Fix it: Robustness by Generating Human-Like
Attacks
- URL: http://arxiv.org/abs/2310.16955v2
- Date: Wed, 14 Feb 2024 20:01:11 GMT
- Title: Break it, Imitate it, Fix it: Robustness by Generating Human-Like
Attacks
- Authors: Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin
Chen, Alex Beutel
- Abstract summary: We propose an adversarial training framework that uses limited human adversarial examples to generate more useful adversarial examples at scale.
We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets.
- Score: 18.66548052614702
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world natural language processing systems need to be robust to human
adversaries. Collecting examples of human adversaries for training is an
effective but expensive solution. On the other hand, training on synthetic
attacks with small perturbations - such as word-substitution - does not
actually improve robustness to human adversaries. In this paper, we propose an
adversarial training framework that uses limited human adversarial examples to
generate more useful adversarial examples at scale. We demonstrate the
advantages of this system on the ANLI and hate speech detection benchmark
datasets - both collected via an iterative, adversarial
human-and-model-in-the-loop procedure. Compared to training only on observed
human attacks, also training on our synthetic adversarial examples improves
model robustness to future rounds. In ANLI, we see accuracy gains on the
current set of attacks (44.1%$\,\to\,$50.1%) and on two future unseen rounds of
human generated attacks (32.5%$\,\to\,$43.4%, and 29.4%$\,\to\,$40.2%). In hate
speech detection, we see AUC gains on current attacks (0.76 $\to$ 0.84) and a
future round (0.77 $\to$ 0.79). Attacks from methods that do not learn the
distribution of existing human adversaries, meanwhile, degrade robustness.
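As a rough illustration of the training loop the abstract describes, the following is a minimal Python sketch, not the authors' released code; the `Example` type, the `train_classifier` callback, and the toy generator stand-in are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Example:
    text: str   # e.g. an ANLI hypothesis or a hate-speech comment
    label: int

Classifier = Callable[[str], int]  # task model under attack

def imitate_human_attacks(human_attacks: Sequence[Example]) -> Callable[[int], List[Example]]:
    """'Imitate it': the paper trains a generator on the limited set of human
    adversarial examples; this toy stand-in merely recycles them so the sketch
    runs end to end."""
    def sample(n: int) -> List[Example]:
        return [human_attacks[i % len(human_attacks)] for i in range(n)]
    return sample

def break_imitate_fix(train_classifier: Callable[[Sequence[Example]], Classifier],
                      train_data: Sequence[Example],
                      human_attacks: Sequence[Example],
                      n_synthetic: int = 1000) -> Classifier:
    # "Break it": human adversaries have already produced `human_attacks`
    # that fool the current model (e.g. one ANLI round).
    model = train_classifier(list(train_data) + list(human_attacks))
    # "Imitate it": learn the distribution of the human attacks (toy stand-in
    # above), then sample new candidate attacks at scale.
    sample = imitate_human_attacks(human_attacks)
    candidates = sample(n_synthetic)
    # Keep candidates that still fool the current model; a real implementation
    # would likely also filter for fluency and label preservation.
    synthetic = [ex for ex in candidates if model(ex.text) != ex.label]
    # "Fix it": retrain on the original data plus human and synthetic attacks.
    return train_classifier(list(train_data) + list(human_attacks) + synthetic)
```

The design choice the abstract emphasizes is that the generator imitates the distribution of observed human attacks; attack generators that ignore that distribution are reported to degrade robustness.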
Related papers
- Improving Large Language Model Safety with Contrastive Representation Learning [92.79965952162298]
Large Language Models (LLMs) are powerful tools with profound societal impacts.
Their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks.
We propose a defense framework that formulates model defense as a contrastive representation learning problem.
arXiv Detail & Related papers (2025-06-13T16:42:09Z) - Can Go AIs be adversarially robust? [4.466856575755327]
We study whether adding natural countermeasures can achieve robustness in Go.
We find that though some of these defenses protect against previously discovered attacks, none withstand freshly trained adversaries.
Our results suggest that building robust AI systems is challenging even with extremely superhuman systems in some of the most tractable settings.
arXiv Detail & Related papers (2024-06-18T17:57:49Z) - Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks [21.914674640285337]
This paper focuses on analyzing factors associated with attack success rates (ASR).
We introduce a new attack objective - entity swapping using adversarial suffixes and two gradient-based attack algorithms.
We identify conditions that result in a success probability of 60% for adversarial attacks and others where this likelihood drops below 5%.
arXiv Detail & Related papers (2023-12-22T05:10:32Z) - Outlier Robust Adversarial Training [57.06824365801612]
We introduce Outlier Robust Adversarial Training (ORAT) in this work.
ORAT is based on a bi-level optimization formulation of adversarial training with a robust rank-based loss function.
We show that the learning objective of ORAT satisfies $\mathcal{H}$-consistency in binary classification, which establishes it as a proper surrogate to the adversarial 0/1 loss.
arXiv Detail & Related papers (2023-09-10T21:36:38Z) - The Best Defense is a Good Offense: Adversarial Augmentation against
Adversarial Attacks [91.56314751983133]
$A^5$ is a framework that crafts a defensive perturbation to guarantee that any attack on the input at hand will fail.
We show effective on-the-fly defensive augmentation with a robustifier network that ignores the ground truth label.
We also show how to apply $A^5$ to create certifiably robust physical objects.
arXiv Detail & Related papers (2023-05-23T16:07:58Z) - Improved Adversarial Training Through Adaptive Instance-wise Loss
Smoothing [5.1024659285813785]
Adversarial training has been the most successful defense against such adversarial attacks.
We propose a new adversarial training method: Instance-adaptive Smoothness Enhanced Adversarial Training.
Our method achieves state-of-the-art robustness against $\ell_\infty$-norm constrained attacks.
arXiv Detail & Related papers (2023-03-24T15:41:40Z) - Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning [95.60856995067083]
This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms.
We propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection.
Experimental results show that our detection module effectively shields the ASV by detecting adversarial samples with an accuracy of around 80%.
arXiv Detail & Related papers (2021-06-01T07:10:54Z) - Universal Adversarial Training with Class-Wise Perturbations [78.05383266222285]
Adversarial training is the most widely used method for defending against adversarial attacks.
In this work, we find that a universal adversarial perturbation (UAP) does not attack all classes equally.
We improve the SOTA UAT by proposing to utilize class-wise UAPs during adversarial training.
arXiv Detail & Related papers (2021-04-07T09:05:49Z) - What Doesn't Kill You Makes You Robust(er): Adversarial Training against
Poisons and Backdoors [57.040948169155925]
We extend the adversarial training framework to defend against (training-time) poisoning and backdoor attacks.
Our method desensitizes networks to the effects of poisoning by creating poisons during training and injecting them into training batches.
We show that this defense withstands adaptive attacks, generalizes to diverse threat models, and incurs a better performance trade-off than previous defenses.
arXiv Detail & Related papers (2021-02-26T17:54:36Z) - Target Training Does Adversarial Training Without Adversarial Samples [0.10152838128195464]
Based on the minimization at the core of adversarial attacks, adversarial samples are not optimal for steering attack convergence.
Target Training eliminates the need to generate adversarial samples for training against all attacks that minimize perturbation.
Using adversarial samples against attacks that do not minimize perturbation, Target Training exceeds the current best defense ($69.1$%) with $76.4$% against CW-L$_2$ ($\kappa=40$) in CIFAR10.
arXiv Detail & Related papers (2021-02-09T14:17:57Z) - Semantics-Preserving Adversarial Training [12.242659601882147]
Adversarial training is a technique that improves adversarial robustness of a deep neural network (DNN) by including adversarial examples in the training data.
We propose semantics-preserving adversarial training (SPAT) which encourages perturbation on the pixels that are shared among all classes.
Experimental results show that SPAT improves adversarial robustness and achieves state-of-the-art results on CIFAR-10 and CIFAR-100.
arXiv Detail & Related papers (2020-09-23T07:42:14Z) - Perceptual Adversarial Robustness: Defense Against Unseen Threat Models [58.47179090632039]
A key challenge in adversarial robustness is the lack of a precise mathematical characterization of human perception.
Under the neural perceptual threat model, we develop novel perceptual adversarial attacks and defenses.
Because the NPTM is very broad, we find that Perceptual Adversarial Training (PAT) against a perceptual attack gives robustness against many other types of adversarial attacks.
arXiv Detail & Related papers (2020-06-22T22:40:46Z) - Using Single-Step Adversarial Training to Defend Iterative Adversarial
Examples [6.609200722223488]
We propose a novel single-step adversarial training method which can defend against both single-step and iterative adversarial examples.
Our proposed method achieves a 35.67% improvement in test accuracy and a 19.14% reduction in training time.
arXiv Detail & Related papers (2020-02-22T05:36:35Z)
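Many of the related papers above (ORAT, instance-adaptive smoothing, class-wise universal adversarial training, SPAT, PAT, single-step training) are variants of the standard min-max adversarial training recipe. For reference, a generic PGD-based training step in the image domain looks roughly like the sketch below; it is a PyTorch-style illustration, not code from any of the papers listed.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Inner maximization: projected gradient ascent within an L_inf ball of radius eps."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()  # random start
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()   # ascend the loss
        x_adv = x + (x_adv - x).clamp(-eps, eps)       # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1).detach()             # keep a valid pixel range
    return x_adv

def adversarial_training_step(model, optimizer, x, y):
    """Outer minimization: one optimizer step on the loss of the adversarial examples."""
    model.eval()
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```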