Break it, Imitate it, Fix it: Robustness by Generating Human-Like
Attacks
- URL: http://arxiv.org/abs/2310.16955v2
- Date: Wed, 14 Feb 2024 20:01:11 GMT
- Title: Break it, Imitate it, Fix it: Robustness by Generating Human-Like
Attacks
- Authors: Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin
Chen, Alex Beutel
- Abstract summary: We propose an adversarial training framework that uses limited human adversarial examples to generate more useful adversarial examples at scale.
We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets.
- Score: 18.66548052614702
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world natural language processing systems need to be robust to human
adversaries. Collecting examples of human adversaries for training is an
effective but expensive solution. On the other hand, training on synthetic
attacks with small perturbations - such as word-substitution - does not
actually improve robustness to human adversaries. In this paper, we propose an
adversarial training framework that uses limited human adversarial examples to
generate more useful adversarial examples at scale. We demonstrate the
advantages of this system on the ANLI and hate speech detection benchmark
datasets - both collected via an iterative, adversarial
human-and-model-in-the-loop procedure. Compared to training only on observed
human attacks, also training on our synthetic adversarial examples improves
model robustness to future rounds. In ANLI, we see accuracy gains on the
current set of attacks (44.1%$\,\to\,$50.1%) and on two future unseen rounds of
human generated attacks (32.5%$\,\to\,$43.4%, and 29.4%$\,\to\,$40.2%). In hate
speech detection, we see AUC gains on current attacks (0.76 $\to$ 0.84) and a
future round (0.77 $\to$ 0.79). Attacks from methods that do not learn the
distribution of existing human adversaries, meanwhile, degrade robustness.
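As a rough illustration of the training loop the abstract describes, the following is a minimal Python sketch, not the authors' released code; the `Example` type, the `train_classifier` callback, and the toy generator stand-in are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Example:
    text: str   # e.g. an ANLI hypothesis or a hate-speech comment
    label: int

Classifier = Callable[[str], int]  # task model under attack

def imitate_human_attacks(human_attacks: Sequence[Example]) -> Callable[[int], List[Example]]:
    """'Imitate it': the paper trains a generator on the limited set of human
    adversarial examples; this toy stand-in merely recycles them so the sketch
    runs end to end."""
    def sample(n: int) -> List[Example]:
        return [human_attacks[i % len(human_attacks)] for i in range(n)]
    return sample

def break_imitate_fix(train_classifier: Callable[[Sequence[Example]], Classifier],
                      train_data: Sequence[Example],
                      human_attacks: Sequence[Example],
                      n_synthetic: int = 1000) -> Classifier:
    # "Break it": human adversaries have already produced `human_attacks`
    # that fool the current model (e.g. one ANLI round).
    model = train_classifier(list(train_data) + list(human_attacks))
    # "Imitate it": learn the distribution of the human attacks (toy stand-in
    # above), then sample new candidate attacks at scale.
    sample = imitate_human_attacks(human_attacks)
    candidates = sample(n_synthetic)
    # Keep candidates that still fool the current model; a real implementation
    # would likely also filter for fluency and label preservation.
    synthetic = [ex for ex in candidates if model(ex.text) != ex.label]
    # "Fix it": retrain on the original data plus human and synthetic attacks.
    return train_classifier(list(train_data) + list(human_attacks) + synthetic)
```

The design choice the abstract emphasizes is that the generator imitates the distribution of observed human attacks; attack generators that ignore that distribution are reported to degrade robustness.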
Related papers
- Improving Large Language Model Safety with Contrastive Representation Learning [92.79965952162298]
Large Language Models (LLMs) are powerful tools with profound societal impacts.
Their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks.
We propose a defense framework that formulates model defense as a contrastive representation learning problem.
arXiv Detail & Related papers (2025-06-13T16:42:09Z) - Can Go AIs be adversarially robust? [4.466856575755327]
We study whether adding natural countermeasures can achieve robustness in Go.
We find that though some of these defenses protect against previously discovered attacks, none withstand freshly trained adversaries.
Our results suggest that building robust AI systems is challenging even with extremely superhuman systems in some of the most tractable settings.
arXiv Detail & Related papers (2024-06-18T17:57:49Z) - Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks [21.914674640285337]
This paper focuses on analyzing factors associated with attack success rates (ASR).
We introduce a new attack objective - entity swapping using adversarial suffixes and two gradient-based attack algorithms.
We identify conditions that result in a success probability of 60% for adversarial attacks and others where this likelihood drops below 5%.
arXiv Detail & Related papers (2023-12-22T05:10:32Z) - Outlier Robust Adversarial Training [57.06824365801612]
We introduce Outlier Robust Adversarial Training (ORAT) in this work.
ORAT is based on a bi-level optimization formulation of adversarial training with a robust rank-based loss function.
We show that the learning objective of ORAT satisfies $\mathcal{H}$-consistency in binary classification, which establishes it as a proper surrogate to the adversarial 0/1 loss.
arXiv Detail & Related papers (2023-09-10T21:36:38Z) - The Best Defense is a Good Offense: Adversarial Augmentation against
Adversarial Attacks [91.56314751983133]
$A^5$ is a framework that crafts a defensive perturbation to guarantee that any attack on the input at hand will fail.
We show effective on-the-fly defensive augmentation with a robustifier network that ignores the ground truth label.
We also show how to apply $A^5$ to create certifiably robust physical objects.
arXiv Detail & Related papers (2023-05-23T16:07:58Z) - Improved Adversarial Training Through Adaptive Instance-wise Loss
Smoothing [5.1024659285813785]
Adversarial training has been the most successful defense against such adversarial attacks.
We propose a new adversarial training method: Instance-adaptive Smoothness Enhanced Adversarial Training.
Our method achieves state-of-the-art robustness against $\ell_\infty$-norm constrained attacks.
arXiv Detail & Related papers (2023-03-24T15:41:40Z) - Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning [95.60856995067083]
This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms.
We propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection.
Experimental results show that our detection module effectively shields the ASV by detecting adversarial samples with an accuracy of around 80%.
arXiv Detail & Related papers (2021-06-01T07:10:54Z) - Universal Adversarial Training with Class-Wise Perturbations [78.05383266222285]
Adversarial training is the most widely used method for defending against adversarial attacks.
In this work, we find that a universal adversarial perturbation (UAP) does not attack all classes equally.
We improve the SOTA UAT by proposing to utilize class-wise UAPs during adversarial training.
arXiv Detail & Related papers (2021-04-07T09:05:49Z) - What Doesn't Kill You Makes You Robust(er): Adversarial Training against
Poisons and Backdoors [57.040948169155925]
We extend the adversarial training framework to defend against (training-time) poisoning and backdoor attacks.
Our method desensitizes networks to the effects of poisoning by creating poisons during training and injecting them into training batches.
We show that this defense withstands adaptive attacks, generalizes to diverse threat models, and incurs a better performance trade-off than previous defenses.
arXiv Detail & Related papers (2021-02-26T17:54:36Z) - Target Training Does Adversarial Training Without Adversarial Samples [0.10152838128195464]
Based on the minimization at the core of adversarial attacks, adversarial samples are not optimal for steering attack convergence.
Target Training eliminates the need to generate adversarial samples for training against all attacks that minimize perturbation.
Using adversarial samples against attacks that do not minimize perturbation, Target Training exceeds the current best defense ($69.1$%) with $76.4$% against CW-L$_2$ ($\kappa=40$) in CIFAR10.
arXiv Detail & Related papers (2021-02-09T14:17:57Z) - Semantics-Preserving Adversarial Training [12.242659601882147]
Adversarial training is a technique that improves adversarial robustness of a deep neural network (DNN) by including adversarial examples in the training data.
We propose semantics-preserving adversarial training (SPAT) which encourages perturbation on the pixels that are shared among all classes.
Experimental results show that SPAT improves adversarial robustness and achieves state-of-the-art results on CIFAR-10 and CIFAR-100.
arXiv Detail & Related papers (2020-09-23T07:42:14Z) - Perceptual Adversarial Robustness: Defense Against Unseen Threat Models [58.47179090632039]
A key challenge in adversarial robustness is the lack of a precise mathematical characterization of human perception.
Under the neural perceptual threat model, we develop novel perceptual adversarial attacks and defenses.
Because the NPTM is very broad, we find that Perceptual Adversarial Training (PAT) against a perceptual attack gives robustness against many other types of adversarial attacks.
arXiv Detail & Related papers (2020-06-22T22:40:46Z) - Using Single-Step Adversarial Training to Defend Iterative Adversarial
Examples [6.609200722223488]
We propose a novel single-step adversarial training method which can defend against both single-step and iterative adversarial examples.
Our proposed method achieves a 35.67% improvement in test accuracy and a 19.14% reduction in training time.
arXiv Detail & Related papers (2020-02-22T05:36:35Z)
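Many of the related papers above (ORAT, instance-adaptive smoothing, class-wise universal adversarial training, SPAT, PAT, single-step training) are variants of the standard min-max adversarial training recipe. For reference, a generic PGD-based training step in the image domain looks roughly like the sketch below; it is a PyTorch-style illustration, not code from any of the papers listed.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Inner maximization: projected gradient ascent within an L_inf ball of radius eps."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()  # random start
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()   # ascend the loss
        x_adv = x + (x_adv - x).clamp(-eps, eps)       # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1).detach()             # keep a valid pixel range
    return x_adv

def adversarial_training_step(model, optimizer, x, y):
    """Outer minimization: one optimizer step on the loss of the adversarial examples."""
    model.eval()
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```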