Related papers: Efficient Adversarial Training in LLMs with Continuous Attacks

Efficient Adversarial Training in LLMs with Continuous Attacks

URL: http://arxiv.org/abs/2405.15589v3
Date: Fri, 01 Nov 2024 16:39:36 GMT
Title: Efficient Adversarial Training in LLMs with Continuous Attacks
Authors: Sophie Xhonneux, Alessandro Sordoni, Stephan Günnemann, Gauthier Gidel, Leo Schwinn,
Abstract summary: Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses. C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
Score: 99.5882845458567
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods to reliably improve robustness against such attacks. Yet, in the context of LLMs, current methods for adversarial training are hindered by the high computational costs required to perform discrete adversarial attacks at each training iteration. We address this problem by instead calculating adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitudes more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust on continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data. Moreover, we introduce C-AdvIPO, an adversarial variant of IPO that does not require utility data for adversarially robust alignment. Our empirical evaluation on five models from different families (Gemma, Phi3, Mistral, Zephyr, Llama2) and at different scales (2B, 3.8B, 7B) shows that both algorithms substantially enhance LLM robustness against discrete attacks (GCG, AutoDAN, PAIR), while maintaining utility. Our results demonstrate that robustness to continuous perturbations can extrapolate to discrete threat models. Thereby, we present a path toward scalable adversarial training algorithms for robustly aligning LLMs.

Related papers

MixAT: Combining Continuous and Discrete Adversarial Training for LLMs [2.679689033125693]
MixAT is a novel method that combines stronger discrete and faster continuous attacks during training.<n>We show MixAT achieves substantially better robustness (ALO-ASR 20%) compared to prior defenses.<n>Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior-accuracy tradeoff with minimal computational overhead.
arXiv Detail & Related papers (2025-05-22T17:32:50Z)
Adversarial Robustness in Two-Stage Learning-to-Defer: Algorithms and Guarantees [3.6787328174619254]
Two-stage Learning-to-Defer (L2D) enables optimal task delegation by assigning each input to either a fixed main model or one of several offline experts.<n>Existing L2D frameworks assume clean inputs and are vulnerable to adversarial perturbations that can manipulate query allocation.<n>We present the first comprehensive study of adversarial robustness in two-stage L2D systems.
arXiv Detail & Related papers (2025-02-03T03:44:35Z)
Robust LLM safeguarding via refusal feature adversarial training [15.76605079209956]
Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. We propose Refusal Feature Adrial Training (ReFAT), a novel algorithm that efficiently performs adversarial training. Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks.
arXiv Detail & Related papers (2024-09-30T08:41:39Z)
Discriminative Adversarial Unlearning [40.30974185546541]
We introduce a novel machine unlearning framework founded upon the established principles of the min-max optimization paradigm. We capitalize on the capabilities of strong Membership Inference Attacks (MIA) to facilitate the unlearning of specific samples from a trained model. Our proposed algorithm closely approximates the ideal benchmark of retraining from scratch for both random sample forgetting and class-wise forgetting schemes.
arXiv Detail & Related papers (2024-02-10T03:04:57Z)
Robust Multi-Agent Reinforcement Learning by Mutual Information Regularization [15.11457665677937]
Existing robust MARL methods either approximate or enumerate all possible threat scenarios against worst-case adversaries. We frame robust MARL as an inference problem, with worst-case robustness implicitly optimized under all threat scenarios. Within this framework, we demonstrate that Mutual Information Regularization as Robust Regularization (MIR3) during routine training is guaranteed to maximize a lower bound on robustness.
arXiv Detail & Related papers (2023-10-15T13:35:51Z)
Outlier Robust Adversarial Training [57.06824365801612]
We introduce Outlier Robust Adversarial Training (ORAT) in this work. ORAT is based on a bi-level optimization formulation of adversarial training with a robust rank-based loss function. We show that the learning objective of ORAT satisfies the $mathcalH$-consistency in binary classification, which establishes it as a proper surrogate to adversarial 0/1 loss.
arXiv Detail & Related papers (2023-09-10T21:36:38Z)
Doubly Robust Instance-Reweighted Adversarial Training [107.40683655362285]
We propose a novel doubly-robust instance reweighted adversarial framework. Our importance weights are obtained by optimizing the KL-divergence regularized loss function. Our proposed approach outperforms related state-of-the-art baseline methods in terms of average robust performance.
arXiv Detail & Related papers (2023-08-01T06:16:18Z)
Adversarial Training Should Be Cast as a Non-Zero-Sum Game [121.95628660889628]
Two-player zero-sum paradigm of adversarial training has not engendered sufficient levels of robustness. We show that the commonly used surrogate-based relaxation used in adversarial training algorithms voids all guarantees on robustness. A novel non-zero-sum bilevel formulation of adversarial training yields a framework that matches and in some cases outperforms state-of-the-art attacks.
arXiv Detail & Related papers (2023-06-19T16:00:48Z)
Distributed Adversarial Training to Robustify Deep Neural Networks at Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification. To defend against such attacks, an effective approach, known as adversarial training (AT), has been shown to mitigate robust training. We propose a large-batch adversarial training framework implemented over multiple machines.
arXiv Detail & Related papers (2022-06-13T15:39:43Z)
Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness. We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
arXiv Detail & Related papers (2020-10-26T04:44:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.