Closing the Distribution Gap in Adversarial Training for LLMs
- URL: http://arxiv.org/abs/2602.15238v2
- Date: Wed, 18 Feb 2026 17:57:10 GMT
- Title: Closing the Distribution Gap in Adversarial Training for LLMs
- Authors: Chengzhi Hu, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn
- Abstract summary: Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. We argue that current adversarial training algorithms minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. We propose Distributional Adversarial Training (DAT) to approximate the true joint distribution of prompts and responses.
- Score: 50.33186122381395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation in current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training, DAT. We leverage Diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling generation of diverse, high-likelihood samples that address generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.
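The continuous adversarial training that DAT combines with distributional sampling perturbs inputs in a continuous space rather than over discrete tokens. As a minimal sketch of that attack loop, the toy linear classifier, logistic loss, and L2 projection below are illustrative stand-ins (not the paper's implementation, where the perturbation would act on LLM token embeddings):

```python
import numpy as np

# Toy sketch of a continuous (embedding-space) adversarial attack:
# gradient *ascent* on the loss w.r.t. the input, projected to an
# L2 ball. The linear "model" is a hypothetical stand-in for an LLM.

rng = np.random.default_rng(0)
w = rng.normal(size=8)  # frozen toy model weights

def loss(x, y):
    # logistic loss for a label y in {-1, +1}
    return np.log1p(np.exp(-y * (w @ x)))

def grad_x(x, y):
    # analytic gradient of the logistic loss w.r.t. the input x
    s = 1.0 / (1.0 + np.exp(y * (w @ x)))
    return -y * s * w

def pgd_attack(x, y, eps=0.5, step=0.1, iters=10):
    # ascend the loss, keeping the perturbation inside ||delta|| <= eps
    delta = np.zeros_like(x)
    for _ in range(iters):
        delta += step * grad_x(x + delta, y)
        norm = np.linalg.norm(delta)
        if norm > eps:
            delta *= eps / norm  # project back onto the L2 ball
    return delta

x = rng.normal(size=8)
y = 1.0
delta = pgd_attack(x, y)
# the perturbed input incurs at least the clean loss
adv_gap = loss(x + delta, y) - loss(x, y)
```

In adversarial training proper, the model would then be updated to minimize the loss on `x + delta`, alternating attack and defense steps; DAT additionally draws the prompts themselves from a diffusion-model approximation of the data distribution.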
Related papers
- Deep Leakage with Generative Flow Matching Denoiser [54.05993847488204]
We introduce a new deep leakage (DL) attack that integrates a generative Flow Matching (FM) prior into the reconstruction process. Our approach consistently outperforms state-of-the-art attacks across pixel-level, perceptual, and feature-based similarity metrics.
arXiv Detail & Related papers (2026-01-21T14:51:01Z)
- Data-regularized Reinforcement Learning for Diffusion Models at Scale [99.01056178660538]
We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. With over a million GPU hours of experiments and ten thousand double-blind evaluations, we demonstrate that DDRL significantly improves rewards while alleviating the reward hacking seen in RL.
arXiv Detail & Related papers (2025-12-03T23:45:07Z)
- Retracing the Past: LLMs Emit Training Data When They Get Lost [18.852558767604823]
Memorization of training data in large language models poses significant privacy and copyright concerns. This paper introduces Confusion-Inducing Attacks (CIA), a principled framework for extracting memorized data.
arXiv Detail & Related papers (2025-10-27T03:48:24Z)
- MixAT: Combining Continuous and Discrete Adversarial Training for LLMs [10.570402333857261]
MixAT is a novel method that combines stronger discrete and faster continuous attacks during training. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead.
arXiv Detail & Related papers (2025-05-22T17:32:50Z)
- Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses.
C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z)
- DSRM: Boost Textual Adversarial Training with Distribution Shift Risk Minimization [36.10642858867033]
Adversarial training is one of the best-performing methods in improving the robustness of deep language models.
We introduce a novel, effective procedure that instead performs adversarial training with only clean data.
Our approach requires zero adversarial samples for training and reduces time consumption by up to 70% compared to current best-performing adversarial training methods.
arXiv Detail & Related papers (2023-06-27T02:46:08Z)
- Learning Diverse Representations for Fast Adaptation to Distribution Shift [78.83747601814669]
We present a method for learning multiple models, incorporating an objective that pressures each to learn a distinct way to solve the task.
We demonstrate our framework's ability to facilitate rapid adaptation to distribution shift.
arXiv Detail & Related papers (2020-06-12T12:23:50Z)
- Adversarial Distributional Training for Robust Deep Learning [53.300984501078126]
Adversarial training (AT) is among the most effective techniques to improve model robustness by augmenting training data with adversarial examples.
Most existing AT methods adopt a specific attack to craft adversarial examples, leading to unreliable robustness against other unseen attacks.
In this paper, we introduce adversarial distributional training (ADT), a novel framework for learning robust models.
arXiv Detail & Related papers (2020-02-14T12:36:59Z)
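The distributional idea behind ADT, scoring a whole distribution of perturbations by its expected loss rather than a single worst case, can be illustrated with a toy Monte Carlo estimate. The Gaussian perturbation family and quadratic loss below are assumptions for illustration, not the paper's parametrization:

```python
import numpy as np

# Hedged sketch of the ADT idea: evaluate a *distribution* of
# perturbations by its expected loss, estimated with Monte Carlo
# samples. A stronger perturbation distribution is one with higher
# expected loss.

rng = np.random.default_rng(0)

def loss(x):
    return float(np.sum(x ** 2))  # toy loss: larger ||x|| = higher loss

def expected_adv_loss(x, mean, scale, n=2000):
    # Monte Carlo estimate of E_{delta ~ N(mean, scale^2 I)}[loss(x + delta)]
    deltas = mean + scale * rng.normal(size=(n, x.size))
    return float(np.mean([loss(x + d) for d in deltas]))

x = np.zeros(4)
weak = expected_adv_loss(x, mean=np.zeros(4), scale=0.1)
strong = expected_adv_loss(x, mean=np.ones(4), scale=0.1)  # shifted family
# the shifted perturbation distribution induces a higher expected loss
```

In ADT proper, the parameters of the perturbation distribution are themselves optimized to maximize this expectation while the model minimizes it, yielding a distributional minimax game instead of a pointwise one.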