MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
- URL: http://arxiv.org/abs/2505.16947v1
- Date: Thu, 22 May 2025 17:32:50 GMT
- Title: MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
- Authors: Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev,
- Abstract summary: MixAT is a novel method that combines stronger discrete and faster continuous attacks during training.<n>We show MixAT achieves substantially better robustness (ALO-ASR 20%) compared to prior defenses.<n>Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior-accuracy tradeoff with minimal computational overhead.
- Score: 2.679689033125693
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite recent efforts in Large Language Models (LLMs) safety and alignment, current adversarial attacks on frontier LLMs are still able to force harmful generations consistently. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at https://github.com/insait-institute/MixAT.
Related papers
- MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming [38.25556351567948]
textbfMulti-textbfTurn textbfSafety textbfAlignment (ourapproach) framework for securing large language models.<n>Red-team model learns about thought-guided multi-round jailbreak attacks to generate adversarial prompts.<n> adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities in interaction.
arXiv Detail & Related papers (2025-05-22T08:22:57Z) - Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation.<n>Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
arXiv Detail & Related papers (2025-02-03T18:59:01Z) - Robust LLM safeguarding via refusal feature adversarial training [15.76605079209956]
Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses.<n>We propose Refusal Feature Adrial Training (ReFAT), a novel algorithm that efficiently performs adversarial training.<n>Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks.
arXiv Detail & Related papers (2024-09-30T08:41:39Z) - Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses.
C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z) - Towards Robust Federated Learning via Logits Calibration on Non-IID Data [49.286558007937856]
Federated learning (FL) is a privacy-preserving distributed management framework based on collaborative model training of distributed devices in edge networks.
Recent studies have shown that FL is vulnerable to adversarial examples, leading to a significant drop in its performance.
In this work, we adopt the adversarial training (AT) framework to improve the robustness of FL models against adversarial example (AE) attacks.
arXiv Detail & Related papers (2024-03-05T09:18:29Z) - Effective Targeted Attacks for Adversarial Self-Supervised Learning [58.14233572578723]
unsupervised adversarial training (AT) has been highlighted as a means of achieving robustness in models without any label information.
We propose a novel positive mining for targeted adversarial attack to generate effective adversaries for adversarial SSL frameworks.
Our method demonstrates significant enhancements in robustness when applied to non-contrastive SSL frameworks, and less but consistent robustness improvements with contrastive SSL frameworks.
arXiv Detail & Related papers (2022-10-19T11:43:39Z) - RelaxLoss: Defending Membership Inference Attacks without Losing Utility [68.48117818874155]
We propose a novel training framework based on a relaxed loss with a more achievable learning target.
RelaxLoss is applicable to any classification model with added benefits of easy implementation and negligible overhead.
Our approach consistently outperforms state-of-the-art defense mechanisms in terms of resilience against MIAs.
arXiv Detail & Related papers (2022-07-12T19:34:47Z) - Self-Progressing Robust Training [146.8337017922058]
Current robust training methods such as adversarial training explicitly uses an "attack" to generate adversarial examples.
We propose a new framework called SPROUT, self-progressing robust training.
Our results shed new light on scalable, effective and attack-independent robust training methods.
arXiv Detail & Related papers (2020-12-22T00:45:24Z) - FAT: Federated Adversarial Training [5.287156503763459]
Federated learning (FL) is one of the most important paradigms addressing privacy and data governance issues in machine learning (ML)
We take the first known steps towards federated adversarial training (FAT) combining both methods to reduce the threat of evasion during inference while preserving the data privacy during training.
arXiv Detail & Related papers (2020-12-03T09:47:47Z) - Membership Inference Attacks and Defenses in Classification Models [19.498313593713043]
We study the membership inference (MI) attack against classifiers.
We find that a model's vulnerability to MI attacks is tightly related to the generalization gap.
We propose a defense against MI attacks that aims to close the gap by intentionally reducing the training accuracy.
arXiv Detail & Related papers (2020-02-27T12:35:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.