MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
- URL: http://arxiv.org/abs/2505.16947v2
- Date: Tue, 28 Oct 2025 09:41:22 GMT
- Title: MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
- Authors: Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev
- Abstract summary: MixAT is a novel method that combines stronger discrete and faster continuous attacks during training. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead.
- Score: 10.570402333857261
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs can still consistently force harmful generations. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. At the same time, despite their effectiveness and generalization capabilities, training with continuous perturbations does not always capture the full spectrum of vulnerabilities exploited by discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at https://github.com/insait-institute/MixAT.
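The abstract's proposed At Least One Attack Success Rate (ALO-ASR) counts a prompt as compromised if any attack in the evaluation suite succeeds on it, capturing worst-case vulnerability across attacks. A minimal sketch of that aggregation, assuming a boolean per-prompt, per-attack outcome matrix (the function name and data layout are illustrative, not taken from the paper's code):

```python
# Hedged sketch: computing ALO-ASR from per-attack, per-prompt outcomes.
# success[i][j] is True if attack j forced a harmful generation on prompt i.

def alo_asr(success):
    """A prompt counts as broken if at least one attack succeeds on it;
    ALO-ASR is the fraction of broken prompts."""
    broken = [any(row) for row in success]
    return sum(broken) / len(success)

# Three prompts, two attacks: prompts 0 and 2 are each broken by one
# attack, prompt 1 resists both.
outcomes = [
    [True, False],
    [False, False],
    [False, True],
]
print(alo_asr(outcomes))  # 2 of 3 prompts broken -> ~0.667
```

Because ALO-ASR takes a union over attacks, it is lower-bounded by the best single attack's ASR, which is why the paper's gap (< 20% vs. > 50% for prior defenses) is a worst-case comparison.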
Related papers
- Closing the Distribution Gap in Adversarial Training for LLMs [50.33186122381395]
Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. We argue that current adversarial training algorithms minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. We propose Distributional Adversarial Training (DAT) to approximate the true joint distribution of prompts and responses.
arXiv Detail & Related papers (2026-02-16T22:34:52Z) - Deep Leakage with Generative Flow Matching Denoiser [54.05993847488204]
We introduce a new deep leakage (DL) attack that integrates a generative Flow Matching (FM) prior into the reconstruction process. Our approach consistently outperforms state-of-the-art attacks across pixel-level, perceptual, and feature-based similarity metrics.
arXiv Detail & Related papers (2026-01-21T14:51:01Z) - MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming [38.25556351567948]
We propose the Multi-Turn Safety Alignment (MTSA) framework for securing large language models. A red-team model learns thought-guided multi-round jailbreak attacks to generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities through interaction.
arXiv Detail & Related papers (2025-05-22T08:22:57Z) - Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
arXiv Detail & Related papers (2025-02-03T18:59:01Z) - Adversarial Vulnerabilities in Large Language Models for Time Series Forecasting [14.579802892916101]
Large Language Models (LLMs) have recently demonstrated significant potential in time series forecasting. However, their robustness and reliability in real-world applications remain under-explored. We introduce a targeted adversarial attack framework for LLM-based time series forecasting.
arXiv Detail & Related papers (2024-12-11T04:53:15Z) - Robust LLM safeguarding via refusal feature adversarial training [15.76605079209956]
Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. We propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs adversarial training. Experimental results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks.
arXiv Detail & Related papers (2024-09-30T08:41:39Z) - Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses.
C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z) - Towards Robust Federated Learning via Logits Calibration on Non-IID Data [49.286558007937856]
Federated learning (FL) is a privacy-preserving distributed management framework based on collaborative model training of distributed devices in edge networks.
Recent studies have shown that FL is vulnerable to adversarial examples, leading to a significant drop in its performance.
In this work, we adopt the adversarial training (AT) framework to improve the robustness of FL models against adversarial example (AE) attacks.
arXiv Detail & Related papers (2024-03-05T09:18:29Z) - Effective Targeted Attacks for Adversarial Self-Supervised Learning [58.14233572578723]
Unsupervised adversarial training (AT) has been highlighted as a means of achieving robustness in models without any label information.
We propose a novel positive mining for targeted adversarial attack to generate effective adversaries for adversarial SSL frameworks.
Our method demonstrates significant enhancements in robustness when applied to non-contrastive SSL frameworks, and less but consistent robustness improvements with contrastive SSL frameworks.
arXiv Detail & Related papers (2022-10-19T11:43:39Z) - RelaxLoss: Defending Membership Inference Attacks without Losing Utility [68.48117818874155]
We propose a novel training framework based on a relaxed loss with a more achievable learning target.
RelaxLoss is applicable to any classification model with added benefits of easy implementation and negligible overhead.
Our approach consistently outperforms state-of-the-art defense mechanisms in terms of resilience against MIAs.
arXiv Detail & Related papers (2022-07-12T19:34:47Z) - Self-Progressing Robust Training [146.8337017922058]
Current robust training methods such as adversarial training explicitly use an "attack" to generate adversarial examples.
We propose a new framework called SPROUT, self-progressing robust training.
Our results shed new light on scalable, effective and attack-independent robust training methods.
arXiv Detail & Related papers (2020-12-22T00:45:24Z) - FAT: Federated Adversarial Training [5.287156503763459]
Federated learning (FL) is one of the most important paradigms addressing privacy and data governance issues in machine learning (ML).
We take the first known steps towards federated adversarial training (FAT) combining both methods to reduce the threat of evasion during inference while preserving the data privacy during training.
arXiv Detail & Related papers (2020-12-03T09:47:47Z) - Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness.
We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
arXiv Detail & Related papers (2020-10-26T04:44:43Z) - Membership Inference Attacks and Defenses in Classification Models [19.498313593713043]
We study the membership inference (MI) attack against classifiers.
We find that a model's vulnerability to MI attacks is tightly related to the generalization gap.
We propose a defense against MI attacks that aims to close the gap by intentionally reducing the training accuracy.
arXiv Detail & Related papers (2020-02-27T12:35:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.