DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models
- URL: http://arxiv.org/abs/2509.24296v1
- Date: Mon, 29 Sep 2025 05:17:10 GMT
- Title: DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models
- Authors: Zherui Li, Zheng Nie, Zhenhong Zhou, Yufei Guo, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, Jiaheng Zhang
- Abstract summary: We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: https://github.com/niez233/DiffuGuard.
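The abstract's first defense stage, Stochastic Annealing Remasking, injects controlled randomness into the otherwise greedy choice of which masked positions to unmask at each denoising step. The following is a minimal illustrative sketch of that idea, not the paper's actual implementation: the function name, the Gumbel-noise perturbation, and the linear annealing schedule are all assumptions for illustration.

```python
import numpy as np

def stochastic_annealing_remask(confidences, step, total_steps, k, tau0=1.0, rng=None):
    """Pick k masked positions to unmask at the current denoising step.

    Pure greedy remasking would take the k highest-confidence positions,
    which the paper identifies as a source of harmful bias. Here the
    log-confidences are perturbed with Gumbel noise whose scale anneals
    linearly to zero, so early steps explore alternative denoising paths
    while late steps converge to the greedy choice.
    """
    rng = np.random.default_rng() if rng is None else rng
    tau = tau0 * (1.0 - step / total_steps)      # anneals from tau0 down to 0
    noise = rng.gumbel(size=confidences.shape)
    scores = np.log(confidences + 1e-12) + tau * noise
    return np.argsort(scores)[-k:]               # indices of the k best scores

# Toy usage: 8 masked positions, unmask 3 per step.
conf = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4])
early = stochastic_annealing_remask(conf, step=0, total_steps=10, k=3)
late = stochastic_annealing_remask(conf, step=10, total_steps=10, k=3)
# At the final step tau == 0, so selection reduces to the greedy top-k.
```

The annealing schedule is the key design lever: too much noise degrades utility, while none at all reproduces the greedy bias the paper analyzes.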
Related papers
- A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode [51.43498132808724]
We show that Diffusion large language models (D-LLMs) have intrinsic robustness against jailbreak attacks. We identify a simple yet effective failure mode, termed context nesting, where harmful requests are embedded within structured benign contexts. We show that this simple strategy is sufficient to bypass D-LLMs' safety blessing, achieving state-of-the-art attack success rates.
arXiv Detail & Related papers (2026-01-30T23:08:14Z) - Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models [50.91504059485288]
We propose a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously. We develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching.
arXiv Detail & Related papers (2026-01-22T09:32:43Z) - Evaluating Adversarial Vulnerabilities in Modern Large Language Models [0.0]
This paper presents a comparative analysis of the susceptibility to jailbreak attacks of two leading publicly available Large Language Models (LLMs). The research utilized two main bypass strategies: 'self-bypass' and 'cross-bypass'. The success of an attack was determined by the generation of disallowed content, with successful jailbreaks assigned a severity score.
arXiv Detail & Related papers (2025-11-21T01:23:56Z) - Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability [5.650647159993238]
Diffusion language models (DLMs) generate tokens in parallel through iterative denoising. In this paper, we reveal that DLMs have a critical vulnerability stemming from their iterative denoising process. We propose a novel safety alignment method tailored to DLMs that trains models to generate safe responses from contaminated intermediate states.
arXiv Detail & Related papers (2025-10-01T06:35:23Z) - SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks [29.963044242980345]
Jailbreak attacks pose a serious threat to the safety of Large Language Models. We propose SafeLLM, a novel unlearning-based defense framework. We show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance.
arXiv Detail & Related papers (2025-08-21T02:39:14Z) - Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach [22.248911000455706]
We propose a novel unsupervised framework that formulates jailbreak detection as anomaly detection. LoD achieves state-of-the-art performance, with an average AUROC of 0.9951 and an improvement of up to 38.89% in the minimum AUROC over the strongest baselines.
arXiv Detail & Related papers (2025-08-08T16:13:28Z) - Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation [7.857304417560443]
We present a PArallel Decoding jailbreak (PAD) for diffusion-based language models. PAD achieves a jailbreak attack success rate of 97%, revealing significant safety vulnerabilities. Compared to autoregressive Large Language Models (LLMs), LLDMs generate harmful content 2x faster.
arXiv Detail & Related papers (2025-07-25T12:53:03Z) - Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data. Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix. Our method effectively injects backdoors into various LLMs for harmful content generation, even under the detection of powerful guardrail models.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models [21.02295266675853]
We propose a novel black-box jailbreak attack method, Analyzing-based Jailbreak (ABJ). ABJ comprises two independent attack paths, which exploit the model's multimodal reasoning capabilities to bypass safety mechanisms. Our work reveals a new type of safety risk and highlights the urgent need to mitigate implicit vulnerabilities in the model's reasoning process.
arXiv Detail & Related papers (2024-07-23T06:14:41Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - Watch the Watcher! Backdoor Attacks on Security-Enhancing Diffusion Models [65.30406788716104]
This work investigates the vulnerabilities of security-enhancing diffusion models.
We demonstrate that these models are highly susceptible to DIFF2, a simple yet effective backdoor attack.
Case studies show that DIFF2 can significantly reduce both post-purification and certified accuracy across benchmark datasets and models.
arXiv Detail & Related papers (2024-06-14T02:39:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.