Fugu-MT Paper Translation (Abstract): Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

Paper abstract: Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

arxiv url: http://arxiv.org/abs/2604.08557v2
Date: Mon, 13 Apr 2026 05:20:53 GMT
Status: Translation complete
Last updated in system: 2026-04-19 19:09:11.409764
Title: Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
Authors: Arth Singh
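Abstract summary: Safety alignment in diffusion language models (dLLMs) relies on a single load-bearing assumption. We show that re-masking committed refusal tokens and injecting a short affirmative prefix achieves 74-82% ASR on HarmBench. We call this attack TrajHijack; it is the first trajectory-level attack on dLLMs, requires no gradient computation, and generalizes across SFT and preference-optimized (VRPO) models.
Reference score (independently computed attention metric): 0.0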
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Safety alignment in diffusion language models (dLLMs) relies on a single load-bearing assumption: that committed tokens are permanent. We show that violating this assumption, by re-masking committed refusal tokens and injecting a short affirmative prefix, achieves 74-82% ASR on HarmBench across all three publicly available safety-tuned dLLMs, rising to 92-98% with a generic 8-token compliance prefix. We call this attack TrajHijack; it is the first trajectory-level attack on dLLMs, requires no gradient computation, and generalizes across SFT and preference-optimized (VRPO) models. Three findings emerge. First, the vulnerability is irreducibly two-component: re-masking alone (4.4%) and prefix alone (5.7%) both fail. Second, gradient optimization via a differentiable Gumbel-softmax chain consistently degrades ASR (41.5% vs. 76.1%), because continuous perturbations push token distributions off-manifold. Third, A2D (the strongest published dLLM defense) is more vulnerable to TrajHijack (89.9%) than the undefended model (76.1%): its silent-refusal training removes the contextual resistance that trajectory-level attacks must overcome, an effect we call the Defense Inversion Effect.
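The second finding refers to a Gumbel-softmax relaxation. As a rough illustration of why such a relaxation yields continuous rather than discrete token choices, here is a minimal sketch of the generic Gumbel-softmax trick in plain Python. It is not the paper's differentiable chain; the function name `gumbel_softmax`, the three-entry toy `logits`, and the temperature `tau` are assumptions made only for this example.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed (continuous) sample over a toy vocabulary.

    Generic Gumbel-softmax trick, used here only as an illustration;
    the name, arguments, and toy inputs are assumptions, not the paper's code.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform: g = -log(-log(u)).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [2.0, 1.0, 0.1]  # toy scores for a 3-token vocabulary (assumption)
soft = gumbel_softmax(logits, tau=1.0, rng=rng)

# A discrete token choice is a one-hot vector; the relaxed sample is not.
hard_index = max(range(len(soft)), key=soft.__getitem__)
one_hot = [1.0 if i == hard_index else 0.0 for i in range(len(soft))]

print("relaxed sample:", [round(p, 3) for p in soft])
print("one-hot choice:", one_hot)
```

The relaxed sample places nonzero weight on every entry, so it is a mixture of tokens rather than any single token. This may be one way to read the abstract's remark that continuous perturbations push token distributions off-manifold, though the paper itself should be consulted for the precise argument.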

Related papers