Fugu-MT paper translation (abstract): MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

Paper abstract: MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

arxiv url: http://arxiv.org/abs/2606.04027v1
Date: Mon, 01 Jun 2026 18:10:21 GMT
Status: translation complete
Last updated in system: 2026-06-04 20:44:18.246863
Title: MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models
Authors: Yingzi Ma, Zhengyue Zhao, Xiaogeng Liu, Minhui Xue, Yue Zhao, Chaowei Xiao
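Abstract summary: MaskForge is a fully black-box adaptive attack that casts dLLM red-teaming as optimized search over a growing library of structural patterns. It achieves an average attack success rate of 79.3%, a 17.6% relative improvement over the strongest competing dLLM baseline.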
Reference score (independently computed attention measure): 53.05463623673949
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion large language models (dLLMs) generate text by iteratively denoising partially masked sequences under bidirectional context, exposing a safety surface distinct from autoregressive LLMs. Because mask tokens are native inputs and tokens are committed by confidence rather than position, harmful content can be induced through infilling and outside the monitored prefix. Existing jailbreaks either miss this native infill capability or rely on low-diversity mask-bearing templates applied uniformly across goals, with little structural adaptation or accumulated attack experience. We propose MaskForge, a fully black-box adaptive attack that casts dLLM red-teaming as optimized search over a growing library of structural patterns. MaskForge abstracts successful attempts into reusable schemas, selects goal-compatible patterns with a UCB bandit, and invokes a scorer-guided fallback when the current library fails. Successful attempts are distilled back into the pattern library, enabling experience to accumulate across goals. Across five public dLLMs and three benchmarks, MaskForge achieves an average attack success rate of 79.3%, a 17.6% relative improvement over the strongest competing dLLM baseline. The matured pattern library further transfers to AdvBench without any updates, achieving an 88.2% attack success rate and a 67% relative improvement over the strongest competing baseline.
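The abstract reports success rates together with relative improvements but does not state the baseline rates themselves. As a rough reading aid, the sketch below back-computes the baseline rate each pair of figures would imply. It assumes "relative improvement" means (new - baseline) / baseline; the function name `implied_baseline` is hypothetical, and because the published figures are rounded the results are approximate and are not numbers reported by the authors.

```python
def implied_baseline(asr_percent: float, relative_improvement: float) -> float:
    """Baseline rate implied by a new rate and its relative improvement.

    Assumes relative_improvement = (new - baseline) / baseline,
    so baseline = new / (1 + relative_improvement).
    """
    return asr_percent / (1.0 + relative_improvement)


# Figures taken from the abstract; the outputs are derived estimates only.
main_baseline = implied_baseline(79.3, 0.176)      # five dLLMs, three benchmarks
advbench_baseline = implied_baseline(88.2, 0.67)   # AdvBench transfer setting

print(f"Implied strongest baseline (average): {main_baseline:.1f}%")
print(f"Implied strongest baseline (AdvBench): {advbench_baseline:.1f}%")
```

Under that assumption the strongest competing baseline would sit at roughly 67% on average and roughly 53% on AdvBench, which gives absolute gaps of about 12 and 35 percentage points respectively.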
