Fugu-MT paper translation (abstract): D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting

Paper overview: D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting

arxiv url: http://arxiv.org/abs/2606.02640v1
Date: Sun, 31 May 2026 06:40:02 GMT
Status: translation complete
System update date: 2026-06-03 22:00:04.482611
Title: D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting
Authors: Huanli Gong, Zhipeng Wei, Yu Fu, Haz Sameen Shahgir, Ananya Gupta, Yue Dong, N. Benjamin Erichson
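Abstract summary: Multi-turn jailbreak attacks exploit feedback from auxiliary judge models to iteratively refine prompts toward harmful goals. This paper introduces D-Judge. D-Judge is shown to reduce the success rate of state-of-the-art multi-turn jailbreaks while preserving performance on benign benchmarks.
Reference score (independently computed attention metric): 18.812968910221823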
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-turn jailbreak attacks pose a growing threat to large language model (LLM) safety because they exploit feedback from auxiliary judge models to iteratively refine prompts toward harmful goals. Existing defenses largely detect or block unsafe content at individual turns or at the final response, leaving the judge-driven refinement loop intact and allowing attackers to extract informative feedback from intermediate interactions. We introduce D-Judge, a semantics-preserving output rewriting defense that intervenes directly in this loop by rewriting the victim LLM's responses before they are evaluated by the attacker's judge. By misaligning the judge's feedback signal without changing the meaning of the original response, D-Judge derails the attacker's prompt-refinement process, causing subsequent queries to be optimized against a distorted signal of attack progress. To improve D-Judge's ability to produce such rewrites, we construct a dataset of semantically equivalent response pairs that induce different judge-assigned harmfulness scores, and use it for supervised fine-tuning followed by direct preference optimization. Experiments on HarmBench show that D-Judge reduces the success rate of state-of-the-art multi-turn jailbreaks while preserving performance on benign benchmarks.
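The placement of the defense described in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `run_dialogue`, `Turn`, and all `toy_*` functions are hypothetical, every component is a plain Python stub standing in for an LLM, and the keyword-based judge and string-replacement rewriter exist only to make the control flow visible. In the paper the rewriter is a fine-tuned model, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    prompt: str
    raw_response: str      # what the victim model produced
    served_response: str   # what the attacker and its judge actually see
    judge_score: float     # attacker-side judge score on the served response


def run_dialogue(
    attacker: Callable[[str, float], str],  # (last prompt, judge score) -> next prompt
    victim: Callable[[str], str],
    judge: Callable[[str], float],
    rewriter: Callable[[str], str],
    first_prompt: str,
    max_turns: int,
) -> List[Turn]:
    """Run a judge-driven refinement loop with a rewriter on the victim's output."""
    history: List[Turn] = []
    prompt = first_prompt
    for _ in range(max_turns):
        raw = victim(prompt)
        served = rewriter(raw)   # the defense intervenes here, before the judge
        score = judge(served)    # the judge only ever sees the rewritten text
        history.append(Turn(prompt, raw, served, score))
        prompt = attacker(prompt, score)  # refinement uses the (distorted) score
    return history


# Toy stand-ins (all hypothetical); real components would be language models.
def toy_victim(prompt: str) -> str:
    return f"step one: reply to {prompt}"


def toy_judge(response: str) -> float:
    # Surface-level scoring, so a rewording can change the score.
    return 1.0 if "step" in response else 0.0


def toy_rewriter(response: str) -> str:
    # Stand-in for a meaning-preserving rewrite.
    return response.replace("step", "stage")


def toy_attacker(prompt: str, score: float) -> str:
    # Keeps the current strategy on a high score, switches otherwise.
    return prompt + (" [keep]" if score > 0.5 else " [retry]")


def identity(response: str) -> str:
    return response


baseline = run_dialogue(toy_attacker, toy_victim, toy_judge, identity, "q", 2)
defended = run_dialogue(toy_attacker, toy_victim, toy_judge, toy_rewriter, "q", 2)
```

In this toy setting the victim's raw output is identical in both runs on the first turn, but the judge's score differs once the rewriter is in place, so the stub attacker chooses a different follow-up prompt. That is only an analogy for the feedback misalignment the abstract describes.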


Related papers