Fugu-MT Paper Translation (Abstract): Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs

Paper abstract: Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs

arxiv url: http://arxiv.org/abs/2606.03647v1
Date: Tue, 02 Jun 2026 13:39:15 GMT
Status: Translation complete
Last updated in system: 2026-06-03 22:00:05.035291
Title: Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs
Authors: Vincent Limbach, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn
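Abstract summary: A flawed attack design can inflate robustness estimates, making deployment risk assessment and defense comparison unreliable. Indirect Harm Optimization (IHO) is a masked diffusion language model attacker trained via iterative preference optimization against a harmfulness judge. The results position IHO as a practical step toward the kind of standardized jailbreak evaluation that has improved reliability in the past.
Reference score (independently computed attention score): 47.53613000473204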
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Accurately evaluating adversarial robustness is a longstanding challenge. A flawed attack design can inflate robustness estimates, making deployment risk assessment and defense comparison unreliable. Historically, standardized attacks such as AutoAttack have largely resolved this for image classifiers, providing a reliable evaluation baseline for systematic comparison across defenses. However, no equivalent exists for LLM jailbreak evaluation yet, where designing such an attack is considerably more difficult. A reliable attack must, among other things, be black-box compatible, applicable to arbitrary defense pipelines, and efficient, which no existing method jointly satisfies. We introduce Indirect Harm Optimization (IHO), a masked diffusion language model attacker trained via iterative preference optimization against a harmfulness judge, requiring only black-box access to the target. The same method can be used without modification as a strong adaptive attack on individual behaviors, or as an efficient amortized policy that transfers to held-out behaviors and unseen target models without fine-tuning. Even against layered defenses, such as a Circuit Breaker-trained model combined with an auxiliary detector, IHO improves attack success considerably over state-of-the-art approaches, without any defense-specific adaptation. Our results position IHO as a practical step toward the kind of standardized jailbreak evaluation that has improved reliability in the past. Code and models are available on GitHub and Hugging Face.
