Fugu-MT Paper Translation (Abstract): Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

Paper abstract: Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

arxiv url: http://arxiv.org/abs/2603.11149v1
Date: Wed, 11 Mar 2026 17:38:08 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:25.548137
Title: Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models
Title (reference translation): Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models
Authors: Xiangwen Wang, Ananth Balashankar, Varun Chandrasekaran
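Abstract summary: Large language models remain vulnerable to jailbreak attacks, yet there is still no systematic understanding of how jailbreak success scales with attacker effort across methods, model families, and harm types. We initiate a scaling-law framework for jailbreaks by treating each attack as a compute-bounded optimization procedure and measuring progress on a shared FLOPs axis. The systematic evaluation spans four representative jailbreak paradigms: optimization-based attacks, self-refinement prompting, sampling-based selection, and genetic optimization.
Reference score (independently computed attention measure): 15.425738252512362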
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models remain vulnerable to jailbreak attacks, yet we still lack a systematic understanding of how jailbreak success scales with attacker effort across methods, model families, and harm types. We initiate a scaling-law framework for jailbreaks by treating each attack as a compute-bounded optimization procedure and measuring progress on a shared FLOPs axis. Our systematic evaluation spans four representative jailbreak paradigms, covering optimization-based attacks, self-refinement prompting, sampling-based selection, and genetic optimization, across multiple model families and scales on a diverse set of harmful goals. We investigate scaling laws that relate attacker budget to attack success score by fitting a simple saturating exponential function to FLOPs-success trajectories, and we derive comparable efficiency summaries from the fitted curves. Empirically, prompting-based paradigms tend to be the most compute-efficient compared to optimization-based methods. To explain this gap, we cast prompt-based updates into an optimization view and show via a same-state comparison that prompt-based attacks more effectively optimize in prompt space. We also show that attacks occupy distinct success-stealthiness operating points with prompting-based methods occupying the high-success, high-stealth region. Finally, we find that vulnerability is strongly goal-dependent: harms involving misinformation are typically easier to elicit than other non-misinformation harms.
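Abstract (reference translation): Large language models remain vulnerable to jailbreak attacks, yet there is still no systematic understanding of how jailbreak success scales with attacker effort across methods, model families, and harm types. We initiate a scaling-law framework for jailbreaks by treating each attack as a compute-bounded optimization procedure and measuring progress on a shared FLOPs axis. The systematic evaluation spans four representative jailbreak paradigms: optimization-based attacks, self-refinement prompting, sampling-based selection, and genetic optimization. We study scaling laws that relate attacker budget to attack success score by fitting a simple saturating exponential function to FLOPs-success trajectories, and we derive comparable efficiency summaries from the fitted curves. Empirically, prompting-based paradigms tend to be the most compute-efficient compared to optimization-based methods. To explain this gap, we cast prompt-based updates into an optimization view and show, through a same-state comparison, that prompt-based attacks optimize more effectively in prompt space. We also show that attacks occupy distinct success-stealthiness operating points, with prompting-based methods occupying the high-success, high-stealth region. Finally, vulnerability is strongly goal-dependent: harms involving misinformation are typically easier to elicit than other, non-misinformation harms.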
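
The abstract describes fitting a simple saturating exponential to FLOPs-success trajectories and reading efficiency summaries off the fitted curve, but it does not state the functional form or the fitting procedure. The sketch below is only an illustration under stated assumptions: the form s(C) = s_max * (1 - exp(-C / c0)), the grid-search fit, the function names, the "FLOPs to reach half of the asymptote" summary, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import math


def saturating_exp(flops, s_max, c0):
    """Assumed curve: s(C) = s_max * (1 - exp(-C / c0)). Not the paper's stated form."""
    return s_max * (1.0 - math.exp(-flops / c0))


def fit_saturating_exp(flops, scores, n_grid=400):
    """Least-squares fit of (s_max, c0) using a log-spaced grid over c0.

    For a fixed c0 the model is linear in s_max, so the best s_max has a
    closed form. The pair with the smallest squared error is returned.
    All FLOPs values must be positive.
    """
    lo = math.log(min(flops) / 10.0)
    hi = math.log(max(flops) * 10.0)
    best = None
    for i in range(n_grid):
        c0 = math.exp(lo + (hi - lo) * i / (n_grid - 1))
        basis = [1.0 - math.exp(-c / c0) for c in flops]
        denom = sum(b * b for b in basis)
        s_max = sum(b * y for b, y in zip(basis, scores)) / denom
        sse = sum((s_max * b - y) ** 2 for b, y in zip(basis, scores))
        if best is None or sse < best[0]:
            best = (sse, s_max, c0)
    return best[1], best[2]


def flops_to_fraction(c0, fraction=0.5):
    """Hypothetical efficiency summary: compute needed to reach
    `fraction` of the fitted asymptote s_max."""
    return -c0 * math.log(1.0 - fraction)


if __name__ == "__main__":
    # Synthetic, noise-free trajectory used purely for illustration.
    flops = [1e14 * 2 ** k for k in range(10)]
    scores = [saturating_exp(c, 0.8, 2.0e15) for c in flops]
    s_max, c0 = fit_saturating_exp(flops, scores)
    print(f"fitted s_max={s_max:.3f}, c0={c0:.3e}")
    print(f"FLOPs to reach half of s_max: {flops_to_fraction(c0):.3e}")
```

Under this assumed form, a smaller c0 means the attack approaches its asymptote with less compute, so two attacks could be compared by their fitted s_max (ceiling) and c0 (compute scale) on the same FLOPs axis.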

Related papers