Fugu-MT Paper Translation (Abstract): Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

Paper overview: Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

arxiv url: http://arxiv.org/abs/2603.11331v1
Date: Wed, 11 Mar 2026 21:48:03 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:25.663791
Title: Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
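Title (reference translation): Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
Authors: Indranil Halder, Annesya Banerjee, Cengiz Pehlevan
Abstract summary: Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. This paper proposes a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime. Within this framework, prompt injection-based jailbreaking is analyzed.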
Reference score (independently computed attention level): 30.86966284669791
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. To explain this phenomenon, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. Within this framework, we analyze prompt injection-based jailbreaking. Short injected prompts correspond to a weak magnetic field aligned towards unsafe cluster centers and yield a power-law scaling of attack success rate with the number of inference-time samples, while long injected prompts, i.e., strong magnetic field, yield exponential scaling. We derive these behaviors analytically and confirm them empirically on large language models. This transition between two regimes is due to the appearance of an ordered phase in the spin chain under a strong magnetic field, which suggests that the injected jailbreak prompt enhances adversarial order in the language model.
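Abstract (reference translation): Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Experiments show that adversarial prompt-injection attacks can raise the attack success rate from the slow polynomial growth seen without injection to exponential growth in the number of inference-time samples. To explain this phenomenon, the authors propose a theoretical generative model of proxy language based on a spin-glass system operating in a replica-symmetry-breaking regime, in which generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. Within this framework, they analyze prompt injection-based jailbreaking. Short injected prompts correspond to a weak magnetic field aligned toward the unsafe cluster centers and produce power-law scaling of the attack success rate with the number of inference-time samples, whereas long injected prompts, i.e., a strong magnetic field, produce exponential scaling. These behaviors are derived analytically and confirmed empirically on large language models. The transition between the two regimes is due to the appearance of an ordered phase in the spin chain under a strong magnetic field, which suggests that the injected jailbreak prompt enhances adversarial order in the language model.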
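
Illustrative note: the contrast between power-law and exponential scaling in the number of samples can be seen in a purely statistical toy model. The sketch below is not the paper's spin-glass model; it is an assumption made here for illustration only. It assumes that each of k independent samples succeeds with some probability p, and compares (a) a single fixed p, where the probability that all k samples fail decays exponentially in k, with (b) a p drawn from a Beta(a, b) distribution, where the mean failure probability decays like k^(-a). The function names and parameter values are hypothetical.

```python
import math


def failure_fixed_p(p: float, k: int) -> float:
    """Probability that all k independent samples fail when each one
    succeeds with the same fixed probability p: (1 - p)^k."""
    return (1.0 - p) ** k


def failure_beta_mixture(a: float, b: float, k: int) -> float:
    """Mean probability that all k samples fail when p ~ Beta(a, b).

    E[(1 - p)^k] = B(a, b + k) / B(a, b)
                 = Gamma(b + k) Gamma(a + b) / (Gamma(b) Gamma(a + b + k)),
    which behaves like k^(-a) for large k (a power law).
    """
    log_ratio = (
        math.lgamma(b + k) + math.lgamma(a + b)
        - math.lgamma(b) - math.lgamma(a + b + k)
    )
    return math.exp(log_ratio)


def success_rate(failure: float) -> float:
    """Success rate after k samples is one minus the all-fail probability."""
    return 1.0 - failure


if __name__ == "__main__":
    # Assumed toy parameters: fixed p = 0.05 versus p ~ Beta(0.5, 1.0).
    for k in (1, 10, 100, 1000):
        exp_fail = failure_fixed_p(0.05, k)
        pow_fail = failure_beta_mixture(0.5, 1.0, k)
        print(f"k={k:5d}  fixed-p failure={exp_fail:.3e}  "
              f"Beta-mixture failure={pow_fail:.3e}")
```

In this toy setting, doubling k squares the fixed-p failure probability but only multiplies the Beta-mixture failure probability by roughly 2^(-a), which is the qualitative difference between the two regimes named in the title.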

Related papers