Fugu-MT Paper Translation (Summary): The Great Pretender: A Stochasticity Problem in LLM Jailbreak

Paper summary: The Great Pretender: A Stochasticity Problem in LLM Jailbreak

arxiv url: http://arxiv.org/abs/2605.14418v1
Date: Thu, 14 May 2026 06:05:37 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.653245
Title: The Great Pretender: A Stochasticity Problem in LLM Jailbreak
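Title (reference translation): The Stochasticity Problem in LLM Jailbreak
Authors: Jean-Philippe Monteuuis, Cong Chen, Jonathan Petit
Abstract summary: We study the impact of stochasticity not only during attack evaluation but also during attack generation. Our evaluation framework, CAS-eval, shows that an attack's ASR can drop by up to 30 percentage points when a jailbreak prompt needs to succeed on more than one attempt.
Reference score (independently computed attention score): 4.092493997270006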
License: http://creativecommons.org/licenses/by/4.0/
Abstract: "Oh-Oh, yes, I'm the great pretender. Pretending that I'm doing well. My need is such, I pretend too much..." summarizes the state in the area of jailbreak creation and evaluation. You find this method to generate adversarial attacks proposed by a reputable institution (e.g., BoN from Anthropic or Crescendo from Microsoft Research). However, this method does not deliver on the promise claimed in the paper despite having top ASR scores against industry-grade LLMs. You successfully generate the jailbreak prompts against your target (open) model. However, the generated jailbreak prompt works against the target model with a 50% consecutive success rate (5 out of 10 attempts) despite having an 80% ASR (on paper) on the latest closed-source model (with a guardrail system)! This observation leads us to think. First, Attack Success Rate (ASR), the primary metric for LLM jailbreak benchmarking, is not a stable quantity. Second, published ASR numbers are therefore systematically inflated and incomparable across papers. Therefore, we wonder "Why a successful jailbreak prompt does not perform consistently well against a target model on which the prompts have been optimized?". To answer this question, we study the impact of stochasticity not only during attack evaluation but also during attack generation. Our evaluation includes several jailbreak attacks, models (different sizes and providers), and judges. In addition, we propose a new metric and two new frameworks (CAS-eval and CAS-gen). Our evaluation framework, CAS-eval, shows that an attack can have an ASR drop of up to 30 percentage points when a jailbreak prompt needs to succeed on more than one attempt. Thankfully, our attack generation framework (CAS-gen) improves previous jailbreak methods and helps them recover this loss of 30 percentage points!
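Abstract (reference translation): "Oh-Oh, yes, I'm the great pretender. Pretending that I'm doing well. My need is such, I pretend too much..." sums up the state of jailbreak creation and evaluation. You find a method for generating adversarial attacks proposed by a reputable institution (e.g., BoN from Anthropic or Crescendo from Microsoft Research). However, the method does not deliver on the promise made in the paper, despite top ASR scores against industry-grade LLMs. You successfully generate jailbreak prompts against your target (open) model. Yet the generated jailbreak prompt works against the target model with only a 50% consecutive success rate (5 out of 10 attempts), despite an 80% ASR (on paper) against the latest closed-source model (with a guardrail system). This observation leads us to think the following. First, Attack Success Rate (ASR), the primary metric for LLM jailbreak benchmarking, is not a stable quantity. Second, published ASR numbers are therefore systematically inflated and not comparable across papers. This raises the question: why does a successful jailbreak prompt not perform consistently against the target model on which it was optimized? To answer it, we study the impact of stochasticity not only during attack evaluation but also during attack generation. Our evaluation covers several jailbreak attacks, models (of different sizes and from different providers), and judges. In addition, we propose a new metric and two new frameworks (CAS-eval and CAS-gen). Our evaluation framework, CAS-eval, shows that an attack's ASR can drop by up to 30 percentage points when a jailbreak prompt needs to succeed on more than one attempt. Fortunately, our attack generation framework, CAS-gen, improves previous jailbreak methods and helps them recover this 30-percentage-point loss.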
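
The gap between a single-attempt ASR and a success rate that requires repeated success can be illustrated with a small Python sketch. The abstract does not define the proposed metric or the CAS-eval procedure, so the functions `single_attempt_asr` and `all_attempts_rate`, the verdict matrix, and the choice of three attempts are all hypothetical illustrations, not the authors' method.

```python
from typing import Sequence


def single_attempt_asr(outcomes: Sequence[Sequence[bool]]) -> float:
    """Fraction of prompts whose first attempt was judged successful."""
    if not outcomes:
        raise ValueError("outcomes must contain at least one prompt")
    return sum(1 for row in outcomes if row[0]) / len(outcomes)


def all_attempts_rate(outcomes: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of prompts judged successful on each of the first k attempts.

    Illustrative stand-in only; NOT the metric proposed in the paper.
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one prompt")
    if k < 1 or any(len(row) < k for row in outcomes):
        raise ValueError("every prompt needs at least k >= 1 attempts")
    return sum(1 for row in outcomes if all(row[:k])) / len(outcomes)


# Made-up judge verdicts: 5 jailbreak prompts, 3 attempts each.
outcomes = [
    [True, True, True],
    [True, False, True],
    [True, True, False],
    [False, True, True],
    [True, True, True],
]

asr = single_attempt_asr(outcomes)        # 4 of 5 prompts succeed on attempt 1
strict = all_attempts_rate(outcomes, 3)   # 2 of 5 prompts succeed on all 3 attempts
drop_points = (asr - strict) * 100        # difference in percentage points

print(f"single-attempt ASR: {asr:.0%}")
print(f"all-3-attempts rate: {strict:.0%}")
print(f"drop: {drop_points:.0f} percentage points")
```

With this invented data the stricter criterion lowers the reported rate even though no prompt changed, which is the kind of instability the abstract attributes to stochastic model responses.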


Related papers