Fugu-MT Paper Translation (Summary): Jailbroken Frontier Models Retain Their Capabilities

Paper summary: Jailbroken Frontier Models Retain Their Capabilities

arxiv url: http://arxiv.org/abs/2605.00267v2
Date: Mon, 04 May 2026 18:25:35 GMT
Status: Translation complete
Last updated in system: 2026-05-06 14:45:21.224275
Title: Jailbroken Frontier Models Retain Their Capabilities
Authors: Daniel Zhu, Zihan Wang, Xuchan Bao, Jerry Wei,
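Abstract summary: The authors evaluate 28 jailbreaks on five benchmarks across models ranging from Haiku 4.5 to Opus 4.6. Haiku 4.5 loses an average of 33.1% on benchmark performance when jailbroken, while Opus 4.6 at max thinking effort loses only 7.7%. They recommend that safety cases for frontier models should not rely on a meaningful capability degradation from jailbreaks.
Reference score (independently computed attention score): 13.528820984495658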
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As language model safeguards become more robust, attackers are pushed toward developing increasingly complex jailbreaks. Prior work has found that this complexity imposes a "jailbreak tax" that degrades the target model's task performance. We show that this tax scales inversely with model capability and that the most advanced jailbreaks effectively yield no reduction in model capabilities. Evaluating 28 jailbreaks on five benchmarks across Claude models ranging in capability from Haiku 4.5 to Opus 4.6, we find Haiku 4.5 loses an average of 33.1% on benchmark performance when jailbroken, while Opus 4.6 at max thinking effort loses only 7.7%. We also observe that across all models, reasoning-heavy tasks display considerably more degradation than knowledge-recall tasks. Finally, Boundary Point Jailbreaking, currently the strongest jailbreak against deployed classifiers, achieves near-perfect classifier evasion with near-zero degradation across safeguarded models. We recommend that safety cases for frontier models should not rely on a meaningful capability degradation from jailbreaks.
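The abstract reports the "jailbreak tax" as an average loss in benchmark performance but does not say how that loss is computed. The sketch below shows one plausible reading, assuming the tax is the relative drop from the un-jailbroken baseline score, averaged over every (jailbreak, benchmark) pair. The function names, benchmark names, and scores are hypothetical and for illustration only; none of them come from the paper.

```python
from statistics import mean


def relative_loss(baseline: float, jailbroken: float) -> float:
    """Fraction of the baseline score lost under a jailbreak (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - jailbroken) / baseline


def average_tax(
    baseline: dict[str, float],
    jailbroken: dict[str, dict[str, float]],
) -> float:
    """Mean relative loss over all (jailbreak, benchmark) pairs."""
    losses = [
        relative_loss(baseline[bench], score)
        for scores in jailbroken.values()
        for bench, score in scores.items()
    ]
    return mean(losses)


# Hypothetical scores, not taken from the paper.
baseline = {"bench_a": 0.80, "bench_b": 0.50}
jailbroken = {
    "jailbreak_1": {"bench_a": 0.60, "bench_b": 0.50},
    "jailbreak_2": {"bench_a": 0.40, "bench_b": 0.25},
}

tax = average_tax(baseline, jailbroken)
print(f"average jailbreak tax: {tax:.1%}")
```

If the paper instead reports an absolute drop in percentage points, `relative_loss` would simply return `baseline - jailbroken`; the averaging step would stay the same.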

Related papers