Fugu-MT Paper Translation (Abstract): Chain-of-Thought Hijacking

Paper overview: Chain-of-Thought Hijacking

arxiv url: http://arxiv.org/abs/2510.26418v1
Date: Thu, 30 Oct 2025 12:10:03 GMT
Status: translation complete
Last updated (in system): 2025-10-31 16:05:09.800909
Title: Chain-of-Thought Hijacking
Title (reference translation): Chain-of-Thought Hijacking
Authors: Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez
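Abstract summary: Introduces Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. Across HarmBench, CoT Hijacking reaches attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively.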
Reference score (independently computed attention score): 26.527942827274057
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large reasoning models (LRMs) achieve higher task performance by allocating more inference-time compute, and prior works suggest this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. Across HarmBench, CoT Hijacking reaches a 99%, 94%, 100%, and 94% attack success rate (ASR) on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively - far exceeding prior jailbreak methods for LRMs. To understand the effectiveness of our attack, we turn to a mechanistic analysis, which shows that mid layers encode the strength of safety checking, while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning - explicit CoT - can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
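Abstract (reference translation): Large reasoning models (LRMs) achieve higher task performance by allocating more inference-time compute, and prior work suggests that this scaled reasoning may also strengthen safety by improving refusal. We find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. Across HarmBench, CoT Hijacking reaches attack success rates (ASR) of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively. To understand why the attack works, we carry out a mechanistic analysis, which shows that middle layers encode the strength of safety checking while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of the attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that explicit CoT, the most interpretable form of reasoning, can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.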
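
The abstract's claim that long benign CoT dilutes safety signals "by shifting attention away from harmful tokens" can be illustrated with plain softmax arithmetic. The sketch below is a toy model and not the paper's analysis: it assumes a single attention query, a fixed hypothetical logit for every harmful token and a lower fixed logit for every benign token, and reports the share of attention mass that lands on the harmful tokens as the number of benign tokens grows. The function name, token counts, and logit values are all illustrative assumptions.

```python
import math


def harmful_attention_share(n_harmful, n_benign,
                            harmful_logit=2.0, benign_logit=0.0):
    """Share of softmax attention mass on harmful tokens (toy model).

    Assumes one query and identical logits within each token group;
    both logit values are hypothetical, not measured from any model.
    """
    harmful_mass = n_harmful * math.exp(harmful_logit)
    benign_mass = n_benign * math.exp(benign_logit)
    return harmful_mass / (harmful_mass + benign_mass)


if __name__ == "__main__":
    # Keep the harmful span fixed and lengthen the benign padding.
    for n_benign in (0, 100, 1000, 10000):
        share = harmful_attention_share(n_harmful=20, n_benign=n_benign)
        print(f"benign tokens={n_benign:>6}  harmful share={share:.3f}")
```

Even though each harmful token keeps a higher logit than any benign token in this toy setup, the normalisation in softmax means the total share on the harmful span shrinks as padding is added. Whether real models behave this way is an empirical question that the paper addresses with its own mechanistic analysis.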

Related papers