Fugu-MT Paper Translation (Abstract): Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training

Paper summary: Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training

arxiv url: http://arxiv.org/abs/2510.20956v1
Date: Thu, 23 Oct 2025 19:34:24 GMT
Status: translation complete
System update date: 2025-10-28 09:00:15.304914
Title: Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
Authors: Zheng-Xin Yong, Stephen H. Bach
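Abstract summary: After benign reasoning training, RLMs use multiple strategies to circumvent their own safety guardrails. Many open-weight RLMs, including DeepSeek-R1-distilled, s1.1, Phi-4-mini-reasoning, and Nemotron, suffer from self-jailbreaking.
Reference score (independently computed attention score): 16.077654900815947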
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We discover a novel and surprising phenomenon of unintentional misalignment in reasoning language models (RLMs), which we call self-jailbreaking. Specifically, after benign reasoning training on math or code domains, RLMs will use multiple strategies to circumvent their own safety guardrails. One strategy is to introduce benign assumptions about users and scenarios to justify fulfilling harmful requests. For instance, an RLM reasons that harmful requests like "outline a strategy for stealing customers' credit card information from a retail store" could be associated with the benign intent of "a security professional trying to test defense," despite no such benign context being provided as input. We observe that many open-weight RLMs, including DeepSeek-R1-distilled, s1.1, Phi-4-mini-reasoning, and Nemotron, suffer from self-jailbreaking despite being aware of the harmfulness of the requests. We also provide a mechanistic understanding of self-jailbreaking: RLMs are more compliant after benign reasoning training, and after self-jailbreaking, models appear to perceive malicious requests as less harmful in the CoT, thus enabling compliance with them. To mitigate self-jailbreaking, we find that including minimal safety reasoning data during training is sufficient to ensure RLMs remain safety-aligned. Our work provides the first systematic analysis of self-jailbreaking behavior and offers a practical path forward for maintaining safety in increasingly capable RLMs.
