Fugu-MT paper translation (abstract): Do Thinking Tokens Help with Safety?

Paper summary: Do Thinking Tokens Help with Safety?

arxiv url: http://arxiv.org/abs/2606.25013v1
Date: Tue, 23 Jun 2026 17:59:01 GMT
Status: Translation complete
Last updated in system: 2026-06-25 17:05:30.1068
Title: Do Thinking Tokens Help with Safety?
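Title (reference translation): Do Thinking Tokens Help with Safety?
Authors: Narutatsu Ri, Abhishek Panigrahi, Sanjeev Arora
Abstract summary: We show that safety behavior in current reasoning models is much less deliberative than commonly assumed. We also find that existing inference-time and training-based safety interventions, despite being motivated by the goal of inducing deliberation, largely shift model behavior toward over-refusal.
Reference score (independently computed attention score): 34.336035944909746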
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Today's reasoning models use thinking tokens to attain stronger performance on benchmarks than their instruction-tuned counterparts. It is also generally believed that this more "deliberative" mode should improve alignment and safety, by providing the model a safe space to consider whether its planned answer to a request violates its safety principles. We present evidence that this intuition is not always correct. Across frontier open-weight reasoning models spanning GPT-OSS, Qwen, Olmo, and Phi families, we find that the eventual refusal/compliance outcome is already strongly predictable via a trained head on the first token's hidden representation ($0.84$-$0.95$ AUROC and $\sim88\%$ balanced accuracy for predicting refusal/compliance) before any visible thinking. The thinking process turns out to be more akin to prefix completion than to deliberative revision, with the final outcome rarely changing after the first $\sim20\%$ of thinking, despite giving the appearance of deliberation at the text level ($\sim74\%$ of text-level deliberations occur when the response distribution is already locked to one refusal/compliance side). We also find that existing inference-time and training-based safety interventions, despite being motivated by the goal of inducing deliberation, largely shift model behavior toward over-refusal while suppressing already-scarce deliberation signals. Our results suggest that safety behavior in current reasoning models is much less deliberative than commonly assumed, and highlight the need for methods that induce real safety deliberation.
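Abstract (reference translation): Today's reasoning models use thinking tokens to outperform their instruction-tuned counterparts on benchmarks. It is also widely believed that this more "deliberative" mode should improve alignment and safety by giving the model a safe space in which to consider whether its planned answer to a request violates its safety principles. We present evidence that this intuition does not always hold. Across frontier open-weight reasoning models from the GPT-OSS, Qwen, Olmo, and Phi families, the eventual refusal/compliance outcome is already strongly predictable, before any visible thinking, by a trained head on the hidden representation of the first token ($0.84$-$0.95$ AUROC and $\sim88\%$ balanced accuracy). The thinking process resembles prefix completion more than deliberative revision: the final outcome rarely changes after the first $\sim20\%$ of thinking, even though the text gives the appearance of deliberation ($\sim74\%$ of text-level deliberations occur when the response distribution is already locked to one refusal/compliance side). We also find that existing inference-time and training-based safety interventions, although motivated by the goal of inducing deliberation, largely shift model behavior toward over-refusal while suppressing deliberation signals that were already scarce. These results suggest that safety behavior in current reasoning models is much less deliberative than commonly assumed, and they highlight the need for methods that induce real safety deliberation.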
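
The abstract describes predicting refusal/compliance from the first token's hidden representation with a trained head, scored by AUROC and balanced accuracy. The sketch below illustrates that kind of evaluation only. It is not the authors' method: the "hidden states" are synthetic Gaussian vectors, the head is assumed to be a plain logistic-regression probe, and every name (`train_probe`, `auroc`, `balanced_accuracy`, the dimensions and separation strength) is hypothetical.

```python
import numpy as np


def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) form; assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u_stat = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


def balanced_accuracy(pred, labels):
    """Mean of the true-positive rate and the true-negative rate."""
    tpr = (pred[labels == 1] == 1).mean()
    tnr = (pred[labels == 0] == 0).mean()
    return (tpr + tnr) / 2


def train_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe fit by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y  # gradient of the log loss w.r.t. the logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


# Synthetic stand-in for first-token hidden states (assumption, not real data):
# label 1 = refusal, label 0 = compliance, separated along one random direction.
rng = np.random.default_rng(0)
n, d = 2000, 64
y = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d)) + np.outer(2 * y - 1, direction)

# Hold out the last 500 examples for evaluation.
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

w, b = train_probe(X_train, y_train)
logits = X_test @ w + b
auc = auroc(logits, y_test)
bal_acc = balanced_accuracy((logits > 0).astype(int), y_test)
print(f"AUROC={auc:.3f}  balanced accuracy={bal_acc:.3f}")
```

With real models, `X` would instead hold hidden states extracted at the first generated token and `y` the eventual refusal/compliance label; how the paper extracts and labels these is not stated in the abstract.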

Related papers