Fugu-MT paper translation (abstract): Reliable Control-Point Selection for Steering Reasoning in Large Language Models

Paper overview: Reliable Control-Point Selection for Steering Reasoning in Large Language Models

arxiv url: http://arxiv.org/abs/2604.02113v1
Date: Thu, 02 Apr 2026 14:48:56 GMT
Status: translation complete
Last updated in system: 2026-04-03 14:21:10.864748
Title: Reliable Control-Point Selection for Steering Reasoning in Large Language Models
Authors: Haomin Zhuang, Hojun Yoo, Xiaonan Luo, Kehan Guo, Xiangliang Zhang
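Abstract summary: Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models. However, constructing effective vectors requires identifying genuine behavioral signals in the model's hidden states. Current methods detect these behaviors through keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. This work develops a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and shows that unstable boundaries dilute the steering signal.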
Reference score (independently computed attention score): 28.288321095634128
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models, but constructing effective vectors requires identifying genuine behavioral signals in the model's hidden states. For behaviors that can be toggled via prompts, this is straightforward. However, many reasoning behaviors, such as self-reflection, emerge spontaneously and resist prompt-level control. Current methods detect these behaviors through keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3% are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, we propose stability filtering, which retains only boundaries where the model consistently reproduces the target behavior. Combined with a content-subspace projection that removes residual question-specific noise, our method achieves 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0). Code is available at https://github.com/zhmzm/stability-steering.
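The stability-filtering idea in the abstract (re-generate from the same prefix and keep only boundaries where the behavior reappears consistently) can be illustrated with a minimal sketch. This is not the authors' implementation; their code is at the repository linked above. The keyword list, the `n_samples` and `threshold` values, and the helper names `shows_behavior`, `reproduction_rate`, `stability_filter`, and `make_toy_sampler` are all assumptions made for illustration, and the toy sampler stands in for a real model's re-generation call.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical keyword cues for "self-reflection"; the paper's actual list is not given here.
KEYWORDS = ("wait", "let me double-check", "on second thought")


def shows_behavior(text: str, keywords: Sequence[str] = KEYWORDS) -> bool:
    """Keyword-matching detector: True if any cue appears in the text."""
    lowered = text.lower()
    return any(k in lowered for k in keywords)


def reproduction_rate(prefix: str,
                      sample_fn: Callable[[str], str],
                      n_samples: int = 8) -> float:
    """Fraction of re-generations from `prefix` that show the behavior again.

    This is an empirical estimate of the context-dependent trigger probability.
    """
    hits = sum(shows_behavior(sample_fn(prefix)) for _ in range(n_samples))
    return hits / n_samples


def stability_filter(prefixes: Sequence[str],
                     sample_fn: Callable[[str], str],
                     n_samples: int = 8,
                     threshold: float = 0.75) -> List[str]:
    """Keep only prefixes whose reproduction rate reaches `threshold`.

    `threshold` is an assumed value, not one taken from the paper.
    """
    return [p for p in prefixes
            if reproduction_rate(p, sample_fn, n_samples) >= threshold]


def make_toy_sampler(trigger_prob: Dict[str, float],
                     seed: int = 0) -> Callable[[str], str]:
    """Toy stand-in for model re-generation (assumption, for demonstration only).

    Each prefix triggers the reflective continuation with its given probability.
    """
    rng = random.Random(seed)

    def sample(prefix: str) -> str:
        if rng.random() < trigger_prob[prefix]:
            return "Wait, let me double-check that step."
        return "So the answer follows directly."

    return sample
```

In a real pipeline, `sample_fn` would call the language model with sampling enabled, and the hidden states at the retained boundaries would then be used to build the steering vector.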


Related papers