Fugu-MT Paper Translation (Abstract): Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

Paper summary: Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

arxiv url: http://arxiv.org/abs/2510.13554v1
Date: Wed, 15 Oct 2025 13:49:51 GMT
Status: Translation complete
Last updated in system: 2025-10-16 20:13:28.688253
Title: Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization
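Title (reference translation): Attention Illuminates LLM Reasoning: A Preplan-and-Anchor Rhythm That Enables Fine-Grained Policy Optimization
Authors: Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, Junchi Yan
Abstract summary: Reinforcement learning (RL) typically applies uniform credit across an entire generation of a large language model. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, as a mechanistic blueprint of reasoning itself. It introduces three novel RL strategies that dynamically perform targeted credit assignment to critical nodes.
Reference score (independently computed interest score): 56.083511902353365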
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The reasoning patterns of large language models (LLMs) remain opaque, and reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation, but as a mechanistic blueprint of reasoning itself. We first distinguish locally focused attention heads from globally focused ones and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by, or coincides with, a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable, structure-aware process, hoping to offer a potential step toward more transparent and effective optimization of LLM reasoning.
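The two metrics named in the abstract can be illustrated with a minimal NumPy sketch. The abstract gives only verbal definitions, so the exact formulas here are assumptions: Windowed Average Attention Distance is taken as the attention-weighted backward distance with each distance clipped at a window size, and Future Attention Influence as the plain mean of the attention a token receives from all later positions. The function names, the `window` parameter, and the single `(T, T)` causal attention matrix (one head, or an average over heads) are hypothetical choices for illustration, not the paper's implementation.

```python
import numpy as np


def windowed_average_attention_distance(attn: np.ndarray, window: int) -> np.ndarray:
    """Assumed form: WAAD[t] = sum_s attn[t, s] * min(t - s, window).

    attn is a (T, T) causal attention matrix whose row t holds the weights
    that query position t places on key positions s <= t.
    """
    T = attn.shape[0]
    pos = np.arange(T)
    dist = pos[:, None] - pos[None, :]      # dist[t, s] = t - s
    clipped = np.clip(dist, 0, window)      # cap backward distance at the window
    causal = np.tril(attn)                  # ignore any mass on future positions
    return (causal * clipped).sum(axis=1)


def future_attention_influence(attn: np.ndarray) -> np.ndarray:
    """Assumed form: FAI[s] = mean over t > s of attn[t, s].

    The last position has no later tokens and is assigned 0.
    """
    T = attn.shape[0]
    received = np.tril(attn, k=-1).sum(axis=0)   # attention from strictly later queries
    n_later = np.arange(T - 1, -1, -1)           # number of positions after s
    out = np.zeros(T)
    np.divide(received, n_later, out=out, where=n_later > 0)
    return out


# Toy input: uniform causal attention over 4 tokens (row t spreads 1/(t+1) over s <= t).
T = 4
uniform = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]
waad = windowed_average_attention_distance(uniform, window=2)
fai = future_attention_influence(uniform)
print("WAAD:", waad)
print("FAI: ", fai)
```

Under these assumed definitions, a high WAAD at a position indicates that the token was generated while looking far back in the context, and a high FAI marks a token that later positions keep attending to, which matches the abstract's description of preplan and anchor tokens respectively.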
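Abstract (reference translation): The reasoning patterns of large language models (LLMs) are still opaque, and reinforcement learning (RL) generally applies uniform credit across the whole generation, obscuring the difference between important steps and routine ones. This work treats attention as a privileged substrate that makes the internal logic of LLMs legible, not simply as a byproduct of computation but as a mechanistic blueprint of reasoning itself. It first separates attention heads into locally focused and globally focused information processing: locally focused heads produce a sawtooth pattern near the diagonal, whereas globally focused heads expose tokens that exert broad downstream influence on future tokens. These observations are formalized with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Together these signals show a recurring preplan-and-anchor mechanism in which the model first makes a long-range contextual reference to generate an introductory token, which is immediately followed by, or coincides with, a semantic anchor token that organizes the subsequent reasoning. Building on these findings, the work introduces three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and shows consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, the authors aim to turn opaque optimization into an actionable, structure-aware process, and hope to offer a potential step toward more transparent and effective optimization of LLM reasoning.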
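The abstract does not specify how the three RL strategies assign credit, so the following is only a generic illustration of what "targeted credit assignment" can look like at the token level, not the paper's method. It assumes a per-token advantage vector and a per-token importance score (for example, an influence score such as the one the abstract describes), and it amplifies the advantage on the highest-scoring tokens. The function name `reweight_advantages` and the parameters `top_fraction` and `boost` are hypothetical.

```python
import numpy as np


def reweight_advantages(advantages, scores, top_fraction=0.2, boost=1.5):
    """Scale the advantage of tokens whose score is in the top `top_fraction`.

    Hypothetical illustration only: tokens at or above the
    (1 - top_fraction) quantile of `scores` get their advantage
    multiplied by `boost`; all other tokens keep their advantage.
    """
    advantages = np.asarray(advantages, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if advantages.shape != scores.shape:
        raise ValueError("advantages and scores must have the same shape")
    threshold = np.quantile(scores, 1.0 - top_fraction)
    weights = np.where(scores >= threshold, boost, 1.0)
    return advantages * weights


# Uniform sequence-level credit, then extra weight on the single most influential token.
uniform_credit = np.ones(5)
token_scores = np.array([0.1, 0.9, 0.2, 0.3, 0.05])
shaped = reweight_advantages(uniform_credit, token_scores, top_fraction=0.2, boost=1.5)
print(shaped)
```

The design choice in this sketch is to leave the sign and baseline of the advantages untouched and change only their magnitude on selected tokens, so that setting `boost=1.0` recovers the uniform credit that the abstract contrasts against.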

Related papers