Fugu-MT Paper Translation (Abstract): Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning

Paper summary: Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning

arxiv url: http://arxiv.org/abs/2606.24064v1
Date: Tue, 23 Jun 2026 02:14:12 GMT
Status: Translation complete
Last updated in system: 2026-06-24 22:16:48.732529
Title: Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning
Title (reference translation): Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning
Authors: Tianyuan Shi, Canbin Huang, Bei Li, Xin Chen, Xiaojun Quan, Jingang Wang, Qifan Wang
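Abstract summary: Distilling reasoning capabilities from strong to weak language models typically imitates specific solution trajectories. This trajectory-level imitation encourages memorization of instance-specific steps rather than acquisition of transferable problem-solving skills. We propose Strategy-Guided Policy Optimization, which replaces instance-level trajectory imitation with reusable strategy distillation.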
Reference score (independently computed attention score): 76.93011742289768
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Distilling reasoning capabilities from strong to weak language models typically involves imitating specific solution trajectories, effectively transferring what to answer rather than how to reason. This trajectory-level imitation encourages memorization of instance-specific steps rather than acquisition of transferable problem-solving skills, limiting generalization to novel problems. We propose Strategy-Guided Policy Optimization (SGPO), which replaces instance-level trajectory imitation with reusable strategy distillation. SGPO extracts structured strategy descriptions from strong-model responses and, for each problem, constructs both autonomous and strategy-guided trajectories to enable direct comparison of the model's behavior with and without strategic guidance. The framework then addresses two key questions. For how to distill, a token-level forward-KL objective selectively transfers the distributional shift induced by strategy conditioning into the unguided policy, with proximal constraints ensuring stability. For when to distill, adaptive instance-level weighting strengthens guidance when autonomous exploration falls short and reduces it as the model's own competence grows. Experiments on four mathematical benchmarks across two model families show that SGPO consistently outperforms SFT, on-policy RL, and hybrid-policy baselines, improving the average score by 2.2 points over the strongest baseline on Qwen2.5-7B-Instruct. Analysis reveals that the forward-KL objective provides an inherently selective distillation signal that outperforms direct trajectory imitation, and that strategy distillation exhibits complementary scaling with base model capability.
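Abstract (reference translation): Distilling reasoning capabilities from strong to weak language models typically imitates specific solution trajectories, which in effect transfers what to answer rather than how to reason. This trajectory-level imitation encourages memorization of instance-specific steps over acquisition of transferable problem-solving skills, and so limits generalization to new problems. We propose Strategy-Guided Policy Optimization (SGPO), which replaces instance-level trajectory imitation with reusable strategy distillation. SGPO extracts structured strategy descriptions from strong-model responses and, for each problem, constructs both an autonomous trajectory and a strategy-guided trajectory, so that the model's behavior with and without strategic guidance can be compared directly. The framework then addresses two key questions. For how to distill, a token-level forward-KL objective selectively transfers the distributional shift induced by strategy conditioning into the unguided policy, with proximal constraints to ensure stability. For when to distill, adaptive instance-level weighting strengthens guidance when autonomous exploration falls short and reduces it as the model's own competence grows. Experiments on four mathematical benchmarks across two model families show that SGPO consistently outperforms SFT, on-policy RL, and hybrid-policy baselines, improving the average score by 2.2 points over the strongest baseline on Qwen2.5-7B-Instruct. Analysis shows that the forward-KL objective provides an inherently selective distillation signal that outperforms direct trajectory imitation, and that strategy distillation scales in a complementary way with base model capability.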
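The two components named in the abstract, a token-level forward-KL term and an adaptive instance-level weight, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the usual reading of "forward KL" as KL(guided || unguided), with the strategy-conditioned distribution as the target. The function names, the linear weighting schedule `1 - success_rate`, and the toy logits are all hypothetical, and the proximal constraints mentioned in the abstract are not modeled.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_forward_kl(guided_logits, unguided_logits):
    """KL(p_guided || p_unguided) at a single token position.

    Assumption: the strategy-conditioned distribution plays the role of
    the target and the unguided policy is the one being trained.
    """
    log_p = log_softmax(guided_logits)
    log_q = log_softmax(unguided_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def instance_weight(autonomous_success_rate):
    """Hypothetical schedule: more guidance when unguided rollouts fail more often."""
    if not 0.0 <= autonomous_success_rate <= 1.0:
        raise ValueError("success rate must lie in [0, 1]")
    return 1.0 - autonomous_success_rate


def distillation_loss(guided_seq, unguided_seq, autonomous_success_rate):
    """Instance-weighted mean of per-token forward KL over one sequence.

    Both arguments are lists of per-position logit vectors scored on the
    same token positions (which trajectory supplies those positions is
    not specified in the abstract and is left open here).
    """
    if not guided_seq or len(guided_seq) != len(unguided_seq):
        raise ValueError("sequences must be non-empty and of equal length")
    kls = [token_forward_kl(g, u) for g, u in zip(guided_seq, unguided_seq)]
    return instance_weight(autonomous_success_rate) * sum(kls) / len(kls)


# Toy usage with a 2-token vocabulary and 2 positions.
guided = [[0.0, 0.0], [1.0, 2.0]]
unguided = [[0.0, math.log(3.0)], [1.0, 2.0]]
loss_when_struggling = distillation_loss(guided, unguided, autonomous_success_rate=0.0)
loss_when_competent = distillation_loss(guided, unguided, autonomous_success_rate=1.0)
```

In this sketch the per-token KL vanishes wherever the guided and unguided distributions already agree, so only positions where strategy conditioning actually shifts the distribution contribute to the loss. The weight drives the whole term to zero once the unguided policy succeeds on its own.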

Related papers