Fugu-MT Paper Translation (Abstract): GPO: Learning from Critical Steps to Improve LLM Reasoning

Paper summary: GPO: Learning from Critical Steps to Improve LLM Reasoning

arxiv url: http://arxiv.org/abs/2509.16456v1
Date: Fri, 19 Sep 2025 22:30:23 GMT
Status: Translation complete
Last updated in system: 2025-09-23 18:58:15.801679
Title: GPO: Learning from Critical Steps to Improve LLM Reasoning
Authors: Jiahao Yu, Zelei Cheng, Xian Wu, Xinyu Xing
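Abstract summary: We introduce Guided Pivotal Optimization (GPO). We demonstrate that GPO is a general strategy that can be integrated with various optimization methods to improve reasoning performance.
Reference score (attention score computed independently by this site): 13.271737599933147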
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly used in various domains, showing impressive potential on different tasks. Recently, reasoning LLMs have been proposed to improve the reasoning or thinking capabilities of LLMs to solve complex problems. Despite the promising results of reasoning LLMs, enhancing the multi-step reasoning capabilities of LLMs remains a significant challenge. While existing optimization methods have advanced LLM reasoning capabilities, they often treat reasoning trajectories as a whole, without considering the underlying critical steps within the trajectory. In this paper, we introduce Guided Pivotal Optimization (GPO), a novel fine-tuning strategy that dives into the reasoning process to enable more effective improvements. GPO first identifies the 'critical step' within a reasoning trajectory: a point at which the model must proceed carefully to succeed at the problem. We locate the critical step by estimating the advantage function. GPO then resets the policy to the critical step, samples new rollouts from it, and prioritizes the learning process on those rollouts. This focus allows the model to learn more effectively from pivotal moments within the reasoning process and thereby improve its reasoning performance. We demonstrate that GPO is a general strategy that can be integrated with various optimization methods to improve reasoning performance. Besides theoretical analysis, our experiments across challenging reasoning benchmarks show that GPO can consistently and significantly enhance the performance of existing optimization methods, showcasing its effectiveness and generalizability in improving LLM reasoning by concentrating on pivotal moments within the generation process.
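The three stages named in the abstract (estimate advantages, locate the critical step, reset and resample) can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy policy `toy_rollout`, the reward `toy_reward`, the Monte Carlo value estimate, and in particular the selection rule (largest absolute estimated advantage), which the abstract does not specify. The final learning stage, where an existing optimization method would be applied to the new rollouts, is left out.

```python
import random
from typing import Callable, List, Sequence, Tuple

Trajectory = Tuple[str, ...]
HORIZON = 4  # hypothetical fixed number of reasoning steps in the toy task


def toy_rollout(prefix: Sequence[str], rng: random.Random) -> Trajectory:
    """Hypothetical policy: complete `prefix` up to HORIZON steps.

    Position 2 is deliberately hard (50% chance of a good step);
    every other position is easy (90%).
    """
    steps = list(prefix)
    for pos in range(len(steps), HORIZON):
        p_good = 0.5 if pos == 2 else 0.9
        steps.append("good" if rng.random() < p_good else "bad")
    return tuple(steps)


def toy_reward(trajectory: Trajectory) -> float:
    """Hypothetical outcome reward: 1 only if every step is good."""
    return 1.0 if all(step == "good" for step in trajectory) else 0.0


def estimate_value(prefix: Sequence[str],
                   rollout: Callable[[Sequence[str], random.Random], Trajectory],
                   reward: Callable[[Trajectory], float],
                   n_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of V(prefix): mean reward of sampled completions."""
    return sum(reward(rollout(prefix, rng)) for _ in range(n_samples)) / n_samples


def step_advantages(trajectory: Trajectory, rollout, reward,
                    n_samples: int, rng: random.Random) -> List[float]:
    """Advantage of step t, estimated as V(prefix incl. t) - V(prefix before t)."""
    values = [estimate_value(trajectory[:t], rollout, reward, n_samples, rng)
              for t in range(len(trajectory) + 1)]
    return [values[t + 1] - values[t] for t in range(len(trajectory))]


def find_critical_step(advantages: Sequence[float]) -> int:
    """Assumed rule: the step with the largest absolute estimated advantage."""
    return max(range(len(advantages)), key=lambda t: abs(advantages[t]))


def resample_from(trajectory: Trajectory, step: int, rollout,
                  n_rollouts: int, rng: random.Random) -> List[Trajectory]:
    """Reset to the state just before `step` and sample fresh completions."""
    prefix = trajectory[:step]
    return [rollout(prefix, rng) for _ in range(n_rollouts)]


rng = random.Random(0)
failed = ("good", "good", "bad", "good")  # a trajectory that fails at step 2
advantages = step_advantages(failed, toy_rollout, toy_reward, 500, rng)
critical = find_critical_step(advantages)
new_rollouts = resample_from(failed, critical, toy_rollout, 4, rng)
print("critical step index:", critical)
print("rollouts resampled from the critical step:", new_rollouts)
```

In this toy setup the resampled rollouts all share the prefix that precedes the critical step, so any training signal derived from them concentrates on the decision made at that point rather than on the whole trajectory.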
