Fugu-MT Paper Translation (Abstract): Gradient Extrapolation-Based Policy Optimization

Paper abstract: Gradient Extrapolation-Based Policy Optimization

arxiv url: http://arxiv.org/abs/2605.06755v1
Date: Thu, 07 May 2026 16:20:13 GMT
Status: translation complete
Last updated in system: 2026-05-11 19:43:38.513762
Title: Gradient Extrapolation-Based Policy Optimization
Authors: Ismam Nur Swapnil, Aranya Saha, Tanvir Ahmed Khan, Mohammad Ariful Haque, Ser-Nam Lim
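Abstract summary: GXPO approximates a longer local lookahead using only three backward passes during an active phase. GXPO takes two fast optimizer steps, measures how the gradients change, predicts a virtual K-step lookahead point, moves the policy partway toward that point, and then applies a corrective update. When the lookahead signal becomes unstable, GXPO automatically switches back to standard single-pass GRPO.
Reference score (site-computed attention score): 35.73727913372324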
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style training updates the model using only the current step, while full multi-step lookahead can give a better update direction but is too expensive because it needs many backward passes. We propose Gradient Extrapolation-Based Policy Optimization (GXPO), a plug-compatible policy-update rule for GRPO-style reasoning RL. GXPO approximates a longer local lookahead using only three backward passes during an active phase. It reuses the same batch of rollouts, rewards, advantages, and GRPO loss, so it does not require new rollouts or reward computation at the lookahead points. GXPO takes two fast optimizer steps, measures how the gradients change, predicts a virtual K-step lookahead point, moves the policy partway toward that point, and then applies a corrective update using the true gradient at the new position. When the lookahead signal becomes unstable, GXPO automatically switches back to standard single-pass GRPO. We also give a plain-gradient-descent surrogate analysis that explains when the extrapolation is exact and where its local errors come from. Across Qwen2.5 and Llama math-reasoning experiments, GXPO improves the average sampled pass@1 by +1.65 to +5.00 points over GRPO and by +0.14 to +1.28 points over the strongest SFPO setting, while keeping the active-phase cost fixed at three backward passes. It also achieves up to 4.00x step speedup, 2.33x wall-clock speedup, and 1.33x backward-pass speedup in reaching GRPO's peak accuracy.
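The update described in the abstract can be illustrated with a toy sketch. The abstract does not give GXPO's formulas, so everything below is an assumption made for illustration only: plain gradient descent stands in for the optimizer, the gradient change is summarized by a single ratio r = <g1, g0> / <g0, g0>, the virtual K-step point is the geometric-series extrapolation theta0 - lr * (1 + r + ... + r^(K-1)) * g0, the partial move starts from the point reached after the two fast steps, and the fallback simply returns the ordinary one-step update. The names gxpo_like_step, alpha, r_min, and r_max are hypothetical. On a quadratic loss with an isotropic Hessian, this particular extrapolation reproduces K real gradient-descent steps exactly, which is one simple case where a lookahead of this kind has no error.

```python
# Toy sketch of a gradient-extrapolation update on plain Python lists.
# NOT the paper's algorithm: the ratio estimate, the geometric extrapolation,
# and the fallback rule are illustrative assumptions.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def axpy(a, x, y):
    """Return a * x + y, elementwise."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def geometric_sum(r, k):
    """Return 1 + r + ... + r**(k - 1)."""
    if abs(1.0 - r) < 1e-12:
        return float(k)
    return (1.0 - r ** k) / (1.0 - r)

def gxpo_like_step(theta, grad_fn, lr, k=8, alpha=0.5, r_min=0.0, r_max=1.0):
    """One extrapolated update using at most three gradient evaluations.

    grad_fn plays the role of a backward pass on a fixed batch.
    Returns (new_theta, mode), where mode is "extrapolated", "fallback",
    or "stationary".
    """
    g0 = grad_fn(theta)                  # backward pass 1
    theta1 = axpy(-lr, g0, theta)        # fast step 1
    g1 = grad_fn(theta1)                 # backward pass 2
    theta2 = axpy(-lr, g1, theta1)       # fast step 2

    g0_sq = dot(g0, g0)
    if g0_sq == 0.0:
        return list(theta), "stationary"

    # Assumed stability signal: how much of g0 survives in g1.
    r = dot(g1, g0) / g0_sq
    if not (r_min <= r <= r_max):
        # Signal judged unstable: keep only the ordinary single step.
        return theta1, "fallback"

    # Virtual K-step point under the model g_j = r**j * g0.
    theta_k = axpy(-lr * geometric_sum(r, k), g0, theta)

    # Move partway from the two-step point toward the virtual point.
    theta_mid = [t2 + alpha * (tk - t2) for t2, tk in zip(theta2, theta_k)]

    g_mid = grad_fn(theta_mid)           # backward pass 3 (true gradient)
    return axpy(-lr, g_mid, theta_mid), "extrapolated"

if __name__ == "__main__":
    # Quadratic loss 0.5 * h * ||theta||^2, whose gradient is h * theta.
    h = 2.0
    new_theta, mode = gxpo_like_step([1.0, -3.0], lambda th: [h * t for t in th], lr=0.1)
    print(mode, new_theta)
```

With alpha=1.0 on the quadratic above, the result matches K + 1 ordinary gradient-descent steps while evaluating the gradient only three times. On a non-quadratic loss the single-ratio model is only a local approximation, which is why the sketch includes a range check on r before trusting the extrapolation.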


Related papers