Fugu-MT Paper Translation (Summary): FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum

Paper summary: FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum

arxiv url: http://arxiv.org/abs/2605.11403v1
Date: Tue, 12 May 2026 01:48:48 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.517879
Title: FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum
Title (reference translation): FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and a Gaussian Curriculum
Authors: Mingxiong Lin, Zhangquan Gong, Maowen Tang, Qian Li, Chuangchuang Wang, Jian Ma, Sutian Huang, Kai Tang, Haonan Lu
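Abstract summary: FG-ExPO is short for Frontier-Guided Exploration-Prioritized Policy Optimization. Accuracy-Conditioned KL Scaling (AKL) adjusts the KL penalty strength through a smooth nonlinear function of batch average accuracy. The authors evaluate DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base on six mainstream mathematical reasoning benchmarks.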
Reference score (independently computed attention score): 11.537163059885687
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked inefficiencies inherent in GRPO. First, a fixed KL coefficient overly restricts policy exploration at moments when the model needs to diverge significantly from the reference policy. Second, uniform question sampling overlooks that moderately difficult problems produce the most informative gradient signals. We propose FG-ExPO, short for Frontier-Guided Exploration-Prioritized Policy Optimization, which integrates two lightweight components. Accuracy-Conditioned KL Scaling (AKL) adjusts the KL penalty strength through a smooth nonlinear function of batch average accuracy, loosening the constraint when the model performs poorly and strengthening it when the model achieves satisfactory results. Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at a moderate accuracy level around 0.5, focusing model training on its learning frontier. We conduct evaluations on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base across six mainstream mathematical reasoning benchmarks. Experimental results demonstrate that FG-ExPO consistently outperforms vanilla GRPO. It delivers an absolute improvement of 13.34 on the AIME 2025 pass@32 metric, rising from 63.33 percent to 76.67 percent, and obtains an average pass@32 gain of 2.66 on the 8B model. The substantially larger performance gains observed on pass@32 compared to pass@1 verify that FG-ExPO enlarges the model's effective exploration space under a fixed inference budget.
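Abstract (reference translation): Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, and Group Relative Policy Optimization (GRPO) is its dominant algorithm. We identify two overlooked inefficiencies inherent in GRPO. First, a fixed KL coefficient overly restricts policy exploration at moments when the model needs to diverge significantly from the reference policy. Second, uniform question sampling ignores the fact that moderately difficult problems yield the most informative gradient signals. We propose FG-ExPO (Frontier-Guided Exploration-Prioritized Policy Optimization), which integrates two lightweight components. Accuracy-Conditioned KL Scaling (AKL) adjusts the KL penalty strength through a smooth nonlinear function of batch average accuracy, loosening the constraint when the model performs poorly and tightening it when the model achieves satisfactory results. Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at a moderate accuracy level of around 0.5, focusing training on the model's learning frontier. We evaluate DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base on six mainstream mathematical reasoning benchmarks. FG-ExPO consistently outperforms vanilla GRPO. On AIME 2025 pass@32 it delivers an absolute improvement of 13.34, rising from 63.33% to 76.67%, and on the 8B model it obtains an average pass@32 gain of 2.66. The substantially larger gains on pass@32 than on pass@1 verify that FG-ExPO enlarges the model's effective exploration space under a fixed inference budget.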
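The abstract describes AKL and GCS only qualitatively, so the following Python sketch is one possible reading, not the authors' implementation. The logistic shape of the KL schedule, the function names, and every numeric default (`beta_min`, `beta_max`, `midpoint`, `sharpness`, `sigma`) are assumptions chosen for illustration; only the direction of the KL adjustment and the Gaussian centered near 0.5 come from the abstract.

```python
import math
import random


def akl_coefficient(batch_accuracy, beta_min=0.001, beta_max=0.04,
                    midpoint=0.5, sharpness=10.0):
    """Hypothetical AKL schedule (assumed logistic form, assumed defaults).

    Returns a KL coefficient that is small when batch accuracy is low
    (looser constraint) and large when batch accuracy is high (tighter
    constraint), varying smoothly in between.
    """
    gate = 1.0 / (1.0 + math.exp(-sharpness * (batch_accuracy - midpoint)))
    return beta_min + (beta_max - beta_min) * gate


def gcs_weights(question_accuracies, center=0.5, sigma=0.2):
    """Hypothetical GCS weights: a Gaussian in per-question accuracy.

    `center` follows the abstract's "around 0.5"; `sigma` is an assumption.
    The weights are normalized so that they sum to 1.
    """
    raw = [math.exp(-((a - center) ** 2) / (2.0 * sigma ** 2))
           for a in question_accuracies]
    total = sum(raw)
    return [w / total for w in raw]


def sample_questions(question_ids, question_accuracies, k, rng=None):
    """Draw k question ids with replacement, proportional to the GCS weights."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to keep the sketch reproducible
    weights = gcs_weights(question_accuracies)
    return rng.choices(question_ids, weights=weights, k=k)


if __name__ == "__main__":
    # Lower batch accuracy gives a smaller KL coefficient than higher accuracy.
    print(akl_coefficient(0.1), akl_coefficient(0.9))
    # Questions solved about half the time get the largest sampling weight.
    print(gcs_weights([0.0, 0.5, 1.0]))
    print(sample_questions(["q1", "q2", "q3"], [0.0, 0.5, 1.0], k=5))
```

In a GRPO-style training loop, a schedule like `akl_coefficient` would be re-evaluated for each batch from that batch's mean accuracy, and per-question accuracies for `gcs_weights` would have to be estimated from earlier rollouts; how the paper actually tracks these quantities is not stated in the abstract.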

Related papers