Fugu-MT Paper Translation (Abstract): ExGRPO: Learning to Reason from Experience

Paper overview: ExGRPO: Learning to Reason from Experience

arxiv url: http://arxiv.org/abs/2510.02245v1
Date: Thu, 02 Oct 2025 17:31:30 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:21.254257
Title: ExGRPO: Learning to Reason from Experience
Title (reference translation): ExGRPO: Learning to Reason from Experience
Authors: Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng
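Abstract summary: Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. Standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. This paper investigates what makes a reasoning experience valuable and identifies rollout correctness and entropy as effective indicators of experience value.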
Reference score (independently computed attention score): 82.83309610498446
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
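Abstract (reference translation): Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. Prior work on RL has highlighted the benefits of reusing past experience, but the role of experience characteristics in shaping the learning dynamics of large reasoning models remains underexplored. This paper is the first to investigate what makes a reasoning experience valuable, and it identifies rollout correctness and entropy as effective indicators of experience value. Based on these insights, the authors propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.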

Related papers