Fugu-MT Paper Translation (Abstract): GRPO-$λ$: Credit Assignment improves LLM Reasoning

Paper overview: GRPO-$λ$: Credit Assignment improves LLM Reasoning

arxiv url: http://arxiv.org/abs/2510.00194v1
Date: Tue, 30 Sep 2025 19:11:10 GMT
Status: translation complete
Last updated in system: 2025-10-03 16:59:20.221552
Title: GRPO-$λ$: Credit Assignment improves LLM Reasoning
Title (reference translation): GRPO-$λ$: Credit Assignment Improved LLM Reasoning
Authors: Prasanna Parthasarathi, Mathieu Reymond, Boxing Chen, Yufei Cui, Sarath Chandar
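Abstract summary: GRPO-$\lambda$ is a novel extension to GRPO that enhances credit assignment in RL finetuning of LLMs for complex reasoning tasks. GRPO-$\lambda$ is compared against GRPO by training models from 1.5B to 7B parameters on $4$ different math reasoning datasets. With GRPO-$\lambda$, the average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by more than $3$ points, and by $4.5$ points on the 7B model.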
Reference score (independently computed attention level): 35.452488047246646
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly deployed for tasks requiring complex reasoning, prompting significant interest in improving their reasoning abilities through post-training. In particular, RL-based methods using verifiable rewards, like the state-of-the-art GRPO, have been shown to tremendously improve reasoning behaviors when applied as post-training methods. However, the lack of an explicit reward or critic model limits GRPO's ability to assign fine-grained credit across token sequences. In this work, we present GRPO-$\lambda$, a novel extension to GRPO that enhances credit assignment in RL finetuning of LLMs for complex reasoning tasks. We approximate learning from the $\lambda$-return with a reformulation of eligibility traces using token-level log-probabilities applied after each sequence generation, and a novel critic-free approximation of the temporal-difference error. We introduce a few variations for the weighting of the $\lambda$-return, and their applications to the eligibility trace, where all the variations provide significant gains over GRPO. We compare GRPO-$\lambda$ against GRPO by training models from 1.5B to 7B parameters on $4$ different math reasoning datasets. The training plots demonstrate 30-40% improved performance during RL training on both LLaMA-3.1 and Qwen-2.5 architectures. Finally, we show that with GRPO-$\lambda$, the resulting average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by over $3$ points, with a $4.5$-point improvement on the 7B model.
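Abstract (reference translation): Large language models (LLMs) are increasingly deployed for tasks that require complex reasoning, and there is great interest in improving their reasoning abilities through post-training. In particular, RL methods using verifiable rewards, such as the state-of-the-art GRPO, have been shown to markedly improve reasoning behavior when applied as post-training methods. However, because there is no explicit reward or critic model, GRPO is limited in its ability to assign fine-grained credit across token sequences. This paper proposes GRPO-$\lambda$, a novel extension to GRPO that enhances credit assignment in RL finetuning of LLMs for complex reasoning tasks. We approximate learning from the $\lambda$-return using a reformulation of eligibility traces based on token-level log-probabilities applied after each sequence generation, together with a novel critic-free approximation of the temporal-difference error. We introduce several variations for weighting the $\lambda$-return and for applying them to the eligibility trace, all of which yield significant gains over GRPO. We compare GRPO-$\lambda$ with GRPO by training models from 1.5B to 7B parameters on $4$ different math reasoning datasets. The training plots show 30-40% improved performance during RL training on both LLaMA-3.1 and Qwen-2.5 architectures. Finally, we show that with GRPO-$\lambda$ the average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by more than $3$ points, and by $4.5$ points on the 7B model.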
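Illustrative sketch (not taken from the paper): the abstract names two ingredients, a group-relative advantage as in GRPO and an eligibility-trace style weighting over tokens, but gives no formulas. The Python below is a minimal illustration under stated assumptions, not the authors' formulation. It assumes a single scalar reward per sampled response, advantages standardized within the group using the population standard deviation, and a textbook accumulating trace in which the only learning signal arrives at the final token, so that token k of a T-token response receives weight (gamma * lambda)^(T-1-k). The function names and default values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward within its sampled group (GRPO-style).

    Assumption: population standard deviation; eps avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def trace_weights(num_tokens, gamma=1.0, lam=0.9):
    """Per-token weights from a textbook accumulating eligibility trace.

    Assumption: the only nonzero signal arrives at the last token, so
    token k is weighted by (gamma * lam) ** (num_tokens - 1 - k).
    """
    decay = gamma * lam
    return [decay ** (num_tokens - 1 - k) for k in range(num_tokens)]


def token_level_credit(rewards, lengths, gamma=1.0, lam=0.9):
    """Spread each response's group-relative advantage over its tokens."""
    advantages = group_relative_advantages(rewards)
    return [
        [adv * w for w in trace_weights(n, gamma, lam)]
        for adv, n in zip(advantages, lengths)
    ]


if __name__ == "__main__":
    # Two sampled responses to one prompt: the first is correct (reward 1),
    # the second is wrong (reward 0); they have 3 and 2 tokens respectively.
    for row in token_level_credit([1.0, 0.0], [3, 2], gamma=1.0, lam=0.5):
        print(row)
```

With gamma = 1 and lambda = 1 every token in a response receives the same weight, which matches applying one sequence-level advantage uniformly across tokens. Smaller values of lambda concentrate credit on tokens near the end of the response. The paper's critic-free temporal-difference approximation and its weighting variations are not reproduced here.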


Related papers