Fugu-MT Paper Translation (Abstract): Are complicated loss functions necessary for teaching LLMs to reason?

Paper summary: Are complicated loss functions necessary for teaching LLMs to reason?

arxiv url: http://arxiv.org/abs/2603.18756v1
Date: Thu, 19 Mar 2026 11:06:49 GMT
Status: Translation complete
System update date: 2026-03-20 17:19:06.100245
Title: Are complicated loss functions necessary for teaching LLMs to reason?
Title (reference translation): Are complex loss functions needed to teach LLMs to reason?
Authors: Gabriele Carrino, Andrea Sassella, Nicolo Brunello, Federico Toschi, Mark James Carman
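Abstract summary: Group Relative Policy Optimization (GRPO) has shown promise for large language models (LLMs). REINFORCE with Group Relative Advantage (RGRA) is a simplified variant that retains group-relative advantage estimation but removes PPO-style clipping and policy ratio terms. The results suggest that simpler REINFORCE-based approaches can effectively enhance reasoning in LLMs, offering a more transparent and efficient alternative to GRPO.
Reference score (independently computed attention score): 0.16383644639245779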
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large language models (LLMs) highlight the importance of post-training techniques for improving reasoning and mathematical ability. Group Relative Policy Optimization (GRPO) has shown promise in this domain by combining group-relative advantage estimation, PPO-style clipping, and KL regularization. However, its complexity raises the question of whether all components are necessary for fostering reasoning behaviors. We conduct a systematic analysis of GRPO and identify two key findings: (1) incorporating negative feedback is essential, since training solely on actions above a baseline limits learning; and (2) PPO-style constraints, such as policy ratio clipping, are not required to improve mathematical reasoning or performance. Building on these insights, we propose REINFORCE with Group Relative Advantage (RGRA), a simplified variant that retains group-relative advantage estimation but removes PPO-style clipping and policy ratio terms. Experiments across standard mathematical benchmarks indicate that RGRA has the potential to achieve stronger performance than GRPO. Our results suggest that simpler REINFORCE-based approaches can effectively enhance reasoning in LLMs, offering a more transparent and efficient alternative to GRPO.
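Abstract (reference translation): Recent progress in large language models (LLMs) underscores the importance of post-training techniques for improving reasoning and mathematical ability. Group Relative Policy Optimization (GRPO) has shown promise in this area by combining group-relative advantage estimation, PPO-style clipping, and KL regularization. Its complexity, however, raises the question of whether every component is needed to foster reasoning behavior. We carry out a systematic analysis of GRPO and identify two key findings: (1) incorporating negative feedback is essential, because training only on actions above a baseline limits learning; and (2) PPO-style constraints such as policy ratio clipping are not required to improve mathematical reasoning or performance. Based on these findings, we propose REINFORCE with Group Relative Advantage (RGRA), a simplified variant that keeps group-relative advantage estimation but removes the PPO-style clipping and policy ratio terms. Experiments on standard mathematical benchmarks indicate that RGRA has the potential to outperform GRPO. These results suggest that simpler REINFORCE-based approaches can effectively enhance reasoning in LLMs and offer a more transparent and efficient alternative to GRPO.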
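
Illustrative sketch (not from the paper): the abstract names the ingredients of GRPO and RGRA but gives no formulas. Assuming sequence-level log-probabilities and rewards normalized by the group mean and standard deviation (both are assumptions), the contrast between a GRPO-style clipped objective and an RGRA-style REINFORCE objective might look roughly like the following. All function names are hypothetical, and KL regularization is left out for brevity.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean/std of its own group.

    Assumption: mean/std normalization; the abstract only says
    "group relative advantage estimation".
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rgra_style_loss(seq_logprobs, rewards):
    """REINFORCE-style loss: -mean_i(A_i * log pi(o_i | q)).

    No policy ratio and no clipping. Negative advantages are kept, so
    below-average completions have their log-probability pushed down.
    """
    advantages = group_relative_advantages(rewards)
    terms = [a * lp for a, lp in zip(advantages, seq_logprobs)]
    return -mean(terms)


def grpo_style_loss(seq_logprobs, old_seq_logprobs, rewards, clip_eps=0.2):
    """PPO-style clipped surrogate with group-relative advantages."""
    advantages = group_relative_advantages(rewards)
    terms = []
    for a, lp, old_lp in zip(advantages, seq_logprobs, old_seq_logprobs):
        ratio = math.exp(lp - old_lp)  # policy ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * a, clipped * a))
    return -mean(terms)


if __name__ == "__main__":
    # One prompt, four sampled completions: two rewarded, two not.
    rewards = [1.0, 0.0, 0.0, 1.0]
    logprobs = [-1.0, -2.0, -2.0, -1.0]
    print(rgra_style_loss(logprobs, rewards))
    print(grpo_style_loss(logprobs, logprobs, rewards))
```

In this sketch the only difference between the two losses is the ratio-and-clip step; the advantage computation is shared, which mirrors the abstract's description of what RGRA retains and what it removes.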


Related papers