Fugu-MT Paper Translation (Abstract): Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning

Paper overview: Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning

arxiv url: http://arxiv.org/abs/2602.01791v1
Date: Mon, 02 Feb 2026 08:13:13 GMT
Status: translation complete
Last updated in system: 2026-02-03 19:28:34.005771
Title: Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning
Authors: Zheng Zhang, Ao Lu, Yuanhao Zeng, Ziwei Shan, Jinjin Guo, Lufei Li, Yexin Li, Kan Ren
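Abstract summary: Grad2Reward extracts dense process rewards directly from the Judge's model inference process via a single backward pass. By leveraging gradient-based attribution, Grad2Reward enables precise token-level credit assignment. Policies optimized with Grad2Reward achieve outstanding performance across diverse open-ended tasks.
Reference score (independently computed attention metric): 18.80588864499134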
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant breakthroughs in complex LLM reasoning within verifiable domains, such as mathematics and programming. Recent efforts have sought to extend this paradigm to open-ended tasks by employing LLMs-as-a-Judge to provide sequence-level rewards for policy optimization. However, these rewards are inherently sparse, failing to provide the fine-grained supervision necessary for generating complex, long-form trajectories. Furthermore, current work treats the Judge as a black-box oracle, discarding the rich intermediate feedback signals encoded in it. To address these limitations, we introduce Grad2Reward, a novel framework that extracts dense process rewards directly from the Judge's model inference process via a single backward pass. By leveraging gradient-based attribution, Grad2Reward enables precise token-level credit assignment, substantially enhancing training efficiency and reasoning quality. Additionally, Grad2Reward introduces a self-judging mechanism, allowing the policy to improve through its own evaluative signals without training specialized reward models or reliance on superior external Judges. The experiments demonstrate that policies optimized with Grad2Reward achieve outstanding performance across diverse open-ended tasks, affirming its effectiveness and broad generalizability.
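
The abstract does not give the attribution formula, so the following is only a minimal sketch of the general idea of turning one sequence-level judge score into per-token rewards through gradients. It assumes a toy differentiable judge (tanh of a weight vector dotted with the mean token embedding) and uses gradient-times-input attribution, which is one common gradient-based attribution and not necessarily what Grad2Reward uses. The names `judge_score`, `token_attributions`, and `dense_rewards`, and the choice to normalize by absolute attribution, are hypothetical.

```python
import math


def judge_score(embeddings, w):
    """Toy judge (assumption): tanh of w dotted with the mean token embedding."""
    n_tokens = len(embeddings)
    mean = [sum(e[j] for e in embeddings) / n_tokens for j in range(len(w))]
    return math.tanh(sum(wj * mj for wj, mj in zip(w, mean)))


def token_attributions(embeddings, w):
    """Gradient-times-input attribution for every token.

    For this toy judge the gradient is analytic:
    d(score)/d(e_t) = (1 - score^2) * w / n_tokens,
    so a single gradient computation covers all tokens at once.
    """
    score = judge_score(embeddings, w)
    scale = (1.0 - score * score) / len(embeddings)
    return [scale * sum(wj * ej for wj, ej in zip(w, e)) for e in embeddings]


def dense_rewards(embeddings, w):
    """Split the sequence-level score across tokens.

    Assumed normalization: each token receives a share of the score
    proportional to the magnitude of its attribution.
    """
    score = judge_score(embeddings, w)
    attr = token_attributions(embeddings, w)
    total = sum(abs(a) for a in attr)
    if total == 0.0:
        # No attribution signal: fall back to a uniform split.
        return [score / len(attr)] * len(attr)
    return [score * abs(a) / total for a in attr]


# Hypothetical three-token response with 2-dimensional embeddings.
example_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
example_weights = [1.0, 0.5]
example_rewards = dense_rewards(example_embeddings, example_weights)
```

In this sketch the per-token rewards sum to the sequence-level score, so the dense signal redistributes the judge's verdict rather than changing its total. A real judge would be an LLM, and the gradients would come from automatic differentiation rather than a closed form.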

Related papers