Fugu-MT Paper Translation (Abstract): Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks

Paper summary: Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks

arxiv url: http://arxiv.org/abs/2509.25598v1
Date: Mon, 29 Sep 2025 23:44:55 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.366526
Title: Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks
Authors: Peiran Xu, Zhuohao Li, Xiaoying Xing, Guannan Zhang, Debiao Li, Kunyu Shi
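Abstract summary: We introduce Principle Process Reward (PPR), an RL approach that unifies step-level assessment and outcome verification. PPR achieves state-of-the-art performance across a wide range of benchmarks, demonstrating its notable robustness and generalization.
Reference score (independently computed attention score): 12.31210445905605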
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) increasingly rely on external tools such as search engines to solve complex agentic tasks that require reasoning and external knowledge retrieval. Recently, reinforcement learning with verifiable rewards (RLVR) has proven effective at advancing LLM capabilities by rewarding final answers through outcome rewards. While straightforward to supervise, outcome rewards provide only sparse signals and delayed feedback, which limits their effectiveness on long trajectories. Process rewards address this by evaluating intermediate steps, providing fine-grained supervision and encouraging grounded problem solving. However, step-wise labels are notoriously hard to annotate, especially for non-verifiable processes without "golden" answers. Furthermore, step-wise judgment must balance local quality against contribution to the final outcome, since optimizing for higher process reward does not always lead to better final outcomes. To address these challenges, we introduce Principle Process Reward (PPR), an RL approach that unifies principled step-level assessment and outcome verification. We train a principle-based reward model to improve the transparency and reliability of process evaluation, and further introduce a Reward Normalization (ReNorm) strategy to calibrate outcome and process rewards. Experimental results show that PPR achieves state-of-the-art performance across a wide range of benchmarks, demonstrating its impressive robustness and generalization. Our code and model collection are available at this link.
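The contrast the abstract draws between sparse outcome rewards and calibrated process rewards can be illustrated with a small sketch. This is not the paper's ReNorm formula, which the abstract does not state. The function names, the assumption that process scores lie in [0, 1] and the outcome is 0 or 1, and the shift-by-outcome rule are all hypothetical choices made for illustration.

```python
from typing import List


def outcome_only_rewards(num_steps: int, outcome: float) -> List[float]:
    """Sparse signal: every step gets 0 except the last, which gets the outcome."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [0.0] * (num_steps - 1) + [outcome]


def hybrid_rewards(process_scores: List[float], outcome: float) -> List[float]:
    """Hypothetical calibration (an assumption, not the paper's method).

    Each process score in [0, 1] is shifted by (outcome - 1), with outcome
    in {0, 1}. Steps of a successful trajectory keep rewards in [0, 1];
    steps of a failed trajectory are pushed into [-1, 0], so a locally
    good step cannot earn positive reward when the final answer is wrong.
    """
    return [score + outcome - 1.0 for score in process_scores]


if __name__ == "__main__":
    scores = [0.5, 0.25, 1.0]  # made-up step-level scores
    print(outcome_only_rewards(len(scores), 1.0))
    print(hybrid_rewards(scores, 1.0))
    print(hybrid_rewards(scores, 0.0))
```

Under these assumptions every step receives a signal, and the sign of that signal always agrees with the final outcome, which is one simple way to keep process rewards from pulling against outcome rewards.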

Related papers