Fugu-MT paper translation (abstract): AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress

Paper overview: AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress

arxiv url: http://arxiv.org/abs/2511.08325v1
Date: Wed, 12 Nov 2025 01:53:13 GMT
Status: Translation complete
System update date: 2025-11-12 20:17:03.762208
Title: AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress
Title (reference translation): AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress
Authors: Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen, Zhihao Zhang, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, Wei Wu, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
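Abstract summary: Large language models (LLMs) still face challenges in multi-turn decision-making tasks. The authors build process reward models (PRMs) that evaluate each decision and guide the agent's decision-making process. AgentPRM captures both the interdependence between sequential decisions and their contribution to the final goal.
Reference score (independently computed attention score): 71.02263260394261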
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work for LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. Unlike LLM reasoning, where each step is scored based on correctness, actions in agent tasks do not have a clear-cut correctness. Instead, they should be evaluated based on their proximity to the goal and the progress they have made. Building on this insight, we propose a re-defined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference-based (TD-based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over $8\times$ more compute-efficient than baselines, and it demonstrates robust improvement when scaling up test-time compute. Moreover, we perform detailed analyses to show how our method works and offer more insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.
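Abstract (reference translation): Despite rapid development, large language models (LLMs) still face challenges in multi-turn decision-making tasks (agent tasks) such as web shopping and browser navigation, which require a sequence of intelligent decisions based on environmental feedback. Previous work on LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. This work takes a different perspective: it explores building process reward models (PRMs) that evaluate each decision and guide the agent's decision-making process. In LLM reasoning, each step is scored on correctness, but actions in agent tasks have no clear-cut correctness. Instead, they should be evaluated by their proximity to the goal and the progress they have made. Building on this insight, the authors propose a re-defined PRM for agent tasks, named AgentPRM, that captures both the interdependence between sequential decisions and their contribution to the final goal. This improves progress tracking and the exploration-exploitation balance. To obtain labeled data for training AgentPRM at scale, they combine a Temporal Difference-based (TD-based) estimation method with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over $8\times$ more compute-efficient than baselines and improves robustly as test-time compute is scaled up. The authors also provide detailed analyses of how the method works and offer further insights, for example applying AgentPRM to the reinforcement learning of LLM agents.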
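
The abstract names TD-based estimation combined with GAE but gives no formulas. The following is a minimal sketch of the standard GAE recursion only, not the paper's implementation. The function name `gae_advantages`, the sparse terminal reward, the constant value estimates, and the reading of the lambda-return as a "promise"-like label and the advantage as a "progress"-like label are all assumptions made for illustration.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Standard GAE over one finished trajectory.

    rewards[t] is the reward received after step t.
    values[t] is a value estimate for the state before step t.
    The value after the last step is taken to be 0 (episode ended).
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD residuals
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical 3-step episode: success is only rewarded at the end.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]  # assumed value estimates, not real model output

adv = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
# Assumed labeling: lambda-return as a per-step target, advantage as its change signal
step_targets = [v + a for v, a in zip(values, adv)]
```

With `lam=1.0` the targets reduce to the Monte Carlo return, and with `lam=0.0` each advantage is just the one-step TD residual; intermediate values trade bias against variance.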

Related papers