Fugu-MT Paper Translation (Abstract): AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Paper summary: AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

arxiv url: http://arxiv.org/abs/2603.14465v1
Date: Sun, 15 Mar 2026 16:13:58 GMT
Status: translation complete
Last updated in system: 2026-03-17 16:19:35.82473
Title: AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
Title (reference translation): AgentProcessBench: Step-Level Process Quality Diagnosis for Tool-Using Agents
Authors: Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen, Xin Cong, Yankai Lin
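Abstract summary: We introduce AgentProcessBench, the first benchmark for evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity.
Reference score (attention score computed independently by this site): 50.481033105867205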
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
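Abstract (reference translation): Large Language Models (LLMs) have evolved into tool-using agents, but they remain brittle in long-horizon interactions. Unlike mathematical reasoning, where errors can often be corrected by backtracking, tool-use failures frequently cause irreversible side effects, which makes accurate step-level verification critical. However, existing process-level benchmarks are mostly confined to closed-world mathematical domains and do not capture the dynamic, open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal three key insights: (1) weaker policy models show inflated ratios of correct steps because they terminate early; (2) distinguishing neutral actions from erroneous ones remains a significant challenge for current models; and (3) process-derived signals complement outcome supervision and significantly improve test-time scaling. We hope AgentProcessBench will encourage future research on reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.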

Related papers