Fugu-MT paper translation (abstract): Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

Paper overview: Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

arxiv url: http://arxiv.org/abs/2604.08178v1
Date: Thu, 09 Apr 2026 12:35:06 GMT
Status: translation complete
Last updated in system: 2026-04-10 18:34:05.917522
Title: Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
Title (reference translation): Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
Authors: Jiaxuan Wang, Yulan Hu, Wenjin Yang, Zheng Pan, Xin Li, Lan-Zhe Guo
Abstract summary: Plan-RewardBench is a trajectory-level preference benchmark designed to evaluate how well judges distinguish preferred from distractor agent trajectories. It covers four representative task families: (i) Safety Refusal, (ii) Tool-Irrelevance / Unavailability, (iii) Complex Planning, and (iv) Robust Error Recovery.
Reference score (independently computed attention score): 19.766968596602457
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In classical Reinforcement Learning from Human Feedback (RLHF), Reward Models (RMs) serve as the fundamental signal provider for model alignment. As Large Language Models evolve into agentic systems capable of autonomous tool invocation and complex reasoning, the paradigm of reward modeling faces unprecedented challenges--most notably, the lack of benchmarks specifically designed to assess RM capabilities within tool-integrated environments. To address this gap, we present Plan-RewardBench, a trajectory-level preference benchmark designed to evaluate how well judges distinguish preferred versus distractor agent trajectories in complex tool-using scenarios. Plan-RewardBench covers four representative task families -- (i) Safety Refusal, (ii) Tool-Irrelevance / Unavailability, (iii) Complex Planning, and (iv) Robust Error Recovery -- comprising validated positive trajectories and confusable hard negatives constructed via multi-model natural rollouts, rule-based perturbations, and minimal-edit LLM perturbations. We benchmark representative RMs (generative, discriminative, and LLM-as-Judge) under a unified pairwise protocol, reporting accuracy trends across varying trajectory lengths and task categories. Furthermore, we provide diagnostic analyses of prevalent failure modes. Our results reveal that all three evaluator families face substantial challenges, with performance degrading sharply on long-horizon trajectories, underscoring the necessity for specialized training in agentic, trajectory-level reward modeling. Ultimately, Plan-RewardBench aims to serve as both a practical evaluation suite and a reusable blueprint for constructing agentic planning preference data.
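Abstract (reference translation): In classical Reinforcement Learning from Human Feedback (RLHF), Reward Models (RMs) serve as the basic signal provider for model alignment. As large language models evolve into agentic systems capable of autonomous tool invocation and complex reasoning, the reward-modeling paradigm faces unprecedented challenges, most notably the lack of benchmarks designed to assess RM capabilities in tool-integrated environments. To address this gap, we propose Plan-RewardBench, a trajectory-level preference benchmark designed to evaluate how well judges distinguish preferred from distractor agent trajectories in complex tool-using scenarios. Plan-RewardBench covers four representative task families: (i) Safety Refusal, (ii) Tool-Irrelevance / Unavailability, (iii) Complex Planning, and (iv) Robust Error Recovery. It comprises validated positive trajectories and confusable hard negatives built from multi-model natural rollouts, rule-based perturbations, and minimal-edit LLM perturbations. We benchmark representative RMs (generative, discriminative, and LLM-as-Judge) under a unified pairwise protocol and report accuracy trends across trajectory lengths and task categories. We also provide diagnostic analyses of prevalent failure modes. The results show that all three evaluator families face substantial challenges, with performance degrading sharply on long-horizon trajectories, underscoring the need for specialized training in agentic, trajectory-level reward modeling. Ultimately, Plan-RewardBench aims to serve as both a practical evaluation suite and a reusable blueprint for constructing agentic planning preference data.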
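
The pairwise protocol mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `PreferencePair` type, the `pairwise_accuracy` function, the toy judge, and the toy trajectories are all hypothetical, and the sketch assumes a judge that assigns one scalar score to a serialized trajectory. A pair counts as correct when the preferred trajectory scores strictly higher than the distractor, and accuracy is reported per task family.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class PreferencePair(NamedTuple):
    """One benchmark item (hypothetical schema, not the paper's data format)."""
    task_family: str  # e.g. "Safety Refusal"
    preferred: str    # validated positive trajectory, serialized as text
    distractor: str   # confusable hard-negative trajectory, serialized as text


def pairwise_accuracy(
    pairs: Iterable[PreferencePair],
    score: Callable[[str], float],
) -> Dict[str, float]:
    """Per-family fraction of pairs where the judge ranks the preferred trajectory higher."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair in pairs:
        total[pair.task_family] += 1
        # Ties count as errors: the judge must strictly prefer the positive.
        if score(pair.preferred) > score(pair.distractor):
            correct[pair.task_family] += 1
    return {family: correct[family] / total[family] for family in total}


def toy_judge(trajectory: str) -> float:
    """Stand-in judge (assumption): simply prefers shorter trajectories."""
    return -float(len(trajectory))


toy_pairs = [
    PreferencePair("Safety Refusal", "refuse", "call_tool(delete_all)"),
    PreferencePair("Complex Planning", "plan; search; book; confirm", "book"),
]
result = pairwise_accuracy(toy_pairs, toy_judge)
print(result)
```

The length-biased toy judge gets the refusal pair right and the planning pair wrong, which shows why per-family reporting is informative: a single aggregate accuracy would hide that kind of systematic bias. Bucketing by trajectory length would follow the same pattern with a length bin in place of `task_family`.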

Related papers