Fugu-MT Paper Translation (Abstract): WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents

Paper overview: WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents

arxiv url: http://arxiv.org/abs/2601.21872v1
Date: Thu, 29 Jan 2026 15:39:50 GMT
Status: Translation complete
Last updated in system: 2026-01-30 16:22:49.944272
Title: WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
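Title (reference translation): WebArbiter: A Principled Reasoning Process Reward Model for Web Agents
Authors: Yao Zhang, Shijie Tang, Zeyu Li, Zhen Han, Volker Tresp
Abstract summary: This paper introduces WebArbiter, a WebPRM that formulates reward modeling as text generation. WebArbiter generates structured justifications that conclude with a preference verdict and identifies the action most conducive to task completion.
Reference score (independently computed attention metric): 31.554790282560443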
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Web agents hold great potential for automating complex computer tasks, yet their interactions involve long-horizon, sequential decision-making with irreversible actions. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of Process Reward Models (WebPRMs) for web navigation, but existing approaches remain limited: scalar WebPRMs collapse progress into coarse, weakly grounded signals, while checklist-based WebPRMs rely on brittle template matching that fails under layout or semantic changes and often mislabels superficially correct actions as successful, providing little insight or interpretability. To address these challenges, we introduce WebArbiter, a reasoning-first, principle-inducing WebPRM that formulates reward modeling as text generation, producing structured justifications that conclude with a preference verdict and identify the action most conducive to task completion under the current context. Training follows a two-stage pipeline: reasoning distillation equips the model with coherent principle-guided reasoning, and reinforcement learning corrects teacher biases by directly aligning verdicts with correctness, enabling stronger generalization. To support systematic evaluation, we release WebPRMBench, a comprehensive benchmark spanning four diverse web environments with rich tasks and high-quality preference annotations. On WebPRMBench, WebArbiter-7B outperforms the strongest baseline, GPT-5, by 9.1 points. In reward-guided trajectory search on WebArena-Lite, it surpasses the best prior WebPRM by up to 7.2 points, underscoring its robustness and practical value in real-world complex web tasks.
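Abstract (reference translation): Web agents hold great potential for automating complex computer tasks. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of Process Reward Models (WebPRMs) for web navigation, but existing approaches remain limited. To address these challenges, WebArbiter is a reasoning-first, principle-inducing WebPRM that formulates reward modeling as text generation, produces structured justifications that conclude with a preference verdict, and identifies the action most conducive to task completion under the current context. Reasoning distillation equips the model with coherent principle-guided reasoning, and reinforcement learning corrects teacher biases by directly aligning verdicts with correctness, enabling stronger generalization. WebPRMBench is a comprehensive benchmark spanning four diverse web environments, with rich tasks and high-quality preference annotations. On WebPRMBench, WebArbiter-7B outperforms the strongest baseline, GPT-5, by 9.1 points. In reward-guided trajectory search on WebArena-Lite, it surpasses the best prior WebPRM by up to 7.2 points, underscoring its robustness and practical value in real-world complex web tasks.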

Related papers