WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
- URL: http://arxiv.org/abs/2601.21872v1
- Date: Thu, 29 Jan 2026 15:39:50 GMT
- Title: WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
- Authors: Yao Zhang, Shijie Tang, Zeyu Li, Zhen Han, Volker Tresp
- Abstract summary: We introduce WebArbiter, a principle-inducing WebPRM that formulates reward modeling as text generation. WebArbiter produces structured justifications that conclude with a preference verdict and identify the action most conducive to task completion.
- Score: 31.55
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Web agents hold great potential for automating complex computer tasks, yet their interactions involve long-horizon, sequential decision-making with irreversible actions. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of Process Reward Models (WebPRMs) for web navigation, but existing approaches remain limited: scalar WebPRMs collapse progress into coarse, weakly grounded signals, while checklist-based WebPRMs rely on brittle template matching that fails under layout or semantic changes and often mislabels superficially correct actions as successful, providing little insight or interpretability. To address these challenges, we introduce WebArbiter, a reasoning-first, principle-inducing WebPRM that formulates reward modeling as text generation, producing structured justifications that conclude with a preference verdict and identify the action most conducive to task completion under the current context. Training follows a two-stage pipeline: reasoning distillation equips the model with coherent principle-guided reasoning, and reinforcement learning corrects teacher biases by directly aligning verdicts with correctness, enabling stronger generalization. To support systematic evaluation, we release WebPRMBench, a comprehensive benchmark spanning four diverse web environments with rich tasks and high-quality preference annotations. On WebPRMBench, WebArbiter-7B outperforms the strongest baseline, GPT-5, by 9.1 points. In reward-guided trajectory search on WebArena-Lite, it surpasses the best prior WebPRM by up to 7.2 points, underscoring its robustness and practical value in real-world complex web tasks.
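The reward-guided trajectory search described in the abstract can be pictured as best-of-N action selection: a reasoning PRM compares candidate actions and emits a justification ending in a preference verdict. The sketch below is illustrative only; the `Verdict` type, `pick_action` helper, and `toy_judge` stand-in are assumptions, not the paper's actual API.

```python
# Minimal sketch of PRM-guided action selection (best-of-N), assuming a
# hypothetical `judge` callable that returns a structured justification
# ending in a preference verdict, as WebArbiter is described to do.

from dataclasses import dataclass

@dataclass
class Verdict:
    justification: str   # principle-guided reasoning text
    best_action: int     # index of the preferred candidate action

def pick_action(judge, task, observation, candidates):
    """Ask the reasoning PRM to compare candidate actions and return
    the one it judges most conducive to task completion."""
    verdict = judge(task, observation, candidates)
    return candidates[verdict.best_action], verdict.justification

# Toy stand-in judge: prefers the candidate that mentions the task goal.
def toy_judge(task, observation, candidates):
    scores = [sum(w in c for w in task.split()) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return Verdict(f"Candidate {best} best advances the goal.", best)

action, why = pick_action(
    toy_judge,
    task="add laptop to cart",
    observation="product page for laptop",
    candidates=["click('reviews')", "click('add to cart')", "scroll_down()"],
)
print(action)  # click('add to cart')
```

In a real system the toy judge would be replaced by the generative PRM itself; the key design point from the abstract is that the verdict arrives with a readable justification rather than a bare scalar.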
Related papers
- OpAgent: Operator Agent for Web Navigation [23.928869500029432]
We develop an online interaction environment and fine-tune the Vision-Language Model (VLM) using a specialized RL pipeline. We introduce a Hybrid Reward Mechanism that combines a ground-truth-agnostic WebJudge for holistic outcome assessment with a Rule-based Decision Tree (RDT) for progress reward. Notably, our RL-enhanced model achieves a 38.1% success rate (pass@5) on WebArena, outperforming all existing monolithic baselines.
arXiv Detail & Related papers (2026-02-14T02:33:55Z) - Steering LLMs via Scalable Interactive Oversight [74.12746881843044]
As Large Language Models increasingly automate complex, long-horizon tasks such as "vibe coding", a supervision gap has emerged. This presents a critical challenge in scalable oversight: enabling humans to responsibly steer AI systems on tasks that surpass their own ability to specify or verify.
arXiv Detail & Related papers (2026-02-04T04:52:00Z) - It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents [52.81924177620322]
Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), an evaluation for studying how persuasion techniques misguide autonomous web agents on realistic tasks.
arXiv Detail & Related papers (2025-12-29T01:09:10Z) - TGPO: Tree-Guided Preference Optimization for Robust Web Agent Reinforcement Learning [4.456860697635325]
Training Web Agents with reinforcement learning faces critical challenges, including credit assignment misallocation, prohibitively high annotation costs, and reward sparsity. Our framework incorporates a Process Reward Model that automatically generates fine-grained rewards through subgoal progress, redundancy detection, and action verification. Experiments on Online-Mind2Web and our self-constructed C-WebShop datasets demonstrate that TGPO significantly outperforms existing methods.
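The three reward signals named in the TGPO summary could be combined as a simple weighted sum; the sketch below is a toy illustration, with weights and the helper signature chosen for clarity rather than taken from the paper.

```python
# Toy sketch of a rule-style process reward combining the three signals the
# TGPO summary names: subgoal progress, redundancy detection, and action
# verification. Weights and the function signature are illustrative assumptions.

def process_reward(completed_subgoals, total_subgoals,
                   action, seen_actions, action_is_valid):
    progress = completed_subgoals / total_subgoals        # subgoal progress in [0, 1]
    redundancy = -0.5 if action in seen_actions else 0.0  # penalize repeated actions
    validity = 0.2 if action_is_valid else -1.0           # action verification bonus/penalty
    return progress + redundancy + validity

# Two of four subgoals done, a fresh valid action: 0.5 + 0.0 + 0.2
r = process_reward(2, 4, "click('checkout')", {"click('cart')"}, True)
print(r)  # 0.7
```

Such fine-grained per-step rewards are exactly what addresses the reward-sparsity problem the summary raises: the agent is scored at every action, not only at trajectory end.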
arXiv Detail & Related papers (2025-09-17T16:58:44Z) - WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks [7.4706262500758385]
We introduce WebArXiv, a benchmark for evaluating autonomous web agents. WebArXiv consists of 275 web-based tasks grounded in the arXiv platform. We propose a lightweight dynamic reflection mechanism that allows agents to selectively retrieve relevant past steps.
arXiv Detail & Related papers (2025-07-01T16:43:57Z) - WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback [78.55946306325914]
We identify key reasoning skills essential for effective web agents. We reconstruct the agent's reasoning algorithms into chain-of-thought rationales. Our approach yields significant improvements across multiple benchmarks.
arXiv Detail & Related papers (2025-05-26T14:03:37Z) - Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction [83.0216122783429]
Web Reconstruction (WebR) is a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents. We show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks.
arXiv Detail & Related papers (2025-04-22T04:07:13Z) - WebRollback: Enhancing Web Agents with Explicit Rollback Mechanisms [52.942566473658054]
We enhance web agents with an explicit rollback mechanism, enabling the agent to revert to a previous state in its navigation trajectory. This mechanism gives the model the flexibility to directly control the search process, leading to an effective and efficient web navigation method.
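An explicit rollback mechanism like the one this summary describes can be sketched as a stack of visited states that the agent pops back to when it judges a branch unproductive. The class and method names below are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of an explicit rollback mechanism for a web agent,
# assuming navigation states are checkpointable. Names are illustrative.

class NavigationHistory:
    """Stack of visited states; rollback pops back to an earlier one."""

    def __init__(self, start_state):
        self.states = [start_state]

    def record(self, state):
        # Called after each successful navigation step.
        self.states.append(state)

    def rollback(self, steps=1):
        # Revert up to `steps` states, keeping at least the initial state.
        steps = min(steps, len(self.states) - 1)
        del self.states[len(self.states) - steps:]
        return self.states[-1]

hist = NavigationHistory("home")
hist.record("search_results")
hist.record("wrong_product_page")
print(hist.rollback())   # search_results
print(hist.rollback(5))  # home (clamped to the initial state)
```

Giving the policy a rollback action turns greedy navigation into a controllable search over the state stack, which is the efficiency argument the summary makes.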
arXiv Detail & Related papers (2025-04-16T05:41:20Z) - R2D2: Remembering, Replaying and Dynamic Decision Making with a Reflective Agentic Memory [53.94879482534949]
Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures. Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect. Our findings suggest that a combination of memory-enhanced navigation and reflective learning promisingly advances the capabilities of web agents.
arXiv Detail & Related papers (2025-01-21T20:21:58Z) - The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature. We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z) - WebSuite: Systematically Evaluating Why Web Agents Fail [2.200477647229223]
We describe WebSuite, the first diagnostic benchmark for generalist web agents.
This benchmark suite consists of both individual tasks, such as clicking a button, and end-to-end tasks, such as adding an item to a cart.
We evaluate two popular generalist web agents, one text-based and one multimodal, and identify unique weaknesses for each agent.
arXiv Detail & Related papers (2024-06-01T00:32:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.