BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks
- URL: http://arxiv.org/abs/2510.02418v2
- Date: Tue, 07 Oct 2025 15:12:39 GMT
- Title: BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks
- Authors: Sagnik Anupam, Davis Brown, Shuo Li, Eric Wong, Hamed Hassani, Osbert Bastani,
- Abstract summary: We introduce BrowserArena, a live open-web agent evaluation platform that collects user-submitted tasks.<n>We identify three consistent failure modes: captcha resolution, pop-up banner removal, and direct navigation to URLs.<n>Our findings surface both the diversity and brittleness of current web agents.
- Score: 51.803138848305814
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM web agents now browse and take actions on the open web, yet current agent evaluations are constrained to sandboxed environments or artificial tasks. We introduce BrowserArena, a live open-web agent evaluation platform that collects user-submitted tasks, runs Arena-style head-to-head comparisons, and uses step-level human feedback to surface failure modes. Collecting and analyzing step-level annotations on the agent traces, we identify three consistent failure modes: captcha resolution, pop-up banner removal, and direct navigation to URLs. By constructing targeted datasets to further study these tasks, we discover variations in how different language models navigate these failure modes. We find, for example, that o4-mini deploys a wider variety of strategies to circumvent captcha resolution than other models and DeepSeek-R1 consistently misleads users about pop-up banner closure. Our findings surface both the diversity and brittleness of current web agents. More broadly, our benchmarking methodology provides an approach to evaluating and understanding web agent failure modes at scale.
Related papers
- Atomicity for Agents: Exposing, Exploiting, and Mitigating TOCTOU Vulnerabilities in Browser-Use Agents [15.381306470663695]
We present a large scale empirical study of TOCTOU vulnerabilities in browser-use agents.<n> Dynamic or adversarial web content can exploit this window to induce unintended actions.<n>We design a lightweight mitigation based on pre-execution validation.
arXiv Detail & Related papers (2026-02-28T05:25:03Z) - WALT: Web Agents that Learn Tools [66.73502484310121]
WALT is a framework that reverse-engineers latent website functionality into reusable invocable tools.<n>Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites.<n>On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning.
arXiv Detail & Related papers (2025-10-01T23:41:47Z) - Detecting Pipeline Failures through Fine-Grained Analysis of Web Agents [0.48156730450374763]
This work analyzes existing benchmarks and highlights the lack of fine-grained diagnostic tools.<n>We propose a modular evaluation framework that decomposes agent pipelines into interpretable stages for detailed error analysis.
arXiv Detail & Related papers (2025-09-17T19:34:49Z) - Redefining Website Fingerprinting Attacks With Multiagent LLMs [2.7117643001081437]
Website Fingerprinting (WFP) uses deep learning models to classify encrypted network traffic to infer visited websites.<n>Our study reveals that users exhibit highly diverse behaviors even on the same website, producing traffic patterns that vary significantly across individuals.
arXiv Detail & Related papers (2025-09-15T21:17:04Z) - WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks [7.4706262500758385]
We introduce WebArXiv, a benchmark for evaluating autonomous web agents.<n>WebArXiv consists of 275 web-based tasks grounded in the arXiv platform.<n>We propose a lightweight dynamic reflection mechanism that allows agents to selectively retrieve relevant past steps.
arXiv Detail & Related papers (2025-07-01T16:43:57Z) - WebRollback: Enhancing Web Agents with Explicit Rollback Mechanisms [52.942566473658054]
We enhance web agents with an explicit rollback mechanism, enabling the agent to revert back to a previous state in its navigation trajectory.<n>This mechanism gives the model the flexibility to directly control the search process, leading to an effective and efficient web navigation method.
arXiv Detail & Related papers (2025-04-16T05:41:20Z) - R2D2: Remembering, Replaying and Dynamic Decision Making with a Reflective Agentic Memory [53.94879482534949]
Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures.<n>Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect.<n>Our findings suggest that a combination of memory-enhanced navigation and reflective learning promisingly advances the capabilities of web agents.
arXiv Detail & Related papers (2025-01-21T20:21:58Z) - Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.<n>We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search.<n>We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z) - LASER: LLM Agent with State-Space Exploration for Web Navigation [57.802977310392755]
Large language models (LLMs) have been successfully adapted for interactive decision-making tasks like web navigation.
Previous methods implicitly assume a forward-only execution mode for the model, where they only provide oracle trajectories as in-context examples.
We propose to model the interactive task as state space exploration, where the LLM agent transitions among a pre-defined set of states by performing actions to complete the task.
arXiv Detail & Related papers (2023-09-15T05:44:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.