Related papers: WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

URL: http://arxiv.org/abs/2506.01952v1
Date: Mon, 02 Jun 2025 17:59:45 GMT
Title: WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks
Authors: Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, Kiyoharu Aizawa, Toshihiko Yamasaki
Abstract summary: We introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks. WebChoreArena is built on top of the fully reproducible and widely adopted four WebArena simulation environments. Our experimental results demonstrate that as LLMs evolve, significant improvements in performance are observed on WebChoreArena.
Score: 31.201406205897143
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: Can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks requiring accurate retrieval of large amounts of information in the observations, (ii) Calculation tasks demanding precise mathematical reasoning, and (iii) Long-Term Memory tasks necessitating long-term memory across multiple webpages. Built on top of the fully reproducible and widely adopted four WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results demonstrate that as LLMs evolve, represented by GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro, significant improvements in performance are observed on WebChoreArena. These findings suggest that WebChoreArena is well-suited to measure the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even with Gemini 2.5 Pro, there remains substantial room for improvement compared to WebArena, highlighting the increased challenges posed by WebChoreArena.

Related papers

Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts [59.68272935616536]
Avenir-Web is a web agent that achieves a new open-source state of the art on the Online-Mind2Web benchmark in real-world deployment. We evaluate Avenir-Web on Online-Mind2Web, a rigorous benchmark of live and user-centered web tasks.
arXiv Detail & Related papers (2026-02-02T18:50:07Z)
WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents [31.554790282560443]
We introduce WebArbiter, a principle-inducing WebPRM that formulates reward modeling as text generation. WebArbiter produces structured justifications that conclude with a preference verdict and identify the action most conducive to task completion.
arXiv Detail & Related papers (2026-01-29T15:39:50Z)
WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance [29.57207599604568]
WebCoach is a model-agnostic self-evolving framework that equips web browsing agents with persistent cross-session memory. WebCoach achieves self-evolution by continuously curating episodic memory from new navigation trajectories. Evaluations on the WebVoyager benchmark demonstrate that WebCoach consistently improves the performance of browser-use agents.
arXiv Detail & Related papers (2025-11-17T05:38:50Z)
WebDART: Dynamic Decomposition and Re-planning for Complex Web Tasks [30.48395228595732]
Large language model (LLM) agents are becoming competent at straightforward web tasks, but struggle with objectives that require long-horizon navigation, large-scale information extraction, and reasoning under constraints. We present WebDART, a general framework that enables a single LLM to handle such complex chores. WebDART lifts success rates by up to 13.7 percentage points over previous SOTA agents, while matching their performance on the easier WebArena suite.
arXiv Detail & Related papers (2025-10-08T02:34:59Z)
BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks [51.803138848305814]
We introduce BrowserArena, a live open-web agent evaluation platform that collects user-submitted tasks. We identify three consistent failure modes: captcha resolution, pop-up banner removal, and direct navigation to URLs. Our findings surface both the diversity and brittleness of current web agents.
arXiv Detail & Related papers (2025-10-02T15:22:21Z)
WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback [74.82886755416949]
We identify key reasoning skills essential for effective web agents. We reconstruct the agent's reasoning algorithms into chain-of-thought rationales. Our approach yields significant improvements across multiple benchmarks.
arXiv Detail & Related papers (2025-05-26T14:03:37Z)
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents [12.928605558358464]
We propose the first process reward model (PRM), called Web-Shepherd, to assess web navigation trajectories at the step level. In experiments, we observe that Web-Shepherd achieves about 30 points better accuracy than GPT-4o on WebRewardBench.
arXiv Detail & Related papers (2025-05-21T08:56:55Z)
A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models [45.12763718252896]
In the context of the web, leveraging AI Agents -- WebAgents -- to automatically assist people in handling tedious daily tasks can dramatically enhance productivity and efficiency. To fully explore the potential of LFMs, extensive research has emerged on WebAgents designed to complete daily web tasks according to user instructions.
arXiv Detail & Related papers (2025-03-30T08:15:44Z)
R2D2: Remembering, Reflecting and Dynamic Decision Making for Web Agents [53.94879482534949]
Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures. Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect. Our findings suggest that a combination of memory-enhanced navigation and reflective learning promisingly advances the capabilities of web agents.
arXiv Detail & Related papers (2025-01-21T20:21:58Z)
The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature. We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
WebCanvas: Benchmarking Web Agents in Online Environments [29.278363444725628]
WebCanvas is an innovative online evaluation framework for web agents. We open-source an agent framework with modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set.
arXiv Detail & Related papers (2024-06-18T07:58:33Z)
AutoWebGLM: A Large Language Model-based Web Navigating Agent [33.55199326570078]
We develop the open AutoWebGLM based on ChatGLM3-6B. Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages. We then employ a hybrid human-AI method to build web browsing data for curriculum training.
arXiv Detail & Related papers (2024-04-04T17:58:40Z)
On the Multi-turn Instruction Following for Conversational Web Agents [83.51251174629084]
We introduce a new task of Conversational Web Navigation, which necessitates sophisticated interactions that span multiple turns with both the users and the environment. We propose a novel framework, named self-reflective memory-augmented planning (Self-MAP), which employs memory utilization and self-reflection techniques.
arXiv Detail & Related papers (2024-02-23T02:18:12Z)
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [65.18602126334716]
Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots. We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z)
Multimodal Web Navigation with Instruction-Finetuned Foundation Models [99.14209521903854]
We study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages. We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning.
arXiv Detail & Related papers (2023-05-19T17:44:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.