WebCanvas: Benchmarking Web Agents in Online Environments
- URL: http://arxiv.org/abs/2406.12373v3
- Date: Tue, 16 Jul 2024 06:19:21 GMT
- Title: WebCanvas: Benchmarking Web Agents in Online Environments
- Authors: Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu,
- Abstract summary: WebCanvas is an innovative online evaluation framework for web agents.
We open-source an agent framework with modules for reasoning, providing a foundation for the community to conduct online inference and evaluations.
Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set.
- Score: 29.278363444725628
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) A novel evaluation metric which reliably capture critical intermediate actions or states necessary for task completions while disregarding noise caused by insignificant events or changed web-elements. (2) A benchmark dataset called Mind2Web-Live, a refined version of original Mind2Web static dataset containing 542 tasks with 2439 intermediate evaluation states; (3) Lightweight and generalizable annotation tools and testing pipelines that enables the community to collect and maintain the high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.
Related papers
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? [50.36826943689364]
We study whether language agents can perform realistic and time-consuming tasks on the web.
We introduce AssistantBench, a new benchmark consisting of 214 realistic tasks that can be automatically evaluated.
We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models.
arXiv Detail & Related papers (2024-07-22T15:18:45Z) - WebSuite: Systematically Evaluating Why Web Agents Fail [2.200477647229223]
We describe WebSuite, the first diagnostic benchmark for generalist web agents.
This benchmark suite consists of both individual tasks, such as clicking a button, and end-to-end tasks, such as adding an item to a cart.
We evaluate two popular generalist web agents, one text-based and one multimodal, and identify unique weaknesses for each agent.
arXiv Detail & Related papers (2024-06-01T00:32:26Z) - MMInA: Benchmarking Multihop Multimodal Internet Agents [36.173995299002]
We present MMInA, a multihop and multimodal benchmark to evaluate the embodied agents for compositional Internet tasks.
Our data includes 1,050 human-written tasks covering various domains such as shopping and travel.
Our method significantly improved both the single-hop and multihop web browsing abilities of agents.
arXiv Detail & Related papers (2024-04-15T17:59:50Z) - WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [83.19032025950986]
We study the use of large language model-based agents for interacting with software via web browsers.
WorkArena is a benchmark of 33 tasks based on the widely-used ServiceNow platform.
BrowserGym is an environment for the design and evaluation of such agents.
arXiv Detail & Related papers (2024-03-12T14:58:45Z) - Refining Segmentation On-the-Fly: An Interactive Framework for Point
Cloud Semantic Segmentation [9.832150567595718]
We present the first interactive framework for point cloud semantic segmentation, named InterPCSeg.
We develop an interaction simulation scheme tailored for the interactive point cloud semantic segmentation task.
We evaluate our framework on the S3DIS and ScanNet datasets with off-the-shelf segmentation networks.
arXiv Detail & Related papers (2024-03-11T03:24:58Z) - WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [65.18602126334716]
Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots.
We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z) - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z) - WebArena: A Realistic Web Environment for Building Autonomous Agents [92.3291458543633]
We build an environment for language-guided agents that is highly realistic and reproducible.
We focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains.
We release a set of benchmark tasks focusing on evaluating the functional correctness of task completions.
arXiv Detail & Related papers (2023-07-25T22:59:32Z) - A Real-World WebAgent with Planning, Long Context Understanding, and
Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks.
arXiv Detail & Related papers (2023-07-24T14:56:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.