WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
- URL: http://arxiv.org/abs/2401.13919v4
- Date: Thu, 6 Jun 2024 18:37:34 GMT
- Title: WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
- Authors: Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, Dong Yu
- Abstract summary: Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots.
We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
- Score: 65.18602126334716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of large language models (LLMs) has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innovation in creating advanced web agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we establish a new benchmark by compiling real-world tasks from 15 popular websites and introduce an automatic evaluation protocol leveraging multimodal understanding abilities of GPT-4V to evaluate open-ended web agents. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager. The proposed automatic evaluation metric achieves 85.3% agreement with human judgment, indicating its effectiveness in providing reliable and accurate assessments of web agents.
Related papers
- Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents [64.75036903373712]
Proposer-Agent-Evaluator is a learning system that enables foundation model agents to autonomously discover and practice skills in the wild.
At the heart of PAE is a context-aware task proposer that autonomously proposes tasks for the agent to practice with context information.
The success evaluation serves as the reward signal for the agent to refine its policies through RL.
arXiv Detail & Related papers (2024-12-17T18:59:50Z)
- The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents.
We conduct the first large-scale, multi-benchmark web agent experiment.
Results highlight a large discrepancy between OpenAI's and Anthropic's latest models.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
- AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations [18.820883566002543]
State-of-the-art multimodal web agents, powered by Multimodal Large Language Models (MLLMs), can autonomously execute many web tasks.
Current strategies for building web agents rely on (i) the generalizability of underlying MLLMs and their steerability via prompting, and (ii) large-scale fine-tuning of MLLMs on web-related tasks.
We introduce the AdaptAgent framework, which enables both proprietary and open-weights multimodal web agents to adapt to new websites and domains using a few human demonstrations.
arXiv Detail & Related papers (2024-11-20T16:54:15Z)
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space.
AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points, respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z)
- Multimodal Auto Validation For Self-Refinement in Web Agents [0.5843533603338313]
This paper introduces an approach to improving web agent performance through multi-modal validation and self-refinement.
We present a comprehensive study of different modalities (text, vision) and the effect of hierarchy for the automatic validation of web agents.
We also introduce a self-refinement mechanism for web automation, using the developed auto-validator, that enables web agents to detect and self-correct workflow failures.
arXiv Detail & Related papers (2024-10-01T13:43:55Z)
- WIPI: A New Web Threat for LLM-Driven Web Agents [28.651763099760664]
We introduce a novel threat, WIPI, that indirectly controls a Web Agent to execute malicious instructions embedded in publicly accessible webpages.
A successful WIPI attack works in a black-box environment.
Our methodology achieves an average attack success rate (ASR) exceeding 90% even in pure black-box scenarios.
arXiv Detail & Related papers (2024-02-26T19:01:54Z)
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z)
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks.
arXiv Detail & Related papers (2023-07-24T14:56:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.