GPT-4V(ision) is a Generalist Web Agent, if Grounded
- URL: http://arxiv.org/abs/2401.01614v2
- Date: Tue, 12 Mar 2024 23:14:33 GMT
- Title: GPT-4V(ision) is a Generalist Web Agent, if Grounded
- Authors: Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
- Abstract summary: We show that GPT-4V can successfully complete 51.1% of the tasks on live websites if we manually ground its textual plans into actions on the websites.
We propose SEEACT, a web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent development of large multimodal models (LMMs), especially
GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries
of multimodal models beyond traditional tasks like image captioning and visual
question answering. In this work, we explore the potential of LMMs like GPT-4V
as a generalist web agent that can follow natural language instructions to
complete tasks on any given website. We propose SEEACT, a generalist web agent
that harnesses the power of LMMs for integrated visual understanding and acting
on the web. We evaluate SEEACT on the recent MIND2WEB benchmark. In addition to
standard offline evaluation on cached websites, we enable a new online
evaluation setting by developing a tool that allows running web agents on live
websites. We show that GPT-4V presents great potential for web agents -- it
can successfully complete 51.1% of the tasks on live websites if we manually
ground its textual plans into actions on the websites. This substantially
outperforms text-only LLMs like GPT-4 or smaller models (FLAN-T5 and BLIP-2)
specifically fine-tuned for web agents. However, grounding still remains a
major challenge. Existing LMM grounding strategies like set-of-mark prompting
turn out to be ineffective for web agents, and the best grounding strategy
we develop in this paper leverages both the HTML structure and visuals. Yet,
there is still a substantial gap with oracle grounding, leaving ample room for
further improvement. All code, data, and evaluation tools are available at
https://github.com/OSU-NLP-Group/SeeAct.
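As an illustration of the pipeline the abstract describes, below is a minimal sketch of a single "generate, then ground" step: the LMM first produces a textual action plan from a screenshot, and a separate grounding step maps that plan onto a concrete HTML element before acting. This is not the SeeAct implementation (see the repository above); it assumes Playwright for browser control, and call_lmm and ground_plan are hypothetical placeholders for the multimodal-model call and for the HTML-plus-visual grounding strategy studied in the paper.

```python
# Minimal sketch of a two-stage "generate, then ground" web-agent step.
# NOT the official SeeAct implementation; Playwright is assumed for browser
# control, and call_lmm / ground_plan are hypothetical placeholders.
import re
from playwright.sync_api import sync_playwright


def call_lmm(prompt: str, screenshot: bytes) -> str:
    # Placeholder: wire this to a real multimodal model (e.g. GPT-4V or Gemini).
    # A canned plan keeps the sketch runnable without API credentials.
    return "Click the 'More information' link"


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ground_plan(plan: str, candidates: list) -> int:
    # Crude stand-in for the HTML + visual grounding strategies studied in the
    # paper: pick the candidate whose visible text best overlaps the plan.
    scores = [len(tokens(plan) & tokens(c.inner_text() or "")) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def run_step(page, task: str) -> None:
    screenshot = page.screenshot()
    # Stage 1: generation -- the LMM describes the next action in natural language.
    plan = call_lmm(f"Task: {task}\nDescribe the single next action to take.", screenshot)
    # Stage 2: grounding -- map the textual plan onto an actual element, then act.
    candidates = page.query_selector_all("a, button, input, select")
    candidates[ground_plan(plan, candidates)].click()
    page.wait_for_load_state()


if __name__ == "__main__":
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")
        run_step(page, "Learn more about example domains")  # illustrative task
        print(page.url)
        browser.close()
```

In the paper's terms, replacing ground_plan with a human corresponds to oracle grounding, the setting in which GPT-4V completes 51.1% of tasks on live websites; closing the gap between automated grounding strategies and that oracle is the open challenge the abstract highlights.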
Related papers
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space.
AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z)
- VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [115.60866817774641]
Multimodal Large Language Models (MLLMs) have shown promise in web-related tasks.
However, evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks.
VisualWebBench is a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks.
arXiv Detail & Related papers (2024-04-09T02:29:39Z)
- Design2Code: How Far Are We From Automating Front-End Engineering? [83.06100360864502]
We formalize this as a Design2Code task and conduct comprehensive benchmarking.
Specifically, we manually curate a benchmark of 484 diverse real-world webpages as test cases.
We develop a suite of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Pro Vision.
Both human evaluation and automatic metrics show that GPT-4V performs the best on this task compared to other models.
arXiv Detail & Related papers (2024-03-05T17:56:27Z)
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [65.18602126334716]
Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots.
We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z)
- AllTogether: Investigating the Efficacy of Spliced Prompt for Web Navigation using Large Language Models [2.234037966956278]
We introduce AllTogether, a standardized prompt template that enhances task context representation.
We evaluate the efficacy of this approach through prompt learning and instruction finetuning based on open-source Llama-2 and API-accessible GPT models.
arXiv Detail & Related papers (2023-10-20T11:10:14Z)
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success rate on real websites by over 50%, and that HTML-T5 is the best model for solving various HTML understanding tasks.
arXiv Detail & Related papers (2023-07-24T14:56:30Z)
- Mind2Web: Towards a Generalist Agent for the Web [25.363429937913065]
Mind2Web is the first dataset for developing and evaluating generalist agents for the web.
With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains, Mind2Web provides three necessary ingredients for building generalist web agents.
Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents.
arXiv Detail & Related papers (2023-06-09T17:44:31Z)
- Multimodal Web Navigation with Instruction-Finetuned Foundation Models [99.14209521903854]
We study data-driven offline training for web agents with vision-language foundation models.
We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages.
We empirically demonstrate that this recipe improves the agent's grounded multimodal perception, HTML comprehension, and multi-step reasoning.
arXiv Detail & Related papers (2023-05-19T17:44:34Z)