Mind2Web: Towards a Generalist Agent for the Web
- URL: http://arxiv.org/abs/2306.06070v3
- Date: Sat, 9 Dec 2023 05:57:46 GMT
- Title: Mind2Web: Towards a Generalist Agent for the Web
- Authors: Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi
Wang, Huan Sun, Yu Su
- Abstract summary: Mind2Web is the first dataset for developing and evaluating generalist agents for the web.
With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains, Mind2Web provides three necessary ingredients for building generalist web agents.
Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents.
- Score: 25.363429937913065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Mind2Web, the first dataset for developing and evaluating
generalist agents for the web that can follow language instructions to complete
complex tasks on any website. Existing datasets for web agents either use
simulated websites or only cover a limited set of websites and tasks, thus not
suitable for generalist web agents. With over 2,000 open-ended tasks collected
from 137 websites spanning 31 domains and crowdsourced action sequences for the
tasks, Mind2Web provides three necessary ingredients for building generalist
web agents: 1) diverse domains, websites, and tasks, 2) use of real-world
websites instead of simulated and simplified ones, and 3) a broad spectrum of
user interaction patterns. Based on Mind2Web, we conduct an initial exploration
of using large language models (LLMs) for building generalist web agents. While
the raw HTML of real-world websites is often too large to be fed to LLMs, we
show that first filtering it with a small LM significantly improves the
effectiveness and efficiency of LLMs. Our solution demonstrates a decent level
of performance, even on websites or entire domains the model has never seen
before, but there is still substantial room for improvement towards truly
generalizable agents. We open-source our dataset, model implementation, and
trained models (https://osu-nlp-group.github.io/Mind2Web) to facilitate further
research on building a generalist agent for the web.
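The key engineering point in the abstract is a two-stage pipeline: a small LM first ranks candidate DOM elements so that only a pruned snapshot of the page reaches the large LM. Below is a minimal sketch of that filtering idea; the off-the-shelf cross-encoder and the `filter_html` helper are illustrative stand-ins, not the paper's fine-tuned small-LM ranker.

```python
# Sketch of "filter raw HTML with a small LM before prompting the LLM".
# The cross-encoder below is a stand-in; the paper fine-tunes its own
# small LM to rank candidate elements.
from bs4 import BeautifulSoup
from sentence_transformers import CrossEncoder

ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def filter_html(html: str, task: str, top_k: int = 50) -> list[str]:
    """Return the top_k interactive DOM elements most relevant to `task`."""
    soup = BeautifulSoup(html, "html.parser")
    # Interactive elements are the natural action candidates on a page.
    candidates = [str(el) for el in soup.find_all(["a", "button", "input", "select"])]
    if not candidates:
        return []
    # The small LM scores (task, element) pairs; only the top-k survive.
    scores = ranker.predict([(task, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [element for _, element in ranked[:top_k]]
```

The surviving elements can then be serialized into the LLM prompt (for instance as a multiple-choice question over candidate elements), which is far cheaper than feeding the raw HTML.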
Related papers
- Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents [23.1522773245956]
We introduce a novel paradigm that augments language agents with model-based planning.
Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities; a minimal sketch of the resulting simulate-and-score loop follows this entry.
arXiv Detail & Related papers (2024-11-10T18:50:51Z)
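Taken at face value, this insight suggests using the LLM itself as the world model: enumerate candidate actions, ask the model to imagine each action's outcome, and pick the action whose imagined state best advances the goal. The sketch below is a guess at that loop, assuming a generic `llm(prompt) -> str` interface; it is illustrative, not WebDreamer's actual implementation.

```python
# Hypothetical simulate-and-score loop in the spirit of LLM-as-world-model
# planning; `llm` is any text-in/text-out function (assumed interface).
from typing import Callable

def plan_next_action(
    llm: Callable[[str], str],
    goal: str,
    page_summary: str,
    candidate_actions: list[str],
) -> str:
    best_action, best_score = candidate_actions[0], float("-inf")
    for action in candidate_actions:
        # World-model step: simulate the action's outcome in natural language.
        imagined = llm(
            f"Page: {page_summary}\nAction: {action}\n"
            "Describe the page state after this action."
        )
        # Value step: score the imagined state against the goal.
        reply = llm(
            f"Goal: {goal}\nImagined state: {imagined}\n"
            "Rate progress toward the goal from 0 to 10. Answer with a number."
        )
        try:
            score = float(reply.strip().split()[0])
        except ValueError:
            score = 0.0  # unparseable rating counts as no progress
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```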
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space.
AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points, respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z)
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z)
- AutoWebGLM: A Large Language Model-based Web Navigating Agent [33.55199326570078]
We develop AutoWebGLM, an open web-navigation agent built on ChatGLM3-6B.
Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages succinctly while preserving vital information (see the sketch after this entry).
We then employ a hybrid human-AI method to build web browsing data for curriculum training.
arXiv Detail & Related papers (2024-04-04T17:58:40Z)
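As a rough illustration of what an HTML simplification step can look like, the sketch below drops non-informative tags and prunes attributes to a small whitelist before truncating to a context budget. The specific tag and attribute choices here are assumptions for illustration, not AutoWebGLM's published algorithm.

```python
# Illustrative HTML simplification; the kept tags/attributes and the
# length cap are guesses, not AutoWebGLM's actual rules.
from bs4 import BeautifulSoup

KEEP_ATTRS = {"id", "name", "type", "value", "href", "aria-label", "placeholder"}

def simplify_html(html: str, max_chars: int = 4000) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove content that carries no information for navigation.
    for tag in soup(["script", "style", "svg", "noscript", "iframe"]):
        tag.decompose()
    # Keep only a small whitelist of attributes on each remaining tag.
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in KEEP_ATTRS}
    # Crude length cap so the result fits a small LM context window.
    return soup.prettify()[:max_chars]
```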
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [65.18602126334716]
Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots.
We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z)
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z)
- GPT-4V(ision) is a Generalist Web Agent, if Grounded [20.940613419944015]
We show that GPT-4V can successfully complete 51.1% of the tasks on live websites if we manually ground its textual plans into actions on the websites.
We propose SEEACT, a web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web.
arXiv Detail & Related papers (2024-01-03T08:33:09Z)
- OpenAgents: An Open Platform for Language Agents in the Wild [71.16800991568677]
We present OpenAgents, an open platform for using and hosting language agents in the wild of everyday life.
We elucidate the challenges and opportunities, aspiring to set a foundation for future research and development of real-world language agents.
arXiv Detail & Related papers (2023-10-16T17:54:53Z)
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success rate on real websites by over 50%, and that HTML-T5 is the best model for solving various HTML understanding tasks (a schematic sketch follows this entry).
arXiv Detail & Related papers (2023-07-24T14:56:30Z)
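The modular recipe above (plan, summarize, act) maps naturally onto three stages. The schematic sketch below wires them together; every interface (`llm`, `summarizer`, `get_html`, `execute`) is a hypothetical placeholder, and it shows the control flow rather than the paper's code, in which specialized models such as HTML-T5 handle planning and summarization.

```python
# Schematic of a decompose -> summarize -> act loop in the spirit of the
# WebAgent abstract; all interfaces below are assumed placeholders.
from typing import Callable

def run_webagent(
    llm: Callable[[str], str],              # planner/actor LM (assumed)
    summarizer: Callable[[str, str], str],  # HTML-T5-style snippet extractor
    get_html: Callable[[], str],            # returns the current page's HTML
    execute: Callable[[str], None],         # performs one browser action
    instruction: str,
) -> None:
    # 1) Planning: decompose the instruction into canonical sub-instructions.
    plan = llm(
        f"Instruction: {instruction}\n"
        "List the sub-steps needed to complete it, one per line."
    )
    for sub_instruction in filter(None, map(str.strip, plan.splitlines())):
        # 2) Summarization: keep only HTML snippets relevant to this step.
        snippet = summarizer(get_html(), sub_instruction)
        # 3) Acting: produce one concrete, grounded action and run it.
        action = llm(
            f"Sub-step: {sub_instruction}\nRelevant HTML: {snippet}\n"
            "Output one action, e.g. CLICK <selector> or TYPE <selector> <text>."
        )
        execute(action)
```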
This list is automatically generated from the titles and abstracts of the papers on this site.