A Real-World WebAgent with Planning, Long Context Understanding, and
Program Synthesis
- URL: http://arxiv.org/abs/2307.12856v4
- Date: Sun, 25 Feb 2024 16:17:43 GMT
- Authors: Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka
Matsuo, Douglas Eck, Aleksandra Faust
- Abstract summary: We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained large language models (LLMs) have recently achieved better
generalization and sample efficiency in autonomous web automation. However, the
performance on real-world websites has still suffered from (1) open domainness,
(2) limited context length, and (3) lack of inductive bias on HTML. We
introduce WebAgent, an LLM-driven agent that learns from self-experience to
complete tasks on real websites following natural language instructions.
WebAgent plans ahead by decomposing instructions into canonical
sub-instructions, summarizes long HTML documents into task-relevant snippets,
and acts on websites via Python programs generated from those. We design
WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new
pre-trained LLMs for long HTML documents using local and global attention
mechanisms and a mixture of long-span denoising objectives, for planning and
summarization. We empirically demonstrate that our modular recipe improves the
success on real websites by over 50%, and that HTML-T5 is the best model to
solve various HTML understanding tasks; achieving 18.7% higher success rate
than the prior method on MiniWoB web automation benchmark, and SoTA performance
on Mind2Web, an offline task planning evaluation.
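The plan, summarize, and act loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_episode`, `Step`, the `"DONE"` stop signal, and the toy stand-ins in `demo` are all hypothetical names introduced here. In the paper, planning and summarization are handled by HTML-T5 and program synthesis by Flan-U-PaLM; the sketch passes them in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    sub_instruction: str  # next sub-instruction chosen by the planner
    snippet: str          # task-relevant HTML snippet from the summarizer
    program: str          # program text produced by the synthesizer


def run_episode(
    instruction: str,
    get_html: Callable[[], str],
    plan: Callable[[str, List[str], str], str],
    summarize: Callable[[str, str], str],
    synthesize: Callable[[str, str], str],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[Step]:
    """Hypothetical modular loop: plan, summarize, synthesize, execute."""
    history: List[Step] = []
    for _ in range(max_steps):
        html = get_html()
        past = [step.sub_instruction for step in history]
        sub = plan(instruction, past, html)
        if sub == "DONE":  # assumed stop signal, not taken from the paper
            break
        snippet = summarize(sub, html)
        program = synthesize(sub, snippet)
        execute(program)
        history.append(Step(sub, snippet, program))
    return history


def demo() -> Tuple[List[Step], List[str]]:
    """Run the loop with toy stand-ins in place of the language models."""
    page = "<html><body><input id='q'><button id='go'>Search</button></body></html>"
    script = ["type the query into #q", "click #go"]
    executed: List[str] = []

    def plan(instruction: str, past: List[str], html: str) -> str:
        # Toy planner: replay a fixed script, then stop.
        return script[len(past)] if len(past) < len(script) else "DONE"

    def summarize(sub: str, html: str) -> str:
        # Toy summarizer: keep only the opening tag whose id is mentioned.
        target = sub.split("#")[-1]
        pos = html.find(f"id='{target}'")
        return html[html.rfind("<", 0, pos): html.find(">", pos) + 1]

    def synthesize(sub: str, snippet: str) -> str:
        # Toy synthesizer: emit a comment-only placeholder program.
        return f"# {sub}\n# grounded on: {snippet}"

    history = run_episode(
        "search for something", lambda: page,
        plan, summarize, synthesize, executed.append,
    )
    return history, executed
```

Keeping the three stages as separate callables mirrors the modular recipe the abstract describes: each stage can be swapped or evaluated on its own, and the program synthesizer only ever sees the short snippet rather than the full HTML document.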
Related papers
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z)
- CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only [21.054681757006385]
Large Language Models (LLMs) with advanced reasoning capabilities have set the stage for agents to undertake more complex and previously unseen tasks.
We propose an agent that functions solely on the basis of screenshots for recognizing environments.
We achieve a success rate of 94.4% on 67 types of MiniWoB++ problems, utilizing only 1.48 demonstrations per problem type.
arXiv Detail & Related papers (2024-06-11T05:21:20Z)
- AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation [55.86438100985539]
We introduce a crawler generation task for vertical information web pages.
We propose AutoCrawler, a two-stage framework that leverages the hierarchical structure of HTML for progressive understanding.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
- AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent [33.55199326570078]
AutoWebGLM is an automated web navigation agent built upon ChatGLM3-6B.
Inspired by human browsing patterns, we design an HTML simplification algorithm to represent webpages.
For testing, we establish a bilingual benchmark -- AutoWebBench -- for real-world web browsing tasks.
arXiv Detail & Related papers (2024-04-04T17:58:40Z)
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [65.18602126334716]
Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots.
We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z)
- GPT-4V(ision) is a Generalist Web Agent, if Grounded [20.940613419944015]
We show that GPT-4V can successfully complete 51.1% of the tasks on live websites if we manually ground its textual plans into actions on the websites.
We propose SEEACT, a web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web.
arXiv Detail & Related papers (2024-01-03T08:33:09Z)
- Mind2Web: Towards a Generalist Agent for the Web [25.363429937913065]
Mind2Web is the first dataset for developing and evaluating generalist agents for the web.
With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains, Mind2Web provides three necessary ingredients for building generalist web agents.
Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents.
arXiv Detail & Related papers (2023-06-09T17:44:31Z)
- Multimodal Web Navigation with Instruction-Finetuned Foundation Models [99.14209521903854]
We study data-driven offline training for web agents with vision-language foundation models.
We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages.
We empirically demonstrate this recipe improves the agent's ability of grounded multimodal perception, HTML comprehension, and multi-step reasoning.
arXiv Detail & Related papers (2023-05-19T17:44:34Z)
- Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.