AutoWebGLM: A Large Language Model-based Web Navigating Agent
- URL: http://arxiv.org/abs/2404.03648v2
- Date: Sat, 12 Oct 2024 05:05:38 GMT
- Title: AutoWebGLM: A Large Language Model-based Web Navigating Agent
- Authors: Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang
- Abstract summary: We develop the open AutoWebGLM based on ChatGLM3-6B.
Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages.
We then employ a hybrid human-AI method to build web browsing data for curriculum training.
- Score: 33.55199326570078
- Abstract: Large language models (LLMs) have fueled many intelligent web agents, but most existing ones perform far from satisfactorily in real-world web navigation tasks due to three factors: (1) the complexity of HTML text data, (2) the versatility of actions on webpages, and (3) task difficulty due to the open-domain nature of the web. In light of these challenges, we develop the open AutoWebGLM based on ChatGLM3-6B. AutoWebGLM can serve as a powerful automated web navigation agent that outperforms GPT-4. Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages with vital information preserved succinctly. We then employ a hybrid human-AI method to build web browsing data for curriculum training. Finally, we bootstrap the model with reinforcement learning and rejection sampling to further improve its webpage comprehension, browser operation, and efficient task decomposition. For comprehensive evaluation, we establish a bilingual benchmark -- AutoWebBench -- for real-world web navigation tasks. We evaluate AutoWebGLM across diverse web navigation benchmarks, demonstrating its potential to tackle challenging tasks in real environments. Related code, model, and data are released at https://github.com/THUDM/AutoWebGLM.
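The abstract names an HTML simplification algorithm but gives no details here, so the following is only a minimal sketch of such a pass, not the released algorithm: it flattens a page into short, indexed lines, keeping interactive elements and text-bearing leaves. It assumes BeautifulSoup; the tag lists, truncation length, and indexing scheme are illustrative choices.

```python
from bs4 import BeautifulSoup

KEEP_TAGS = ["a", "button", "input", "select", "textarea"]  # assumed "vital" interactive tags
DROP_TAGS = ["script", "style", "svg", "noscript", "iframe"]  # assumed non-content tags

def simplify_html(html: str, max_text_len: int = 64) -> str:
    """Flatten a page into short, indexed lines an LLM can act on."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(DROP_TAGS):
        tag.decompose()  # drop non-content subtrees entirely
    lines = []
    for i, el in enumerate(soup.find_all(True)):
        text = el.get_text(" ", strip=True)[:max_text_len]
        # Keep interactive elements and text-bearing leaves; skip pure wrappers.
        if el.name in KEEP_TAGS or (text and el.find(True) is None):
            lines.append(f"[{i}] <{el.name}> {text}")
    return "\n".join(lines)

print(simplify_html(
    "<html><body><script>track()</script>"
    "<a href='/cart'>Cart</a><p>Hello world</p></body></html>"
))
```

Indexing each kept element lets a model refer to targets compactly (e.g. "click [2]") instead of quoting raw HTML.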
Related papers
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z)
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that handles diverse and changing web environments more efficiently; a toy sketch of this idea follows the entry.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
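As a toy illustration of the scraper-generation paradigm above (not AutoScraper's actual two-stage method), the sketch below asks a model for a CSS selector and then applies it with BeautifulSoup; the llm() stub, prompt wording, and example page are assumptions.

```python
from bs4 import BeautifulSoup

def llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call; returns a canned selector."""
    return "div.product > span.price"

def generate_scraper(html_snippet: str, field: str) -> str:
    # One-shot selector generation; real systems iterate and verify.
    prompt = (f"Given this HTML, output only one CSS selector that extracts "
              f"the {field}:\n{html_snippet}")
    return llm(prompt).strip()

def scrape(html: str, selector: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(selector)]

html = '<div class="product"><span class="price">$9.99</span></div>'
selector = generate_scraper(html, "product price")
print(scrape(html, selector))  # ['$9.99']
```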
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [65.18602126334716]
Existing web agents typically handle only one input modality and are evaluated only in simplified web simulators or static web snapshots.
We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z)
- AllTogether: Investigating the Efficacy of Spliced Prompt for Web Navigation using Large Language Models [2.234037966956278]
We introduce AllTogether, a standardized prompt template that enhances task context representation.
We evaluate the efficacy of this approach through prompt learning and instruction finetuning based on open-source Llama-2 and API-accessible GPT models.
arXiv Detail & Related papers (2023-10-20T11:10:14Z)
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success rate on real websites by over 50%, and that HTML-T5 is the best model for solving various HTML understanding tasks; a schematic sketch of the plan-summarize-act loop follows this entry.
arXiv Detail & Related papers (2023-07-24T14:56:30Z)
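The loop below is a schematic sketch of the decompose-summarize-act recipe described in this entry, not WebAgent's implementation; the llm() stub, prompt wording, and action format are assumptions.

```python
def llm(prompt: str) -> str:
    """Stand-in for a real model call; returns canned output for the demo."""
    if prompt.startswith("Decompose"):
        return "open the search page\ntype the query\nclick the first result"
    if prompt.startswith("Summarize"):
        return "<input id='q'> <button id='go'>Search</button>"
    return "click('#go')"

def run(instruction: str, html: str) -> list[str]:
    # 1) Plan: split the instruction into canonical sub-instructions.
    subtasks = llm(f"Decompose the task into steps:\n{instruction}").splitlines()
    actions = []
    for sub in subtasks:
        # 2) Summarize: condense long HTML to a task-relevant snippet.
        snippet = llm(f"Summarize this HTML, keeping parts relevant to '{sub}':\n{html}")
        # 3) Act: emit one browser action for the current sub-instruction.
        actions.append(llm(f"Given {snippet!r}, output one action for: {sub}"))
    return actions

print(run("search for AutoWebGLM", "<html>...</html>"))
```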
- Mind2Web: Towards a Generalist Agent for the Web [25.363429937913065]
Mind2Web is the first dataset for developing and evaluating generalist agents for the web.
With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains, Mind2Web provides three necessary ingredients for building generalist web agents.
Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents.
arXiv Detail & Related papers (2023-06-09T17:44:31Z)
- Multimodal Web Navigation with Instruction-Finetuned Foundation Models [99.14209521903854]
We study data-driven offline training for web agents with vision-language foundation models.
We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages.
We empirically demonstrate that this recipe improves the agent's capabilities in grounded multimodal perception, HTML comprehension, and multi-step reasoning.
arXiv Detail & Related papers (2023-05-19T17:44:34Z)
- Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.