AutoWebGLM: A Large Language Model-based Web Navigating Agent
- URL: http://arxiv.org/abs/2404.03648v2
- Date: Sat, 12 Oct 2024 05:05:38 GMT
- Title: AutoWebGLM: A Large Language Model-based Web Navigating Agent
- Authors: Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, Jie Tang
- Abstract summary: We develop the open AutoWebGLM based on ChatGLM3-6B.
Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages.
We then employ a hybrid human-AI method to build web browsing data for curriculum training.
- Score: 33.55199326570078
- Abstract: Large language models (LLMs) have fueled many intelligent web agents, but most existing ones perform far from satisfactorily in real-world web navigation tasks due to three factors: (1) the complexity of HTML text data, (2) the versatility of actions on webpages, and (3) task difficulty due to the open-domain nature of the web. In light of these challenges, we develop the open AutoWebGLM based on ChatGLM3-6B. AutoWebGLM can serve as a powerful automated web navigation agent that outperforms GPT-4. Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages with vital information preserved succinctly. We then employ a hybrid human-AI method to build web browsing data for curriculum training. Finally, we bootstrap the model with reinforcement learning and rejection sampling to further improve its webpage comprehension, browser operation, and efficient task decomposition. For comprehensive evaluation, we establish a bilingual benchmark -- AutoWebBench -- for real-world web navigation tasks. We evaluate AutoWebGLM across diverse web navigation benchmarks, demonstrating its potential to tackle challenging tasks in real environments. Related code, model, and data are released at https://github.com/THUDM/AutoWebGLM.
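The abstract names an HTML simplification algorithm but gives no details here, so the following is only a minimal sketch of such a pass, not the released algorithm: it flattens a page into short, indexed lines, keeping interactive elements and text-bearing leaves. It assumes BeautifulSoup; the tag lists, truncation length, and indexing scheme are illustrative choices.

```python
from bs4 import BeautifulSoup

KEEP_TAGS = ["a", "button", "input", "select", "textarea"]  # assumed "vital" interactive tags
DROP_TAGS = ["script", "style", "svg", "noscript", "iframe"]  # assumed non-content tags

def simplify_html(html: str, max_text_len: int = 64) -> str:
    """Flatten a page into short, indexed lines an LLM can act on."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(DROP_TAGS):
        tag.decompose()  # drop non-content subtrees entirely
    lines = []
    for i, el in enumerate(soup.find_all(True)):
        text = el.get_text(" ", strip=True)[:max_text_len]
        # Keep interactive elements and text-bearing leaves; skip pure wrappers.
        if el.name in KEEP_TAGS or (text and el.find(True) is None):
            lines.append(f"[{i}] <{el.name}> {text}")
    return "\n".join(lines)

print(simplify_html(
    "<html><body><script>track()</script>"
    "<a href='/cart'>Cart</a><p>Hello world</p></body></html>"
))
```

Indexing each kept element lets a model refer to targets compactly (e.g. "click [2]") instead of quoting raw HTML.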
Related papers
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z)
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that handles diverse and changing web environments more efficiently; a toy sketch of this idea follows the entry.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
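As a toy illustration of the scraper-generation paradigm above (not AutoScraper's actual two-stage method), the sketch below asks a model for a CSS selector and then applies it with BeautifulSoup; the llm() stub, prompt wording, and example page are assumptions.

```python
from bs4 import BeautifulSoup

def llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call; returns a canned selector."""
    return "div.product > span.price"

def generate_scraper(html_snippet: str, field: str) -> str:
    # One-shot selector generation; real systems iterate and verify.
    prompt = (f"Given this HTML, output only one CSS selector that extracts "
              f"the {field}:\n{html_snippet}")
    return llm(prompt).strip()

def scrape(html: str, selector: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(selector)]

html = '<div class="product"><span class="price">$9.99</span></div>'
selector = generate_scraper(html, "product price")
print(scrape(html, selector))  # ['$9.99']
```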
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [65.18602126334716]
Existing web agents typically handle only one input modality and are evaluated only in simplified web simulators or static web snapshots.
We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z)
- AllTogether: Investigating the Efficacy of Spliced Prompt for Web Navigation using Large Language Models [2.234037966956278]
We introduce AllTogether, a standardized prompt template that enhances task context representation.
We evaluate the efficacy of this approach through prompt learning and instruction finetuning based on open-source Llama-2 and API-accessible GPT models.
arXiv Detail & Related papers (2023-10-20T11:10:14Z)
- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success rate on real websites by over 50%, and that HTML-T5 is the best model for solving various HTML understanding tasks; a schematic sketch of the plan-summarize-act loop follows this entry.
arXiv Detail & Related papers (2023-07-24T14:56:30Z)
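The loop below is a schematic sketch of the decompose-summarize-act recipe described in this entry, not WebAgent's implementation; the llm() stub, prompt wording, and action format are assumptions.

```python
def llm(prompt: str) -> str:
    """Stand-in for a real model call; returns canned output for the demo."""
    if prompt.startswith("Decompose"):
        return "open the search page\ntype the query\nclick the first result"
    if prompt.startswith("Summarize"):
        return "<input id='q'> <button id='go'>Search</button>"
    return "click('#go')"

def run(instruction: str, html: str) -> list[str]:
    # 1) Plan: split the instruction into canonical sub-instructions.
    subtasks = llm(f"Decompose the task into steps:\n{instruction}").splitlines()
    actions = []
    for sub in subtasks:
        # 2) Summarize: condense long HTML to a task-relevant snippet.
        snippet = llm(f"Summarize this HTML, keeping parts relevant to '{sub}':\n{html}")
        # 3) Act: emit one browser action for the current sub-instruction.
        actions.append(llm(f"Given {snippet!r}, output one action for: {sub}"))
    return actions

print(run("search for AutoWebGLM", "<html>...</html>"))
```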
- Mind2Web: Towards a Generalist Agent for the Web [25.363429937913065]
Mind2Web is the first dataset for developing and evaluating generalist agents for the web.
With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains, Mind2Web provides three necessary ingredients for building generalist web agents.
Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents.
arXiv Detail & Related papers (2023-06-09T17:44:31Z)
- Multimodal Web Navigation with Instruction-Finetuned Foundation Models [99.14209521903854]
We study data-driven offline training for web agents with vision-language foundation models.
We propose an instruction-following multimodal agent, WebGUM, that observes both webpage screenshots and HTML pages.
We empirically demonstrate that this recipe improves the agent's capabilities in grounded multimodal perception, HTML comprehension, and multi-step reasoning.
arXiv Detail & Related papers (2023-05-19T17:44:34Z)
- Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.