AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
- URL: http://arxiv.org/abs/2404.12753v1
- Date: Fri, 19 Apr 2024 09:59:44 GMT
- Title: AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
- Authors: Wenhao Huang, Chenghao Peng, Zhixu Li, Jiaqing Liang, Yanghua Xiao, Liqian Wen, Zulong Chen,
- Abstract summary: We introduce a crawler generation task for vertical information web pages.
We propose AutoCrawler, a two-stage framework that leverages the hierarchical structure of HTML for progressive understanding.
- Score: 55.86438100985539
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Web automation is a significant technique that accomplishes complicated web tasks by automating common web actions, enhancing operational efficiency, and reducing the need for manual intervention. Traditional methods, such as wrappers, suffer from limited adaptability and scalability when faced with a new website. On the other hand, generative agents empowered by large language models (LLMs) exhibit poor performance and reusability in open-world scenarios. In this work, we introduce a crawler generation task for vertical information web pages and the paradigm of combining LLMs with crawlers, which helps crawlers handle diverse and changing web environments more efficiently. We propose AutoCrawler, a two-stage framework that leverages the hierarchical structure of HTML for progressive understanding. Through top-down and step-back operations, AutoCrawler can learn from erroneous actions and continuously prune HTML for better action generation. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Resources of this paper can be found at \url{https://github.com/EZ-hwh/AutoCrawler}
Related papers
- WebWalker: Benchmarking LLMs in Web Traversal [64.48425443951749]
We introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal.
We propose WebWalker, which is a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm.
arXiv Detail & Related papers (2025-01-13T18:58:07Z) - Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z) - WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs [49.91550773480978]
This paper introduces WebCode2M, a new dataset comprising 2.56 million instances, each containing a design image along with the corresponding webpage code and layout details.
To validate the effectiveness of WebCode2M, we introduce a baseline model based on the Vision Transformer (ViT), named WebCoder, and establish a benchmark for fair comparison.
The benchmarking results demonstrate that our dataset significantly improves the ability of MLLMs to generate code from webpage designs.
arXiv Detail & Related papers (2024-04-09T15:05:48Z) - VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [115.60866817774641]
Multimodal Large Language models (MLLMs) have shown promise in web-related tasks.
evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks.
bench is a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks.
arXiv Detail & Related papers (2024-04-09T02:29:39Z) - AutoWebGLM: A Large Language Model-based Web Navigating Agent [33.55199326570078]
We develop the open AutoWebGLM based on ChatGLM3-6B.
Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages.
We then employ a hybrid human-AI method to build web browsing data for curriculum training.
arXiv Detail & Related papers (2024-04-04T17:58:40Z) - On the Multi-turn Instruction Following for Conversational Web Agents [83.51251174629084]
We introduce a new task of Conversational Web Navigation, which necessitates sophisticated interactions that span multiple turns with both the users and the environment.
We propose a novel framework, named self-reflective memory-augmented planning (Self-MAP), which employs memory utilization and self-reflection techniques.
arXiv Detail & Related papers (2024-02-23T02:18:12Z) - Cleaner Pretraining Corpus Curation with Neural Web Scraping [39.97459187762505]
This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) to help extract primary and clean text contents from webpages.
Experimental results show that NeuScraper surpasses the baseline scrapers by achieving more than a 20% improvement.
arXiv Detail & Related papers (2024-02-22T16:04:03Z) - A Real-World WebAgent with Planning, Long Context Understanding, and
Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks.
arXiv Detail & Related papers (2023-07-24T14:56:30Z) - Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.