WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents
- URL: http://arxiv.org/abs/2504.12682v1
- Date: Thu, 17 Apr 2025 06:16:40 GMT
- Title: WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents
- Authors: Arth Bohra, Manvel Saroyan, Danil Melkozerov, Vahe Karufanyan, Gabriel Maher, Pascal Weinberger, Artem Harutyunyan, Giovanni Campagna,
- Abstract summary: We present WebLists, a benchmark of 200 data-extraction tasks across four common business and enterprise use-cases.<n>We show that both LLMs with search capabilities and SOTA web agents struggle with these tasks, with a recall of 3% and 31%, respectively.<n>We propose BardeenAgent, a novel framework that enables web agents to convert their execution into repeatable programs, and replay them at scale across pages with similar structure.<n>On the WebLists benchmark, BardeenAgent achieves 66% recall overall, more than doubling the performance of SOTA web agents, and reducing cost per output row by 3x.
- Score: 1.6673034682613495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most recent web agent research has focused on navigation and transaction tasks, with little emphasis on extracting structured data at scale. We present WebLists, a benchmark of 200 data-extraction tasks across four common business and enterprise use-cases. Each task requires an agent to navigate to a webpage, configure it appropriately, and extract complete datasets with well-defined schemas. We show that both LLMs with search capabilities and SOTA web agents struggle with these tasks, with a recall of 3% and 31%, respectively, despite higher performance on question-answering tasks. To address this challenge, we propose BardeenAgent, a novel framework that enables web agents to convert their execution into repeatable programs, and replay them at scale across pages with similar structure. BardeenAgent is also the first LLM agent to take advantage of the regular structure of HTML. In particular BardeenAgent constructs a generalizable CSS selector to capture all relevant items on the page, then fits the operations to extract the data. On the WebLists benchmark, BardeenAgent achieves 66% recall overall, more than doubling the performance of SOTA web agents, and reducing cost per output row by 3x.
Related papers
- PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC [98.82146219495792]
In this paper, we propose a hierarchical agent framework named PC-Agent.
From the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content.
From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture.
arXiv Detail & Related papers (2025-02-20T05:41:55Z) - AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials [53.376263056033046]
Existing approaches rely on expensive human annotation, making them unsustainable at scale.
We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials.
Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z) - ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data [18.129300915372415]
Large Language Model (LLM) agents are rapidly improving to handle increasingly complex web-based tasks.<n>General-purpose LLMs are not specifically trained to understand specialized web contexts such as HTML.<n>We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data collected from over 250 domains corresponding to 6 billion tokens.
arXiv Detail & Related papers (2024-11-22T15:26:23Z) - Infogent: An Agent-Based Framework for Web Information Aggregation [59.67710556177564]
We introduce Infogent, a novel framework for web information aggregation.
Experiments on different information access settings demonstrate Infogent beats an existing SOTA multi-agent search framework by 7%.
arXiv Detail & Related papers (2024-10-24T18:01:28Z) - Beyond Browsing: API-Based Web Agents [58.39129004543844]
API-based agents outperform web browsing agents in experiments on WebArena.<n>Hybrid Agents out-perform both others nearly uniformly across tasks.<n>Results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.
arXiv Detail & Related papers (2024-10-21T19:46:06Z) - AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space.
AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively.
arXiv Detail & Related papers (2024-10-17T17:50:38Z) - AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing methods, wrappers-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z) - AutoWebGLM: A Large Language Model-based Web Navigating Agent [33.55199326570078]
We develop the open AutoWebGLM based on ChatGLM3-6B.
Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages.
We then employ a hybrid human-AI method to build web browsing data for curriculum training.
arXiv Detail & Related papers (2024-04-04T17:58:40Z) - A Real-World WebAgent with Planning, Long Context Understanding, and
Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks.
arXiv Detail & Related papers (2023-07-24T14:56:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.