Related papers: WebWalker: Benchmarking LLMs in Web Traversal

URL: http://arxiv.org/abs/2501.07572v2
Date: Tue, 14 Jan 2025 15:06:56 GMT
Title: WebWalker: Benchmarking LLMs in Web Traversal
Authors: Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang
Abstract summary: We introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. We propose WebWalker, which is a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm.
Score: 64.48425443951749
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website's subpages to extract high-quality data systematically. We propose WebWalker, which is a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of RAG combined with WebWalker, through horizontal and vertical integration in real-world scenarios.

Related papers

Nested Browser-Use Learning for Agentic Information Seeking [60.775556172513014]
Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching. We propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure.
Date: 2025-12-29T17:59:14Z
WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality [62.43165871914528]
We introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics. In-depth analysis indicates this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias.
Date: 2025-10-21T12:16:04Z
Temac: Multi-Agent Collaboration for Automated Web GUI Testing [10.661373474430604]
We propose Temac, an approach that enhances automated web GUI testing (AWGT) using large language models (LLMs) to increase code coverage. Our evaluation results show that Temac exceeds state-of-the-art approaches by 12.5% to 60.3% in average code coverage on six complex open-source web applications.
Date: 2025-05-31T11:43:37Z
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space. AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points respectively.
Date: 2024-10-17T17:50:38Z
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio. Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. We propose a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
Date: 2024-06-28T17:59:46Z
AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts. Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website. We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
Date: 2024-04-19T09:59:44Z
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [115.60866817774641]
Multimodal large language models (MLLMs) have shown promise in web-related tasks. Evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. VisualWebBench is a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks.
Date: 2024-04-09T02:29:39Z
AutoWebGLM: A Large Language Model-based Web Navigating Agent [33.55199326570078]
We develop the open AutoWebGLM based on ChatGLM3-6B. Inspired by human browsing patterns, we first design an HTML simplification algorithm to represent webpages. We then employ a hybrid human-AI method to build web browsing data for curriculum training.
Date: 2024-04-04T17:58:40Z
On the Multi-turn Instruction Following for Conversational Web Agents [83.51251174629084]
We introduce a new task of Conversational Web Navigation, which necessitates sophisticated interactions that span multiple turns with both the users and the environment. We propose a novel framework, named self-reflective memory-augmented planning (Self-MAP), which employs memory utilization and self-reflection techniques.
Date: 2024-02-23T02:18:12Z
AllTogether: Investigating the Efficacy of Spliced Prompt for Web Navigation using Large Language Models [2.234037966956278]
We introduce AllTogether, a standardized prompt template that enhances task context representation. We evaluate the efficacy of this approach through prompt learning and instruction finetuning based on open-source Llama-2 and API-accessible GPT models.
Date: 2023-10-20T11:10:14Z
LASER: LLM Agent with State-Space Exploration for Web Navigation [57.802977310392755]
Large language models (LLMs) have been successfully adapted for interactive decision-making tasks like web navigation. Previous methods implicitly assume a forward-only execution mode for the model, where they only provide oracle trajectories as in-context examples. We propose to model the interactive task as state space exploration, where the LLM agent transitions among a pre-defined set of states by performing actions to complete the task.
Date: 2023-09-15T05:44:08Z

This list is automatically generated from the titles and abstracts of the papers on this site.