Related papers: Needle in the Web: A Benchmark for Retrieving Targeted Web Pages in the Wild

URL: http://arxiv.org/abs/2512.16553v1
Date: Thu, 18 Dec 2025 13:57:28 GMT
Title: Needle in the Web: A Benchmark for Retrieving Targeted Web Pages in the Wild
Authors: Yumeng Wang, Tianyu Fan, Lingrui Xu, Chao Huang
Abstract summary: Needle in the Web is a novel benchmark designed to evaluate modern search agents and LLM-based systems on their ability to retrieve and reason over real-world web content. We benchmark three leading LLMs and three agent-based search systems on Needle in the Web, finding that most models struggle. These findings reveal that Needle in the Web presents a significant challenge for current search systems.
Score: 9.91566589898295
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have evolved from simple chatbots into sophisticated agents capable of automating complex real-world tasks, where browsing and reasoning over live web content is key to assessing retrieval and cognitive skills. Existing benchmarks like BrowseComp and xBench-DeepSearch emphasize complex reasoning searches requiring multi-hop synthesis but neglect Fuzzy Exploratory Search, namely queries that are vague and multifaceted, where users seek the most relevant webpage rather than a single factual answer. To address this gap, we introduce Needle in the Web, a novel benchmark specifically designed to evaluate modern search agents and LLM-based systems on their ability to retrieve and reason over real-world web content in response to ambiguous, exploratory queries under varying levels of difficulty. Needle in the Web comprises 663 questions spanning seven distinct domains. To ensure high query quality and answer uniqueness, we employ a flexible methodology that reliably generates queries of controllable difficulty based on factual claims of web contents. We benchmark three leading LLMs and three agent-based search systems on Needle in the Web, finding that most models struggle: many achieve below 35% accuracy, and none consistently excel across domains or difficulty levels. These findings reveal that Needle in the Web presents a significant challenge for current search systems and highlights the open problem of effective fuzzy retrieval under semantic ambiguity.

Related papers

Revisiting Text Ranking in Deep Research [24.324221566628125]
Black-box web search APIs hinder systematic analysis of search components. We reproduce a selection of key findings and best practices for IR text ranking methods in the deep research setting.
arXiv Detail & Related papers (2026-02-25T00:18:07Z)
LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News [29.74044158672979]
Large Language Models (LLMs) with agentic web search capabilities show strong potential for tasks requiring real-time information access and complex fact retrieval. We introduce LiveNewsBench, a benchmark designed to assess the agentic web search abilities of LLMs. LiveNewsBench automatically generates fresh question-answer pairs from recent news articles.
arXiv Detail & Related papers (2026-02-14T01:18:51Z)
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search [61.77858432092777]
We present DeepMMSearch-R1, the first multimodal large language model capable of performing on-demand, multi-turn web searches. DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image, making the image search more effective. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach.
arXiv Detail & Related papers (2025-10-14T17:59:58Z)
WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents [57.203515352080295]
We introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. Our model supports 128K context length and up to 100 tool calling turns, enabling long-horizon problem solving. As an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training.
arXiv Detail & Related papers (2025-09-08T10:07:03Z)
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent [68.3311163530321]
Web agents such as Deep Research have demonstrated cognitive abilities, capable of solving highly challenging information-seeking problems. This makes multimodal Deep Research highly challenging, as such agents require much stronger reasoning abilities in perception, logic, knowledge. We introduce WebWatcher, a multi-modal Agent for Deep Research equipped with enhanced visual-language reasoning capabilities.
arXiv Detail & Related papers (2025-08-07T18:03:50Z)
ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework [73.91207117772291]
ManuSearch is a transparent and modular multi-agent framework designed to democratize deep search for large language models (LLMs). ManuSearch decomposes the search and reasoning process into three collaborative agents: (1) a solution planning agent that iteratively formulates sub-queries, (2) an Internet search agent that retrieves relevant documents via real-time web search, and (3) a structured webpage reading agent that extracts key evidence from raw web content.
arXiv Detail & Related papers (2025-05-23T17:02:02Z)
Level-Navi Agent: A Framework and benchmark for Chinese Web Search Agents [9.003325286793288]
Large language models (LLMs), adopted to understand human language, drive the development of artificial intelligence (AI) web search agents. We propose a general-purpose and training-free web search agent by level-aware navigation, Level-Navi Agent, accompanied by a well-annotated dataset (Web24) and a suitable evaluation metric.
arXiv Detail & Related papers (2024-12-20T08:03:12Z)
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher [50.68599514830046]
We introduce MindSearch to mimic the human minds in web information seeking and integration. The framework can be instantiated by a simple yet effective LLM-based multi-agent framework. MindSearch demonstrates significant improvement in the response quality in terms of depth and breadth.
arXiv Detail & Related papers (2024-07-29T17:12:40Z)
Exposing Query Identification for Search Transparency [69.06545074617685]
We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems. We derive an evaluation metric to measure the quality of a ranking of exposing queries, and conduct an empirical analysis focusing on various practical aspects of approximate EQI.
arXiv Detail & Related papers (2021-10-14T20:19:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.