Related papers: LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News

LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News

URL: http://arxiv.org/abs/2602.13543v1
Date: Sat, 14 Feb 2026 01:18:51 GMT
Title: LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News
Authors: Yunfan Zhang, Kathleen McKeown, Smaranda Muresan,
Abstract summary: Large Language Models (LLMs) with agentic web search capabilities show strong potential for tasks requiring real-time information access and complex fact retrieval.<n>We introduce bench, a benchmark designed to assess the agentic web search abilities of LLMs.<n>bench automatically generates fresh question-answer pairs from recent news articles.
Score: 29.74044158672979
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large Language Models (LLMs) with agentic web search capabilities show strong potential for tasks requiring real-time information access and complex fact retrieval, yet evaluating such systems remains challenging. We introduce \bench, a rigorous and regularly updated benchmark designed to assess the agentic web search abilities of LLMs. \bench automatically generates fresh question-answer pairs from recent news articles, ensuring that questions require information beyond an LLM's training data and enabling clear separation between internal knowledge and search capability. The benchmark features intentionally difficult questions requiring multi-hop search queries, page visits, and reasoning, making it well-suited for evaluating agentic search behavior. Our automated data curation and question generation pipeline enables frequent benchmark updates and supports construction of a large-scale training dataset for agentic web search models, addressing the scarcity of such data in the research community. To ensure reliable evaluation, we include a subset of human-verified samples in the test set. We evaluate a broad range of systems using \bench, including commercial and open-weight LLMs as well as LLM-based web search APIs. The leaderboard, datasets, and code are publicly available at livenewsbench.com.

Related papers

GISA: A Benchmark for General Information-Seeking Assistant [102.30831921333755]
GISA is a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries.<n>It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization.<n>Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only 19.30% exact match score.
arXiv Detail & Related papers (2026-02-09T11:44:15Z)
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search [61.77858432092777]
We present DeepMMSearch-R1, the first multimodal large language model capable of performing on-demand, multi-turn web searches.<n>DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image making the image search more effective.<n>We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach.
arXiv Detail & Related papers (2025-10-14T17:59:58Z)
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation [63.55258191625131]
InfoDeepSeek is a new benchmark for assessing agentic information seeking in real-world, dynamic web environments.<n>We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity.<n>We develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics about the accuracy, utility, and compactness of information seeking outcomes.
arXiv Detail & Related papers (2025-05-21T14:44:40Z)
ZeroSearch: Incentivize the Search Capability of LLMs without Searching [69.55482019211597]
We introduce ZeroSearch, a framework that incentivizes the capabilities of large language models to use a real search engine with simulated searches during training.<n>Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents.
arXiv Detail & Related papers (2025-05-07T17:30:22Z)
Level-Navi Agent: A Framework and benchmark for Chinese Web Search Agents [9.003325286793288]
Large language models (LLMs), adopted to understand human language, drive the development of artificial intelligence (AI) web search agents.<n>We propose a general-purpose and training-free web search agent by level-aware navigation, Level-Navi Agent, accompanied by a well-annotated dataset (Web24) and a suitable evaluation metric.
arXiv Detail & Related papers (2024-12-20T08:03:12Z)
SRSA: A Cost-Efficient Strategy-Router Search Agent for Real-world Human-Machine Interactions [3.5725872564627785]
In real-world situations, users often input contextual and highly personalized queries to chatbots. Previous research has not focused specifically on authentic human-machine dialogue scenarios. To address these gaps, we propose a Strategy-based Search Agent (SRSA) SRSA routing different queries to appropriate search strategies and enabling fine-grained serial searches to obtain high-quality results at a relatively low cost.
arXiv Detail & Related papers (2024-11-21T20:41:55Z)
When Search Engine Services meet Large Language Models: Visions and Challenges [53.32948540004658]
This paper conducts an in-depth examination of how integrating Large Language Models with search engines can mutually benefit both technologies. We focus on two main areas: using search engines to improve LLMs (Search4LLM) and enhancing search engine functions using LLMs (LLM4Search)
arXiv Detail & Related papers (2024-06-28T03:52:13Z)
DCA-Bench: A Benchmark for Dataset Curation Agents [9.60250892491588]
Data quality issues, such as incomplete documentation, inaccurate labels, ethical concerns, and outdated information, remain common in widely used datasets.<n>With the surging ability of large language models (LLM), it's promising to streamline the discovery of hidden dataset issues with LLM agents.<n>In this work, we establish a benchmark to measure LLM agent's ability to tackle this challenge.
arXiv Detail & Related papers (2024-06-11T14:02:23Z)
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STARK, a large-scale Semi-structure retrieval benchmark on Textual and Knowledge Bases. Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine. We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.