LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild
- URL: http://arxiv.org/abs/2510.14240v1
- Date: Thu, 16 Oct 2025 02:49:16 GMT
- Title: LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild
- Authors: Jiayu Wang, Yifei Ming, Riya Dulepet, Qinglin Chen, Austin Xu, Zixuan Ke, Frederic Sala, Aws Albarghouthi, Caiming Xiong, Shafiq Joty
- Abstract summary: LiveResearchBench is a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia. DeepEval is a comprehensive suite covering both content- and report-level quality. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
- Score: 86.6586720134927
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep research -- producing comprehensive, citation-grounded reports by searching and synthesizing information from hundreds of live web sources -- marks an important frontier for agentic systems. To rigorously evaluate this ability, four principles are essential: tasks should be (1) user-centric, reflecting realistic information needs, (2) dynamic, requiring up-to-date information beyond parametric knowledge, (3) unambiguous, ensuring consistent interpretation across users, and (4) multi-faceted and search-intensive, requiring search over numerous web sources and in-depth analysis. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we introduce DeepEval, a comprehensive suite covering both content- and report-level quality, including coverage, presentation, citation accuracy and association, consistency and depth of analysis. DeepEval integrates four complementary evaluation protocols, each designed to ensure stable assessment and high agreement with human judgments. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of 17 frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
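The abstract names DeepEval's evaluation dimensions but not how they are combined. As a minimal sketch, assuming a simple weighted aggregation (the dataclass, field names, and weights below are illustrative assumptions, not the paper's implementation), the content- and report-level dimensions might be rolled up into a single report score like this:

```python
# Hypothetical DeepEval-style score aggregation. Dimension names come from the
# abstract; the dataclass, weights, and weighted-mean rule are assumptions.
from dataclasses import dataclass

@dataclass
class ReportScores:
    coverage: float              # content-level: breadth of relevant findings (0-1)
    presentation: float          # report-level: structure and readability (0-1)
    citation_accuracy: float     # do cited sources support their claims? (0-1)
    citation_association: float  # are claims attached to the right citations? (0-1)
    consistency: float           # internal agreement across the report (0-1)
    depth_of_analysis: float     # insight beyond surface summarization (0-1)

# Illustrative weights; DeepEval's four protocols may combine dimensions differently.
WEIGHTS = {
    "coverage": 0.25, "presentation": 0.10, "citation_accuracy": 0.20,
    "citation_association": 0.15, "consistency": 0.10, "depth_of_analysis": 0.20,
}

def aggregate(scores: ReportScores) -> float:
    """Weighted mean over the evaluation dimensions."""
    return sum(getattr(scores, dim) * w for dim, w in WEIGHTS.items())

print(aggregate(ReportScores(0.8, 0.9, 0.7, 0.6, 0.9, 0.5)))  # -> 0.705
```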
Related papers
- DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation [56.886936435727854]
DeepResearchEval is an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline that generates realistic, complex research tasks anchored in diverse user profiles. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking module that autonomously extracts and verifies report statements via web search, even when citations are missing.
arXiv Detail & Related papers (2026-01-14T18:38:31Z)
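The Active Fact-Checking component above is described only at a high level. A minimal sketch of such a loop, where `extract_statements`, `web_search`, and `supports` are hypothetical stand-ins rather than the paper's components:

```python
# Sketch of an active fact-checking pass: pull statements out of a report and
# verify each against freshly retrieved evidence, independent of citations.
from typing import Callable

def fact_check(report: str,
               extract_statements: Callable[[str], list[str]],
               web_search: Callable[[str], list[str]],
               supports: Callable[[str, list[str]], bool]) -> dict[str, bool]:
    """Return a verdict per extracted statement, even when citations are missing."""
    verdicts: dict[str, bool] = {}
    for stmt in extract_statements(report):
        evidence = web_search(stmt)                 # retrieve candidate sources
        verdicts[stmt] = supports(stmt, evidence)   # e.g. an LLM entailment judge
    return verdicts
```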
- DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report [36.25273583677749]
We introduce DeepResearch Bench II, a new benchmark for evaluating deep-research systems. For each task, a system must produce a long-form research report that is evaluated against a set of 9430 fine-grained binary rubrics. We evaluate several state-of-the-art deep-research systems on DeepResearch Bench II and find that even the strongest models satisfy fewer than 50% of the rubrics.
arXiv Detail & Related papers (2026-01-13T13:18:39Z)
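The rubric protocol above reduces to a satisfaction rate over binary checks. A minimal sketch, where the `Rubric` type and judge callables are assumptions rather than the benchmark's actual code:

```python
# Sketch of rubric-based scoring: each rubric is a binary check, and a system's
# score on a task is the fraction of rubrics its report satisfies.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rubric:
    criterion: str                  # human-readable requirement being checked
    judge: Callable[[str], bool]    # binary check applied to the report text

def rubric_score(report: str, rubrics: list[Rubric]) -> float:
    """Fraction of binary rubrics satisfied; top systems reportedly stay below 0.5."""
    return sum(r.judge(report) for r in rubrics) / len(rubrics)
```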
- Understanding DeepResearch via Reports [41.60038455664918]
DeepResearch is a transformative AI paradigm that conducts expert-level research through sophisticated reasoning and multi-tool integration. Evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities. We introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports.
arXiv Detail & Related papers (2025-10-09T07:03:43Z)
- Towards Personalized Deep Research: Benchmarks and Evaluations [56.581105664044436]
We introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in Deep Research Agents (DRAs). It pairs 50 diverse research tasks with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research.
arXiv Detail & Related papers (2025-09-29T17:39:17Z)
- WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents [72.28593628378991]
WebResearcher is an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process. WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.
arXiv Detail & Related papers (2025-09-16T17:57:17Z)
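The Markov Decision Process framing above can be pictured as a loop whose state is the agent's accumulated research memory and whose actions are tool calls or a terminal synthesis step. A minimal sketch, with the policy and tool interfaces as assumptions rather than WebResearcher's actual design:

```python
# Sketch of deep research as an MDP-style episode: the policy maps the current
# state to an action; tool results are folded back into the state until a
# terminal "finish" action emits the report.
def research_episode(task: str, policy, tools, max_steps: int = 20) -> str:
    state = {"task": task, "findings": []}              # state: evolving memory
    for _ in range(max_steps):
        action = policy(state)                          # pick the next action
        if action["type"] == "finish":
            return action["report"]                     # terminal synthesis step
        result = tools[action["type"]](action["args"])  # e.g. search or browse
        state["findings"].append(result)                # state transition
    # Step budget exhausted: ask the policy for a forced final synthesis.
    return policy({**state, "force_finish": True})["report"]
```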
- DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence [50.97612134791782]
Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations.
arXiv Detail & Related papers (2025-09-02T00:32:38Z)
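One of DeepTRACE's eight dimensions concerns whether cited sources actually support the claims they are attached to. A minimal sketch of such a check, assuming hypothetical claim-source pairs and an entailment judge (neither taken from the paper):

```python
# Sketch of a citation-support audit: the score is the fraction of cited
# claims whose attached source text entails the claim.
from typing import Callable

def citation_support_rate(pairs: list[tuple[str, str]],
                          entails: Callable[[str, str], bool]) -> float:
    """pairs: (claim, cited_source_text); entails(source, claim) -> bool."""
    if not pairs:
        return 0.0
    return sum(entails(source, claim) for claim, source in pairs) / len(pairs)
```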
- BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent [74.10138164281618]
BrowseComp-Plus is a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods.
arXiv Detail & Related papers (2025-08-08T17:55:11Z)
- Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge [34.672897171399775]
Agentic search systems autonomously browse the web, synthesize information, and return comprehensive, citation-backed answers. Mind2Web 2 is a benchmark of 130 realistic, high-quality, long-horizon tasks constructed with over 1,000 hours of human labor. Our method constructs task-specific judge agents based on a tree-structured design to automatically assess both answer correctness and source attribution.
arXiv Detail & Related papers (2025-06-26T17:32:50Z)
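The tree-structured judge design above can be sketched as a recursive scorer: leaves check individual task requirements and internal nodes aggregate their children. The node layout and aggregation rules are assumptions, not Mind2Web 2's actual implementation:

```python
# Sketch of a tree-structured judge: leaves hold one verifiable criterion each;
# internal nodes average their children or require all of them to pass.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class JudgeNode:
    check: Optional[Callable[[str], bool]] = None    # leaf: one binary criterion
    children: list["JudgeNode"] = field(default_factory=list)
    aggregate: str = "mean"                          # internal: "mean" or "all"

    def score(self, answer: str) -> float:
        if self.check is not None:                   # leaf: run the check
            return float(self.check(answer))
        parts = [child.score(answer) for child in self.children]
        return min(parts) if self.aggregate == "all" else sum(parts) / len(parts)
```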
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents [30.768405850755602]
DeepResearch Bench is a benchmark consisting of 100 PhD-level research tasks. Evaluating Deep Research Agents is inherently complex and labor-intensive, so we propose two novel methodologies that achieve strong alignment with human judgment.
arXiv Detail & Related papers (2025-06-13T13:17:32Z)