How Far Are We from Genuinely Useful Deep Research Agents?
- URL: http://arxiv.org/abs/2512.01948v1
- Date: Mon, 01 Dec 2025 17:58:59 GMT
- Title: How Far Are We from Genuinely Useful Deep Research Agents?
- Authors: Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou, Boyu Feng, Shudong Liu, Jiabin Luo, Weihao Xie, Zhaohui Wang, Tianrui Qin, King Zhu, Yuqing Wang, Qianben Chen, Yuchen Eleanor Jiang, Wei Wang, Jiaheng Liu, Wangchunshu Zhou
- Abstract summary: Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. Current benchmarks for report synthesis suffer from task complexity and subjective metrics. We present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks.
- Score: 48.596990593729
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics, shortcomings that fail to reflect user demands and limit the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose the Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotation and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.
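The checklist protocol reduces report grading to a pass-rate computation over binary items. Below is a minimal sketch of that idea in Python; the item categories, the pluggable `judge` callable, and the toy substring matcher are illustrative assumptions, not FINDER's released implementation (the paper's 419 items are human-curated, and its verification procedure is not reproduced here).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChecklistItem:
    category: str   # e.g. "structure", "analytical_depth", "factual_grounding"
    criterion: str  # one binary requirement the report must satisfy

def score_report(report: str,
                 checklist: list[ChecklistItem],
                 judge: Callable[[str, str], bool]) -> dict[str, float]:
    """Fraction of checklist items satisfied, per category and overall."""
    hits: dict[str, list[bool]] = defaultdict(list)
    for item in checklist:
        hits[item.category].append(judge(report, item.criterion))
    scores = {cat: sum(v) / len(v) for cat, v in hits.items()}
    scores["overall"] = sum(sum(v) for v in hits.values()) / len(checklist)
    return scores

# Toy judge: substring matching stands in for an LLM- or human-based verifier.
items = [ChecklistItem("structure", "limitations"),
         ChecklistItem("factual_grounding", "[1]")]
report = "... Limitations: the sample is small [1] ..."
print(score_report(report, items, lambda r, c: c.lower() in r.lower()))
```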
Related papers
- DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation [56.886936435727854]
DeepResearchEval is an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking that autonomously extracts and verifies report statements via web search, even when citations are missing.
arXiv Detail & Related papers (2026-01-14T18:38:31Z)
- DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report [36.25273583677749]
We introduce Deep Research Bench II, a new benchmark for evaluating deep-research systems. For each task, a system must produce a long-form research report that is evaluated by a set of 9430 fine-grained binary rubrics. We evaluate several state-of-the-art deep-research systems on Deep Research Bench II and find that even the strongest models satisfy fewer than 50% of the rubrics.
arXiv Detail & Related papers (2026-01-13T13:18:39Z)
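Scoring against binary rubrics, as Deep Research Bench II does, boils down to the fraction of rubrics a report satisfies. A minimal sketch, assuming per-rubric verdicts have already been collected for each task (the benchmark's actual judging pipeline is not reproduced here):

```python
# Hypothetical sketch of rubric-based scoring: each task carries a list of
# binary rubrics, and a system's score is the fraction it satisfies overall.
# Names and data are illustrative, not Deep Research Bench II's code.

def rubric_satisfaction(results: dict[str, list[bool]]) -> float:
    """results maps task_id -> per-rubric pass/fail for one system's reports."""
    total = sum(len(v) for v in results.values())
    passed = sum(sum(v) for v in results.values())
    return passed / total if total else 0.0

# Toy example: two tasks, five rubrics in total, three satisfied.
results = {"task_01": [True, False, True], "task_02": [True, False]}
print(f"rubric satisfaction: {rubric_satisfaction(results):.0%}")  # 60%
```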
- DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing [53.85037373860246]
We introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization). Our results demonstrate that agentic plan-and-write approaches significantly outperform single-turn generation.
arXiv Detail & Related papers (2026-01-07T03:07:52Z)
- DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports [49.217247659479476]
Deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis. Existing benchmarks often lack systematic criteria for expert reporting. We introduce DEER, a benchmark for evaluating expert-level deep research reports.
arXiv Detail & Related papers (2025-12-19T16:46:20Z)
- A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports [24.09178055088843]
Deep Research Agents (DRAs) exhibit capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness.
arXiv Detail & Related papers (2025-10-02T16:40:02Z)
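A sketch of the integrated-scoring idea from the benchmark above: per-dimension scores for semantic quality, topical focus, and retrieval trustworthiness are combined into one report-level number. The weights and dimension keys here are assumptions for illustration, not the paper's published configuration.

```python
# Illustrative weighted aggregation of per-dimension report scores.
# DIMENSIONS and its weights are hypothetical, not the paper's values.
DIMENSIONS = {"semantic_quality": 0.4,
              "topical_focus": 0.3,
              "retrieval_trustworthiness": 0.3}

def integrated_score(scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each expected in [0, 1]."""
    assert set(scores) == set(DIMENSIONS), "score every dimension exactly once"
    return sum(DIMENSIONS[d] * s for d, s in scores.items())

report_scores = {"semantic_quality": 0.72,
                 "topical_focus": 0.85,
                 "retrieval_trustworthiness": 0.60}
print(f"integrated: {integrated_score(report_scores):.3f}")  # 0.723
```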
- DRBench: A Realistic Benchmark for Enterprise Deep Research [81.49694432639406]
DRBench is a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance.
arXiv Detail & Related papers (2025-09-30T18:47:20Z)
- WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research [73.58638285105971]
This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. We introduce WebWeaver, a novel dual-agent framework that emulates the human research process. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym.
arXiv Detail & Related papers (2025-09-16T17:57:21Z)
- DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence [50.97612134791782]
Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations.
arXiv Detail & Related papers (2025-09-02T00:32:38Z)
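One of DeepTRACE's auditable concerns is whether citations actually back the statements they are attached to. The sketch below computes a citation-support ratio under a hypothetical data model; the `Statement` fields and the notion of a verified supporting source are stand-ins for whatever extraction and verification the framework actually performs.

```python
from dataclasses import dataclass

@dataclass
class Statement:
    text: str
    cited_sources: list[str]  # URLs or IDs cited for this statement
    supported_by: set[str]    # sources verified (by human or model) to support it

def citation_support(statements: list[Statement]) -> float:
    """Fraction of cited statements with at least one supporting cited source."""
    cited = [s for s in statements if s.cited_sources]
    if not cited:
        return 0.0
    ok = sum(any(src in s.supported_by for src in s.cited_sources) for s in cited)
    return ok / len(cited)

# Toy audit: one statement is backed by its citation, one is not.
stmts = [Statement("GDP grew 3% in 2024", ["src_a"], {"src_a"}),
         Statement("Adoption doubled last year", ["src_b"], set())]
print(f"citation support: {citation_support(stmts):.0%}")  # 50%
```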
- ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks [14.371010711040304]
ReportBench is a benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports.
arXiv Detail & Related papers (2025-08-14T03:33:43Z)
- Characterizing Deep Research: A Benchmark and Formal Definition [24.523394260858822]
We propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process.
arXiv Detail & Related papers (2025-08-06T08:09:28Z)
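The "high fan-out over concepts" notion suggests a simple measurable proxy: represent a research run as a tree of queries or concepts and count the distinct concepts branched into from the root task. The tree encoding and the toy example below are illustrative assumptions, not the paper's formal definition.

```python
# Sketch of concept fan-out: breadth of the concept tree explored from a root
# query. The adjacency-dict encoding is a hypothetical stand-in.

def concept_fanout(tree: dict[str, list[str]], root: str) -> int:
    """Number of distinct concepts reachable from the root query."""
    seen: set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return len(seen)

# A broad research task branches into many sub-concepts; a lookup task does not.
broad = {"EV market": ["battery supply", "charging infra", "policy"],
         "battery supply": ["lithium", "recycling"]}
print(concept_fanout(broad, "EV market"))  # 5
```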
- Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. We build it using a synthetic data pipeline that simulates business activity across product planning, development, and support stages.
arXiv Detail & Related papers (2025-06-29T08:34:59Z)
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents [30.768405850755602]
DeepResearch Bench is a benchmark consisting of 100 PhD-level research tasks. Evaluating Deep Research Agents is inherently complex and labor-intensive. We propose two novel methodologies that achieve strong alignment with human judgment.
arXiv Detail & Related papers (2025-06-13T13:17:32Z)