DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report
- URL: http://arxiv.org/abs/2601.08536v1
- Date: Tue, 13 Jan 2026 13:18:39 GMT
- Title: DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report
- Authors: Ruizhe Li, Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao
- Abstract summary: We introduce Deep Research Bench II, a new benchmark for evaluating deep-research systems. For each task, a system must produce a long-form research report that is evaluated against fine-grained binary rubrics (9,430 across the benchmark in total). We evaluate several state-of-the-art deep-research systems on Deep Research Bench II and find that even the strongest models satisfy fewer than 50% of the rubrics.
- Score: 36.25273583677749
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep Research Systems (DRS) aim to help users search the web, synthesize information, and deliver comprehensive investigative reports. However, how to rigorously evaluate these systems remains under-explored. Existing deep-research benchmarks often fall into two failure modes. Some do not adequately test a system's ability to analyze evidence and write coherent reports. Others rely on evaluation criteria that are either overly coarse or defined directly by LLMs (or both), leading to scores that can be biased relative to human experts and are hard to verify or interpret. To address these issues, we introduce Deep Research Bench II, a new benchmark for evaluating DRS-generated reports. It contains 132 grounded research tasks across 22 domains; for each task, a system must produce a long-form research report that is evaluated against fine-grained binary rubrics (9,430 in total across the benchmark), covering three dimensions: information recall, analysis, and presentation. All rubrics are derived from carefully selected expert-written investigative articles and are constructed through a four-stage LLM+human pipeline that combines automatic extraction with over 400 human-hours of expert review, ensuring that the criteria are atomic, verifiable, and aligned with human expert judgment. We evaluate several state-of-the-art deep-research systems on Deep Research Bench II and find that even the strongest models satisfy fewer than 50% of the rubrics, revealing a substantial gap between current DRSs and human experts.
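To make the scoring protocol concrete, here is a minimal sketch of rubric-based report scoring under assumed data structures: the `Rubric` fields and the per-dimension aggregation are illustrative guesses, not the benchmark's actual schema or judge.

```python
# Sketch of binary-rubric scoring; the Rubric schema is hypothetical.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Rubric:
    task_id: str     # which of the 132 tasks this rubric belongs to
    dimension: str   # "information recall" | "analysis" | "presentation"
    criterion: str   # an atomic, verifiable yes/no check on the report
    satisfied: bool  # binary judgment for a given system's report

def score_report(rubrics: list[Rubric]) -> dict[str, float]:
    """Fraction of rubrics satisfied, overall and per dimension."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in rubrics:
        total[r.dimension] += 1
        passed[r.dimension] += int(r.satisfied)
    scores = {dim: passed[dim] / total[dim] for dim in total}
    scores["overall"] = sum(passed.values()) / sum(total.values())
    return scores
```

Under this scheme, the paper's headline result corresponds to `scores["overall"] < 0.5` for every evaluated system.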
Related papers
- Can Deep Research Agents Find and Organize? Evaluating the Synthesis Gap with Expert Taxonomies [57.11324429385405]
We introduce TaxoBench, a diagnostic benchmark derived from 72 computer science surveys. We manually extract expert-authored taxonomy trees containing 3,815 precisely categorized citations as ground truth. The best agent recalls only 20.9% of expert-selected papers, and even with perfect input, the best model achieves only 0.31 ARI in organization (see the ARI sketch after this entry).
arXiv Detail & Related papers (2026-01-18T11:57:09Z)
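As a point of reference for the 0.31 ARI figure above, here is a minimal sketch of how the Adjusted Rand Index scores an agent's grouping of papers against an expert's. Flattening taxonomy nodes into flat cluster labels is a simplifying assumption for illustration; TaxoBench compares expert-authored trees, and the labels below are invented.

```python
# Minimal ARI illustration with invented labels (not TaxoBench data).
# ARI = 1.0 means the agent's grouping matches the expert's exactly;
# values near 0.0 indicate chance-level agreement.
from sklearn.metrics import adjusted_rand_score

# Expert-assigned category per paper (flattened from a taxonomy tree).
expert = ["retrieval", "retrieval", "synthesis", "synthesis", "evaluation"]
# Agent-assigned category for the same five papers.
agent = ["retrieval", "synthesis", "synthesis", "synthesis", "evaluation"]

print(f"ARI = {adjusted_rand_score(expert, agent):.2f}")
```

ARI corrects raw label agreement for chance, so it is invariant to how clusters are named; only the grouping structure matters.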
- DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports [49.217247659479476]
Deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis. Existing benchmarks often lack systematic criteria for expert reporting. We introduce DEER, a benchmark for evaluating expert-level deep research reports.
arXiv Detail & Related papers (2025-12-19T16:46:20Z)
- How Far Are We from Genuinely Useful Deep Research Agents? [48.596990593729]
Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. Current benchmarks for report synthesis suffer from limited task complexity and subjective metrics. We present the Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks.
arXiv Detail & Related papers (2025-12-01T17:58:59Z)
- ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents [11.666923792025313]
Deep Research (DR) is an emerging agent application that leverages large language models to address open-ended queries. We introduce ResearchRubrics, a standardized benchmark for DR built with over 2,800 hours of human labor. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration.
arXiv Detail & Related papers (2025-11-10T23:07:14Z)
- LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild [86.6586720134927]
LiveResearchBench is a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia. DeepEval is a comprehensive suite covering both content- and report-level quality. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
arXiv Detail & Related papers (2025-10-16T02:49:16Z)
- Understanding DeepResearch via Reports [41.60038455664918]
DeepResearch is a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. Evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities. We introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports.
arXiv Detail & Related papers (2025-10-09T07:03:43Z)
- DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence [50.97612134791782]
Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations (a citation-support sketch follows this entry).
arXiv Detail & Related papers (2025-09-02T00:32:38Z)
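To ground one of those dimensions, below is a minimal sketch of a citation-support rate: the fraction of audited claim-citation pairs whose cited source actually supports the claim. The record layout and the toy entries are assumptions for illustration, not DeepTRACE's actual schema or data.

```python
# Hypothetical citation-audit records; "supported" stands in for a human or
# model judgment that the cited source backs the claim. Illustrative only.
def citation_support_rate(records: list[dict]) -> float:
    """Fraction of claim-citation pairs judged supported by their source."""
    if not records:
        return 0.0
    return sum(r["supported"] for r in records) / len(records)

audit = [
    {"claim": "Revenue grew 20% in 2024", "supported": True},
    {"claim": "The method was first proposed in 2019", "supported": False},
]
print(citation_support_rate(audit))  # 0.5 for this toy audit
```

A per-citation score like this is individually verifiable, which is what distinguishes an audit framework from a single holistic quality rating.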
- BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent [74.10138164281618]
BrowseComp-Plus is a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods.
arXiv Detail & Related papers (2025-08-08T17:55:11Z)
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents [30.768405850755602]
DeepResearch Bench is a benchmark consisting of 100 PhD-level research tasks. Evaluating Deep Research Agents is inherently complex and labor-intensive. We propose two novel methodologies that achieve strong alignment with human judgment.
arXiv Detail & Related papers (2025-06-13T13:17:32Z)