DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports
- URL: http://arxiv.org/abs/2512.17776v1
- Date: Fri, 19 Dec 2025 16:46:20 GMT
- Title: DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports
- Authors: Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song, Stanley Jungkyu Choi, Moontae Lee, Honglak Lee
- Abstract summary: Deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis. Existing benchmarks often lack systematic criteria for expert reporting. We introduce DEER, a benchmark for evaluating expert-level deep research reports.
- Score: 49.217247659479476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting; evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment; and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.
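The abstract describes two complementary mechanisms: rubric-based scoring by an LLM judge steered by task-specific expert guidance, and a document-level fact-checking pass over every claim in the report. The sketch below illustrates the general shape of such a pipeline; it is not the authors' implementation, and the hooks `llm_judge`, `extract_claims`, and `verify_claim` are hypothetical stand-ins for model and retrieval calls the paper does not specify here.

```python
"""Minimal sketch (not the authors' code) of the two DEER evaluation stages:
(1) rubric-based scoring by an LLM judge and (2) document-level fact-checking
over every claim in a report, cited or uncited."""
from dataclasses import dataclass
from typing import Callable


@dataclass
class RubricItem:
    dimension: str      # one of the 7 top-level dimensions
    sub_dimension: str  # one of the 25 sub-dimensions
    criterion: str      # one of the 130 fine-grained rubric items


def score_report(report: str,
                 rubric: list[RubricItem],
                 expert_guidance: str,
                 llm_judge: Callable[[str, str, str], float]) -> dict[str, float]:
    """Average judge scores per dimension; the task-specific expert guidance
    is passed to the judge to keep its assessments consistent."""
    by_dim: dict[str, list[float]] = {}
    for item in rubric:
        score = llm_judge(report, item.criterion, expert_guidance)  # e.g. in [0, 1]
        by_dim.setdefault(item.dimension, []).append(score)
    return {dim: sum(scores) / len(scores) for dim, scores in by_dim.items()}


def factual_reliability(report: str,
                        extract_claims: Callable[[str], list[str]],
                        verify_claim: Callable[[str], bool]) -> float:
    """Document-level fact check: extract *all* claims (cited and uncited)
    and return the fraction that external evidence supports."""
    claims = extract_claims(report)
    if not claims:
        return 1.0  # vacuously reliable; a real system would flag this case
    verified = sum(verify_claim(claim) for claim in claims)
    return verified / len(claims)
```

Keeping the per-dimension rubric averages separate from the report-wide verified-claim fraction mirrors how the abstract separates expert-quality assessment from factual reliability.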
Related papers
- DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report [36.25273583677749]
We introduce DeepResearch Bench II, a new benchmark for evaluating deep-research systems. For each task, a system must produce a long-form research report that is evaluated by a set of 9430 fine-grained binary rubrics. We evaluate several state-of-the-art deep-research systems on DeepResearch Bench II and find that even the strongest models satisfy fewer than 50% of the rubrics (a satisfaction-rate computation of this kind is sketched after this list).
arXiv Detail & Related papers (2026-01-13T13:18:39Z)
- How Far Are We from Genuinely Useful Deep Research Agents? [48.596990593729]
Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. Current benchmarks for report synthesis are limited by task complexity and subjective metrics. We present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks.
arXiv Detail & Related papers (2025-12-01T17:58:59Z)
- LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild [86.6586720134927]
LiveResearchBench is a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia. DeepEval is a comprehensive suite covering both content- and report-level quality. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
arXiv Detail & Related papers (2025-10-16T02:49:16Z)
- Towards Real-Time Fake News Detection under Evidence Scarcity [66.58597356379907]
We propose Evaluation-Aware Selection of Experts (EASE), a novel framework for real-time fake news detection. EASE adapts its decision-making process according to the assessed sufficiency of available evidence. We introduce RealTimeNews-25, a new benchmark for evaluating model generalization on emerging news with limited evidence.
arXiv Detail & Related papers (2025-10-13T11:11:46Z)
- A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports [24.09178055088843]
Deep Research Agents (DRAs) exhibit capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness.
arXiv Detail & Related papers (2025-10-02T16:40:02Z)
- Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges [72.3356133063925]
The paradigm of large language models (LLMs) as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals.
arXiv Detail & Related papers (2025-09-03T15:48:33Z)
- ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks [14.371010711040304]
ReportBench is a benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports.
arXiv Detail & Related papers (2025-08-14T03:33:43Z)
- Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z)
- VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains [19.579511315215424]
Large language models rely on reinforcement learning to enhance their reasoning capabilities through feedback. Existing research focuses on building better verifiers, yet a systematic evaluation of how different types of verifiers perform remains lacking. We construct 4,000 expert-level questions covering mathematics, physics, chemistry, and biology. Each question is equipped with reference answers and diverse responses (a verifier-accuracy computation over such data is sketched after this list).
arXiv Detail & Related papers (2025-07-14T03:45:24Z)
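For the DeepResearch Bench II entry above, the headline metric is the fraction of fine-grained binary rubrics a report satisfies, with even the strongest systems staying below 50%. A minimal sketch, assuming a yes/no LLM-judge protocol; `judge_binary` is a hypothetical hook, not the benchmark's API:

```python
"""Minimal sketch of a binary-rubric satisfaction rate, in the spirit of
DeepResearch Bench II. The judge call is a hypothetical stand-in."""
from typing import Callable


def rubric_satisfaction_rate(report: str,
                             rubrics: list[str],
                             judge_binary: Callable[[str, str], bool]) -> float:
    """Fraction of binary rubrics the report satisfies; the paper reports
    that state-of-the-art systems score below 0.5 on this metric."""
    if not rubrics:
        raise ValueError("empty rubric set")
    return sum(judge_binary(report, rubric) for rubric in rubrics) / len(rubrics)
```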
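For the VerifyBench entry, verifiers are themselves the object of evaluation: questions come with reference answers and diverse responses, so a verifier can be scored by how often its verdict matches the ground-truth correctness of each response. A minimal sketch under that assumption; the field names and the `verifier` signature are hypothetical:

```python
"""Minimal sketch of verifier evaluation in the spirit of VerifyBench:
score a verifier by agreement with ground-truth correctness labels."""
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    question: str
    reference_answer: str
    response: str
    is_correct: bool  # ground-truth label for the response


def verifier_accuracy(examples: list[Example],
                      verifier: Callable[[str, str, str], bool]) -> float:
    """Fraction of examples where the verifier's verdict on
    (question, reference, response) matches the ground-truth label."""
    if not examples:
        raise ValueError("empty example set")
    hits = sum(verifier(e.question, e.reference_answer, e.response) == e.is_correct
               for e in examples)
    return hits / len(examples)
```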
This list is automatically generated from the titles and abstracts of the papers listed on this site.