Related papers: DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing

DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing

URL: http://arxiv.org/abs/2601.03540v1
Date: Wed, 07 Jan 2026 03:07:52 GMT
Title: DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing
Authors: Hongzhi Zhang, Yuanze Hu, Tinghai Zhang, Jia Fu, Tao Wang, Junwei Jing, Zhaoxin Fan, Qi Wang, Ruiming Tang, Han Li, Guorui Zhou, Kun Gai,
Abstract summary: We introduce Deep Synth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities.<n>We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization)<n>Our results demonstrate that agentic plan-and-write significantly outperform single-turn generation.
Score: 53.85037373860246
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The evolution of Large Language Models (LLMs) towards autonomous agents has catalyzed progress in Deep Research. While retrieval capabilities are well-benchmarked, the post-retrieval synthesis stage--where agents must digest massive amounts of context and consolidate fragmented evidence into coherent, long-form reports--remains under-evaluated due to the subjectivity of open-ended writing. To bridge this gap, we introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We leverage high-quality survey papers as gold standards, reverse-engineering research requests and constructing "Oracle Contexts" from their bibliographies to isolate synthesis from retrieval noise. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization), transforming subjective judgment into verifiable metrics. Experiments across 96 tasks reveal that synthesizing information from hundreds of references remains a significant challenge. Our results demonstrate that agentic plan-and-write workflows significantly outperform single-turn generation, effectively reducing hallucinations and improving adherence to complex structural constraints.

Related papers

AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research [85.51475655916026]
AgentCPM-Report is a lightweight yet high-performing local solution composed of a framework that mirrors the human writing process.<n>Our framework uses a Writing As Reasoning Policy (WARP), which enables models to dynamically revise outlines.<n>Experiments on DeepResearch Bench, DeepConsult, and DeepResearch Gym demonstrate that AgentCPM-Report outperforms leading closed-source systems.
arXiv Detail & Related papers (2026-02-06T09:45:04Z)
ScholarGym: Benchmarking Deep Research Workflows on Academic Literature Retrieval [11.41528830724814]
We present ScholarGym, a simulation environment for reproducible evaluation of deep research on academic literature.<n>Built on a static corpus of 570K papers with deterministic retrieval, ScholarGym provides 2,536 queries with expert-annotated ground truth.
arXiv Detail & Related papers (2026-01-29T12:51:44Z)
CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning [48.56088080889236]
We introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework.<n>We show that finetuning on our synthesized data effectively enhances an LLM's general long-context reasoning capabilities.<n>Our findings indicate that memory-augmented agentic architectures offer a more robust alternative.
arXiv Detail & Related papers (2026-01-21T12:52:30Z)
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation [56.886936435727854]
DeepResearchEval is an automated framework for deep research task construction and agentic evaluation.<n>For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles.<n>For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking that autonomously extracts and verifies report statements via web search, even when citations are missing.
arXiv Detail & Related papers (2026-01-14T18:38:31Z)
SynClaimEval: A Framework for Evaluating the Utility of Synthetic Data in Long-Context Claim Verification [1.740313383876245]
We introduce SynClaimEval, a framework for evaluating synthetic data utility in long-context claim verification.<n>Our framework examines three dimensions: (i) input characteristics, by varying context length and testing generalization to out-of-domain benchmarks; (ii) synthesis logic, by controlling claim complexity and error type variation; and (iii) explanation quality, measuring the degree to which model explanations provide evidence consistent with predictions.
arXiv Detail & Related papers (2025-11-12T18:36:59Z)
EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning [63.03672166010434]
We introduce an evolutionary, task-agnostic, strategy-guided, executably-checkable data synthesis framework.<n>It jointly synthesizes problems, diverse candidate solutions, and verification artifacts.<n>It iteratively discovers strategies via a consistency-based evaluator that enforces agreement between human-annotated and strategy-induced checks.
arXiv Detail & Related papers (2025-10-20T11:56:35Z)
A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports [24.09178055088843]
Deep Research Agents (DRAs) exhibit the capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output.<n>This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses.<n>The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness.
arXiv Detail & Related papers (2025-10-02T16:40:02Z)
WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research [73.58638285105971]
This paper tackles textbfopen-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports.<n>We introduce textbfWebWeaver, a novel dual-agent framework that emulates the human research process.<n>Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym.
arXiv Detail & Related papers (2025-09-16T17:57:21Z)
Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG)<n>RAG requires source-aware, multi-hop reasoning over diverse, sparsed, but related sources.<n>We build it using a synthetic data pipeline that simulates business across product planning, development, and support stages.
arXiv Detail & Related papers (2025-06-29T08:34:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.