DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
- URL: http://arxiv.org/abs/2508.20033v1
- Date: Wed, 27 Aug 2025 16:36:34 GMT
- Title: DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
- Authors: Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, Carlos Guestrin
- Abstract summary: We introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis systems. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task. We also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API.
- Score: 52.636738269442766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to research and synthesize knowledge is central to human expertise and progress. An emerging class of systems promises these exciting capabilities through generative research synthesis, performing retrieval over the live web and synthesizing discovered sources into long-form, cited summaries. However, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short-form factual responses, while expert-curated datasets risk staleness and data contamination. Both fail to capture the complexity and evolving nature of real research synthesis tasks. In this work, we introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating the related work sections of a paper by retrieving, synthesizing, and citing prior research. Our evaluation framework holistically assesses performance across three key dimensions: knowledge synthesis, retrieval quality, and verifiability. We also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API. Using the DeepScholar-bench framework, we perform a systematic evaluation of prior open-source systems, search AIs, OpenAI's DeepResearch, and DeepScholar-base. We find that DeepScholar-base establishes a strong baseline, attaining performance competitive with or higher than each other method. We also find that DeepScholar-bench remains far from saturated, with no system exceeding a score of $19\%$ across all metrics. These results underscore the difficulty of DeepScholar-bench, as well as its importance for progress towards AI systems capable of generative research synthesis. We make our code available at https://github.com/guestrin-lab/deepscholar-bench.
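For context, the retrieve-then-synthesize pattern the abstract describes can be sketched with LOTUS-style semantic operators over a DataFrame of candidate sources. This is a minimal sketch, not the DeepScholar-base implementation: the model name, prompts, and data below are illustrative assumptions.

```python
# Minimal sketch of a retrieve-then-synthesize pipeline using LOTUS-style
# semantic operators. Model name, prompts, and data are illustrative
# placeholders; this is NOT the DeepScholar-base code.
import pandas as pd
import lotus
from lotus.models import LM

lotus.settings.configure(lm=LM(model="gpt-4o-mini"))  # any supported LLM

# Candidate sources, e.g. abstracts returned by a web/ArXiv search step.
papers = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "abstract": ["...", "...", "..."],
})

query = "benchmarks for generative research synthesis"

# Filter to relevant sources, rank them, then aggregate into a cited summary.
relevant = papers.sem_filter(f"The {{abstract}} is relevant to: {query}")
ranked = relevant.sem_topk(f"Which {{abstract}} is most relevant to: {query}?", K=2)
related_work = ranked.sem_agg(
    "Write a short related-work paragraph synthesizing the {abstract}s, "
    "citing each source by its {title}."
)
print(related_work)
```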
Related papers
- DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing [53.85037373860246]
We introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization). Our results demonstrate that agentic plan-and-write approaches significantly outperform single-turn generation.
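A hedged sketch of what checklist-based scoring could look like; the judge function and checklist items below are hypothetical placeholders, not the DeepSynth-Eval protocol itself.

```python
# Hypothetical sketch of checklist-based report scoring in the spirit of
# General/Constraint Checklists; `llm_judge` is a placeholder predicate,
# not the DeepSynth-Eval implementation.
from typing import Callable

def checklist_score(report: str, items: list[str],
                    llm_judge: Callable[[str, str], bool]) -> float:
    """Fraction of checklist items the report satisfies."""
    if not items:
        return 0.0
    hits = sum(llm_judge(report, item) for item in items)
    return hits / len(items)

general = ["Mentions dataset X", "States the main finding of paper Y"]
constraints = ["Has a dedicated limitations section", "Uses numbered citations"]

# coverage = checklist_score(report, general, llm_judge)      # factual coverage
# structure = checklist_score(report, constraints, llm_judge) # organization
```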
arXiv Detail & Related papers (2026-01-07T03:07:52Z)
- Step-DeepResearch Technical Report [90.50586290399683]
We introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing. To bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios.
arXiv Detail & Related papers (2025-12-23T16:32:27Z)
- Understanding DeepResearch via Reports [41.60038455664918]
DeepResearch is a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. Evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities. We introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports.
arXiv Detail & Related papers (2025-10-09T07:03:43Z)
- WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents [72.28593628378991]
WebResearcher is an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process. WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.
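The MDP framing can be pictured as a simple agent loop: the state is a working report plus memory, actions are tool calls, and each step transitions the state. The state fields, actions, and tool calls below are illustrative assumptions, not WebResearcher's actual interface.

```python
# Illustrative deep-research loop framed as an MDP. All names here are
# hypothetical placeholders, not WebResearcher's API.
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    report: str = ""                                  # evolving synthesized report
    memory: list[str] = field(default_factory=list)   # condensed findings

def step(state: ResearchState, action: str, arg: str) -> ResearchState:
    if action == "search":
        state.memory.append(f"results for: {arg}")    # placeholder tool call
    elif action == "write":
        state.report += arg + "\n"
    return state                                      # next state of the process

state = ResearchState()
state = step(state, "search", "live benchmarks for research synthesis")
state = step(state, "write", "Recent benchmarks include ...")
```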
arXiv Detail & Related papers (2025-09-16T17:57:17Z)
- Open Data Synthesis For Deep Research [17.22470203913576]
We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems. Existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity. We introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks.
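One way to picture the hierarchical constraint satisfaction framing: a question is a tree of constraints, and a candidate answer is valid only if it satisfies every node. The representation below is a hypothetical illustration, not InfoSeek's actual format.

```python
# Hypothetical illustration of a Deep Research question as a hierarchical
# constraint satisfaction problem: an answer must satisfy a constraint and
# all of its sub-constraints.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Constraint:
    check: Callable[[str], bool]                      # predicate over a candidate answer
    children: list["Constraint"] = field(default_factory=list)

def satisfies(answer: str, c: Constraint) -> bool:
    return c.check(answer) and all(satisfies(answer, ch) for ch in c.children)

# "A physicist who won a Nobel Prize"
question = Constraint(
    check=lambda a: "physicist" in a.lower(),
    children=[Constraint(check=lambda a: "nobel" in a.lower())],
)
print(satisfies("Enrico Fermi, Nobel-winning physicist", question))  # True
```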
arXiv Detail & Related papers (2025-08-30T06:02:56Z)
- DeepResearch$^{\text{Eco}}$: A Recursive Agentic Workflow for Complex Scientific Question Answering in Ecology [0.0]
DeepResearch$^{\text{Eco}}$ is a novel agentic LLM-based system for automated scientific synthesis. It supports depth- and breadth-controlled exploration of original research questions. DeepResearch$^{\text{Eco}}$ achieves up to a 21-fold increase in source integration.
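A sketch of depth- and breadth-controlled exploration; the expansion function is a placeholder for an LLM decomposition step, and this illustrates only the control knobs, not the DeepResearch$^{\text{Eco}}$ workflow.

```python
# Illustrative depth/breadth-controlled exploration: at each level, expand a
# question into at most `breadth` sub-questions, recursing up to `depth`.
# `expand` is a hypothetical placeholder for an LLM decomposition step.
def explore(question: str, depth: int, breadth: int,
            expand=lambda q, k: [f"{q} / subtopic {i}" for i in range(k)]) -> list[str]:
    if depth == 0:
        return [question]
    visited = [question]
    for sub in expand(question, breadth):
        visited += explore(sub, depth - 1, breadth, expand)
    return visited

# Wider/deeper settings integrate more sources at higher cost.
print(len(explore("drivers of pollinator decline", depth=2, breadth=3)))
```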
arXiv Detail & Related papers (2025-07-14T17:47:28Z)
- Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. We build it using a synthetic data pipeline that simulates a business across product planning, development, and support stages.
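A minimal sketch of source-aware multi-hop retrieval: each hop may query a different source, and the evidence trail keeps provenance. The sources and retrieval function are hypothetical placeholders, not this benchmark's pipeline.

```python
# Hypothetical multi-hop, source-aware retrieval: each hop may query a
# different enterprise source, and every piece of evidence keeps provenance.
def retrieve(source: str, query: str) -> str:
    return f"[{source}] snippet for '{query}'"   # placeholder retriever

def multi_hop(query: str, hops: list[str]) -> list[tuple[str, str]]:
    evidence = []
    for source in hops:                          # e.g. docs -> tickets -> chats
        snippet = retrieve(source, query)
        evidence.append((source, snippet))       # record provenance per hop
        query = snippet                          # next hop conditions on new evidence
    return evidence

trail = multi_hop("why did the release slip?",
                  ["planning-docs", "issue-tracker", "support-chats"])
```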
arXiv Detail & Related papers (2025-06-29T08:34:59Z)
- From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents [96.65646344634524]
Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking.
arXiv Detail & Related papers (2025-06-23T17:27:19Z)
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents [30.768405850755602]
DeepResearch Bench is a benchmark consisting of 100 PhD-level research tasks. Evaluating Deep Research Agents is inherently complex and labor-intensive. We propose two novel methodologies that achieve strong alignment with human judgment.
arXiv Detail & Related papers (2025-06-13T13:17:32Z)
- DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research [25.368303145176554]
DeepResearchGym is an open-source sandbox that combines a search API with a rigorous evaluation protocol for benchmarking deep research systems. The API indexes large-scale public web corpora, namely ClueWeb22 and FineWeb, using a state-of-the-art dense retriever and approximate nearest neighbor search via DiskANN. It achieves lower latency than popular commercial APIs while ensuring stable document rankings across runs, and is freely available for research use.
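The dense-retrieval-plus-ANN recipe can be sketched with FAISS standing in for DiskANN; the random vectors below stand in for a real dense encoder over corpora like ClueWeb22/FineWeb.

```python
# Sketch of dense retrieval with approximate nearest neighbor search.
# FAISS's HNSW index stands in for DiskANN, and random vectors stand in
# for real document embeddings.
import numpy as np
import faiss

dim, n_docs = 128, 10_000
doc_vecs = np.random.rand(n_docs, dim).astype("float32")
faiss.normalize_L2(doc_vecs)          # cosine similarity via inner product

index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)
index.add(doc_vecs)                   # build the graph-based ANN index

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, doc_ids = index.search(query, 10)   # top-10 approximate neighbors
```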
arXiv Detail & Related papers (2025-05-25T18:16:13Z)
- SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis [89.99161034065614]
Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios. Existing approaches face critical limitations: they lack high-quality training trajectories and suffer from distributional mismatches. This paper introduces SimpleDeepSearcher, a framework that bridges the gap through strategic data engineering rather than complex training paradigms.
arXiv Detail & Related papers (2025-05-22T16:05:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.