DeepSurvey-Bench: Evaluating Academic Value of Automatically Generated Scientific Survey
- URL: http://arxiv.org/abs/2601.15307v1
- Date: Tue, 13 Jan 2026 14:42:56 GMT
- Title: DeepSurvey-Bench: Evaluating Academic Value of Automatically Generated Scientific Survey
- Authors: Guo-Biao Zhang, Ding-Yuan Liu, Da-Yi Wu, Tian Lan, Heyan Huang, Zhijing Wu, Xian-Ling Mao
- Abstract summary: DeepSurvey-Bench is a novel benchmark designed to comprehensively evaluate the academic value of generated surveys. We construct a reliable dataset with academic value annotations, and evaluate the deep academic value of the generated surveys.
- Score: 53.85391477976017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid development of automated scientific survey generation technology has made it increasingly important to establish a comprehensive benchmark for evaluating the quality of generated surveys. Nearly all existing evaluation benchmarks rely on flawed selection criteria, such as citation counts and structural coherence, to select human-written surveys as ground-truth datasets, and then use surface-level metrics such as structural quality and reference relevance to evaluate generated surveys. However, these benchmarks have two key issues: (1) the ground-truth survey datasets are unreliable because they lack academic-dimension annotations; (2) the evaluation metrics focus only on surface qualities of a survey, such as logical coherence. As a result, existing benchmarks cannot assess a survey's deeper "academic value", such as its core research objectives and its critical analysis of different studies. To address these problems, we propose DeepSurvey-Bench, a novel benchmark designed to comprehensively evaluate the academic value of generated surveys. Specifically, our benchmark proposes comprehensive academic-value evaluation criteria covering three dimensions: informational value, scholarly communication value, and research guidance value. Based on these criteria, we construct a reliable dataset with academic value annotations and evaluate the deep academic value of generated surveys. Extensive experimental results demonstrate that our benchmark is highly consistent with human judgments in assessing the academic value of generated surveys.
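To make the abstract's three evaluation dimensions concrete, the following is a minimal, hypothetical sketch of how per-survey scores along those dimensions might be represented and aggregated. It is not the benchmark's released scoring code: the class and field names, the 0-1 score scale, and the weighted-average aggregation are all assumptions for illustration.

```python
# Hypothetical sketch of DeepSurvey-Bench-style academic value scoring.
# All names, scales, and the aggregation scheme are assumptions, not the paper's actual implementation.
from dataclasses import dataclass


@dataclass
class AcademicValueScores:
    informational: float            # informational value of the surveyed content
    scholarly_communication: float  # scholarly communication value (how well findings are conveyed)
    research_guidance: float        # research guidance value (critical analysis, future directions)

    def overall(self, weights=(1.0, 1.0, 1.0)) -> float:
        """Weighted average across the three dimensions; equal weights are a placeholder choice."""
        w_i, w_c, w_g = weights
        total = w_i + w_c + w_g
        return (w_i * self.informational
                + w_c * self.scholarly_communication
                + w_g * self.research_guidance) / total


if __name__ == "__main__":
    # Example: scores for one generated survey, each on an assumed 0-1 scale.
    scores = AcademicValueScores(informational=0.72,
                                 scholarly_communication=0.65,
                                 research_guidance=0.58)
    print(f"overall academic value: {scores.overall():.3f}")
```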
Related papers
- DREAM: Deep Research Evaluation with Agentic Metrics [21.555357444628044]
We propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that makes evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks.
arXiv Detail & Related papers (2026-02-21T19:14:31Z) - Reward Modeling for Scientific Writing Evaluation [50.33952894976367]
It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks. We propose cost-efficient, open-source reward models tailored for scientific writing evaluation.
arXiv Detail & Related papers (2026-01-16T15:32:58Z) - SurveyBench: Can LLM(-Agents) Write Academic Surveys that Align with Reader Needs? [37.28508850738341]
Survey writing is a labor-intensive and intellectually demanding task. Recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically, but their outputs often fall short of human standards, and a rigorous, reader-aligned benchmark has been lacking. We propose SurveyBench, a fine-grained, quiz-driven evaluation framework.
arXiv Detail & Related papers (2025-10-03T15:49:09Z) - Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses [7.295969279816647]
Open-ended survey responses provide valuable insights in marketing research. Low-quality responses not only burden researchers with manual filtering but also risk leading to misleading conclusions. We propose a two-stage evaluation framework specifically designed for human survey responses.
arXiv Detail & Related papers (2025-10-03T08:37:33Z) - Towards Personalized Deep Research: Benchmarks and Evaluations [56.581105664044436]
We introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in Deep Research Agents (DRAs). It pairs 50 diverse research tasks with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research.
arXiv Detail & Related papers (2025-09-29T17:39:17Z) - SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models [14.855783196702191]
We present SurveyGen, a large-scale dataset comprising over 4,200 human-written surveys across diverse scientific domains. We build QUAL-SG, a novel quality-aware framework for survey generation.
arXiv Detail & Related papers (2025-08-25T04:22:23Z) - SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation [37.921524136479825]
SurGE (Survey Generation Evaluation) is a new benchmark for scientific survey generation in computer science. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers. In addition, we propose an automated evaluation framework that measures the quality of generated surveys across four dimensions.
arXiv Detail & Related papers (2025-08-21T15:45:10Z) - SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks [87.29946641069068]
We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks. We release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data.
arXiv Detail & Related papers (2025-07-01T17:51:59Z) - Evaluating Step-by-step Reasoning Traces: A Survey [8.279021694489462]
Step-by-step reasoning is widely used to enhance the reasoning ability of large language models (LLMs) on complex problems. Existing evaluation practices are highly inconsistent, resulting in fragmented progress across evaluator design and benchmark development. This survey proposes a taxonomy of evaluation criteria with four top-level categories: factuality, validity, coherence, and utility.
arXiv Detail & Related papers (2025-02-17T19:58:31Z) - Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)