SurveyBench: Can LLM(-Agents) Write Academic Surveys that Align with Reader Needs?
- URL: http://arxiv.org/abs/2510.03120v2
- Date: Mon, 06 Oct 2025 13:13:37 GMT
- Title: SurveyBench: Can LLM(-Agents) Write Academic Surveys that Align with Reader Needs?
- Authors: Zhaojun Sun, Xuzhou Zhu, Xuanhe Zhou, Xin Tong, Shuo Wang, Jie Fu, Guoliang Li, Zhiyuan Liu, Fan Wu
- Abstract summary: Survey writing is a labor-intensive and intellectually demanding task. Recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically, but their outputs often fall short of human standards, and a rigorous, reader-aligned benchmark has been lacking. We propose SurveyBench, a fine-grained, quiz-driven evaluation framework.
- Score: 37.28508850738341
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards, and no rigorous, reader-aligned benchmark exists to thoroughly reveal their deficiencies. To fill this gap, we propose SurveyBench, a fine-grained, quiz-driven evaluation framework featuring (1) typical survey topics sourced from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys; (2) a multifaceted metric hierarchy that assesses outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers' informational needs. Results show that SurveyBench effectively challenges existing LLM4Survey approaches (e.g., their outputs score on average 21% lower than human-written surveys in content-based evaluation).
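To make the quiz-based half of the dual-mode protocol concrete, here is a minimal sketch. The QuizItem record, answerability_score function, and keyword_judge below are hypothetical illustrations, not SurveyBench's published interface; a faithful judge would be an LLM prompted with the survey text, the question, and a reference answer.

```python
# Minimal sketch of a quiz-based answerability test in the spirit of
# SurveyBench's dual-mode protocol. QuizItem, answerability_score, and
# keyword_judge are hypothetical names, not the paper's actual API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QuizItem:
    question: str          # a reader-need question about the survey topic
    reference_answer: str  # ground-truth answer distilled from the literature


def answerability_score(
    survey_text: str,
    quiz: List[QuizItem],
    judge: Callable[[str, QuizItem], bool],
) -> float:
    """Fraction of reader quiz questions the generated survey can answer."""
    if not quiz:
        return 0.0
    answered = sum(judge(survey_text, item) for item in quiz)
    return answered / len(quiz)


def keyword_judge(survey_text: str, item: QuizItem) -> bool:
    """Toy judge: a question counts as answerable if every term of the
    reference answer appears in the survey. A faithful judge would instead
    prompt an LLM with the survey, the question, and the reference answer."""
    text = survey_text.lower()
    return all(term in text for term in item.reference_answer.lower().split())


quiz = [QuizItem("What metric families does the benchmark use?",
                 "outline content quality")]
print(answerability_score("It scores outline and content quality.",
                          quiz, keyword_judge))
# -> 1.0
```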
Related papers
- RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension [65.81339691942757]
RPC-Bench is a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers. We design a fine-grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts.
arXiv Detail & Related papers (2026-01-14T11:37:00Z)
- OpenNovelty: An LLM-powered Agentic System for Verifiable Scholarly Novelty Assessment [63.662126457336534]
OpenNovelty is an agentic system for transparent, evidence-based novelty analysis. It grounds all assessments in retrieved real papers, ensuring verifiable judgments. OpenNovelty aims to empower the research community with a scalable tool that promotes fair, consistent, and evidence-backed peer review.
arXiv Detail & Related papers (2026-01-04T15:48:51Z)
- LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild [86.6586720134927]
LiveResearchBench is a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia. DeepEval, a companion evaluation suite, covers both content- and report-level quality. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
arXiv Detail & Related papers (2025-10-16T02:49:16Z)
- Benchmarking Computer Science Survey Generation [18.844790013427282]
SurGE (Survey Generation Evaluation) is a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality.
arXiv Detail & Related papers (2025-08-21T15:45:10Z)
- Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper [64.50822834679101]
SciIG is a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works. We assess five state-of-the-art models: the open-source DeepSeek-v3, Gemma-3-12B, LLaMA-4 Maverick, and MistralAI Small 3.1, and the closed-source GPT-4o. Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness.
arXiv Detail & Related papers (2025-08-19T21:11:11Z)
- SciSage: A Multi-Agent Framework for High-Quality Scientific Survey Generation [2.985620880452744]
SciSage is a multi-agent framework employing a reflect-when-you-write paradigm. It critically evaluates drafts at outline, section, and document levels, collaborating with specialized agents for query interpretation, content retrieval, and refinement. We also release SurveyScope, a benchmark of 46 high-impact papers (2020-2025) across 11 computer science domains.
arXiv Detail & Related papers (2025-06-15T02:23:47Z)
- Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Our contributions focus on three key challenges encountered in real-world use: (i) user prompts are often under-specified; (ii) retrieved candidate papers frequently contain irrelevant content; and (iii) task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z)
- SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing [13.101632066188532]
We introduce SurveyForge, which generates the outline by analyzing the logical structure of human-written outlines. To achieve a comprehensive evaluation, we construct SurveyBench, which includes 100 human-written survey papers for win-rate comparison. Experiments demonstrate that SurveyForge can outperform previous works such as AutoSurvey.
arXiv Detail & Related papers (2025-03-06T17:15:48Z)
- ResearchArena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents [30.603079363363634]
This study introduces ResearchArena, a benchmark designed to evaluate large language models' capabilities in conducting academic surveys. ResearchArena models the process in three stages: (1) information discovery, identifying relevant literature; (2) information selection, evaluating papers' relevance and impact; and (3) information organization (see the sketch after this entry). To support these evaluations, we construct an offline environment of 12M full-text academic papers and 7.9K survey papers.
arXiv Detail & Related papers (2024-06-13T03:26:30Z)
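As a concrete, purely illustrative reading of ResearchArena's three stages, the sketch below wires discovery, selection, and organization together over an assumed paper-dict schema (title/abstract/citations/subtopic). The function names and heuristics are assumptions; a real system would use dense retrieval and learned rankers rather than keyword filters and citation counts.

```python
# Illustrative sketch of ResearchArena's three evaluation stages
# (discovery -> selection -> organization). Function names and the
# paper-dict schema (title/abstract/citations/subtopic) are assumptions.
from typing import Dict, Iterable, List


def discover(corpus: Iterable[Dict], topic: str) -> List[Dict]:
    """Stage 1: information discovery -- retrieve candidate papers.
    (A toy keyword filter; real systems would use dense retrieval.)"""
    return [p for p in corpus if topic.lower() in p["abstract"].lower()]


def select(candidates: List[Dict], k: int = 50) -> List[Dict]:
    """Stage 2: information selection -- keep the k highest-impact papers,
    with citation count as a crude proxy for relevance and impact."""
    return sorted(candidates, key=lambda p: p.get("citations", 0),
                  reverse=True)[:k]


def organize(papers: List[Dict]) -> Dict[str, List[str]]:
    """Stage 3: information organization -- group titles into a flat outline."""
    outline: Dict[str, List[str]] = {}
    for p in papers:
        outline.setdefault(p.get("subtopic", "other"), []).append(p["title"])
    return outline


corpus = [
    {"title": "A", "abstract": "LLM survey writing", "citations": 12,
     "subtopic": "generation"},
    {"title": "B", "abstract": "graph databases", "citations": 90},
]
print(organize(select(discover(corpus, "survey"))))
# -> {'generation': ['A']}
```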