SurveyBench: Can LLM(-Agents) Write Academic Surveys that Align with Reader Needs?
- URL: http://arxiv.org/abs/2510.03120v2
- Date: Mon, 06 Oct 2025 13:13:37 GMT
- Title: SurveyBench: Can LLM(-Agents) Write Academic Surveys that Align with Reader Needs?
- Authors: Zhaojun Sun, Xuzhou Zhu, Xuanhe Zhou, Xin Tong, Shuo Wang, Jie Fu, Guoliang Li, Zhiyuan Liu, Fan Wu
- Abstract summary: Survey writing is a labor-intensive and intellectually demanding task. Recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically, but their outputs often fall short of human standards, and a rigorous, reader-aligned benchmark has been lacking. We propose SurveyBench, a fine-grained, quiz-driven evaluation framework.
- Score: 37.28508850738341
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards, and no rigorous, reader-aligned benchmark exists to thoroughly reveal their deficiencies. To fill this gap, we propose SurveyBench, a fine-grained, quiz-driven evaluation framework featuring (1) typical survey topics sourced from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys; (2) a multifaceted metric hierarchy that assesses outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers' informational needs. Results show that SurveyBench effectively challenges existing LLM4Survey approaches (e.g., their outputs score on average 21% lower than human-written surveys in content-based evaluation).
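To make the quiz-based half of the dual-mode protocol concrete, here is a minimal sketch. The QuizItem record, answerability_score function, and keyword_judge below are hypothetical illustrations, not SurveyBench's published interface; a faithful judge would be an LLM prompted with the survey text, the question, and a reference answer.

```python
# Minimal sketch of a quiz-based answerability test in the spirit of
# SurveyBench's dual-mode protocol. QuizItem, answerability_score, and
# keyword_judge are hypothetical names, not the paper's actual API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QuizItem:
    question: str          # a reader-need question about the survey topic
    reference_answer: str  # ground-truth answer distilled from the literature


def answerability_score(
    survey_text: str,
    quiz: List[QuizItem],
    judge: Callable[[str, QuizItem], bool],
) -> float:
    """Fraction of reader quiz questions the generated survey can answer."""
    if not quiz:
        return 0.0
    answered = sum(judge(survey_text, item) for item in quiz)
    return answered / len(quiz)


def keyword_judge(survey_text: str, item: QuizItem) -> bool:
    """Toy judge: a question counts as answerable if every term of the
    reference answer appears in the survey. A faithful judge would instead
    prompt an LLM with the survey, the question, and the reference answer."""
    text = survey_text.lower()
    return all(term in text for term in item.reference_answer.lower().split())


quiz = [QuizItem("What metric families does the benchmark use?",
                 "outline content quality")]
print(answerability_score("It scores outline and content quality.",
                          quiz, keyword_judge))
# -> 1.0
```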
Related papers
- RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension [65.81339691942757]
RPC-Bench is a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers. We design a fine-grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts.
arXiv Detail & Related papers (2026-01-14T11:37:00Z)
- OpenNovelty: An LLM-powered Agentic System for Verifiable Scholarly Novelty Assessment [63.662126457336534]
OpenNovelty is an agentic system for transparent, evidence-based novelty analysis. It grounds all assessments in retrieved real papers, ensuring verifiable judgments. OpenNovelty aims to empower the research community with a scalable tool that promotes fair, consistent, and evidence-backed peer review.
arXiv Detail & Related papers (2026-01-04T15:48:51Z)
- LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild [86.6586720134927]
LiveResearchBench is a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia. DeepEval, a companion evaluation suite, covers both content- and report-level quality. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
arXiv Detail & Related papers (2025-10-16T02:49:16Z)
- Benchmarking Computer Science Survey Generation [18.844790013427282]
SurGE (Survey Generation Evaluation) is a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality.
arXiv Detail & Related papers (2025-08-21T15:45:10Z)
- Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper [64.50822834679101]
SciIG is a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works. We assess five state-of-the-art models: the open-source DeepSeek-v3, Gemma-3-12B, LLaMA-4 Maverick, and MistralAI Small 3.1, and the closed-source GPT-4o. Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness.
arXiv Detail & Related papers (2025-08-19T21:11:11Z)
- SciSage: A Multi-Agent Framework for High-Quality Scientific Survey Generation [2.985620880452744]
SciSage is a multi-agent framework employing a reflect-when-you-write paradigm. It critically evaluates drafts at outline, section, and document levels, collaborating with specialized agents for query interpretation, content retrieval, and refinement. We also release SurveyScope, a benchmark of 46 high-impact papers (2020-2025) across 11 computer science domains.
arXiv Detail & Related papers (2025-06-15T02:23:47Z)
- Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Our contributions focus on three key challenges encountered in real-world use: (i) user prompts are often under-specified; (ii) retrieved candidate papers frequently contain irrelevant content; and (iii) task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z)
- SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing [13.101632066188532]
We introduce SurveyForge, which generates the outline by analyzing the logical structure of human-written outlines. To achieve a comprehensive evaluation, we construct SurveyBench, which includes 100 human-written survey papers for win-rate comparison. Experiments demonstrate that SurveyForge can outperform previous works such as AutoSurvey.
arXiv Detail & Related papers (2025-03-06T17:15:48Z)
- ResearchArena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents [30.603079363363634]
This study introduces ResearchArena, a benchmark designed to evaluate large language models' capabilities in conducting academic surveys. ResearchArena models the process in three stages: (1) information discovery, identifying relevant literature; (2) information selection, evaluating papers' relevance and impact; and (3) information organization (see the sketch after this entry). To support these evaluations, we construct an offline environment of 12M full-text academic papers and 7.9K survey papers.
arXiv Detail & Related papers (2024-06-13T03:26:30Z)
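As a concrete, purely illustrative reading of ResearchArena's three stages, the sketch below wires discovery, selection, and organization together over an assumed paper-dict schema (title/abstract/citations/subtopic). The function names and heuristics are assumptions; a real system would use dense retrieval and learned rankers rather than keyword filters and citation counts.

```python
# Illustrative sketch of ResearchArena's three evaluation stages
# (discovery -> selection -> organization). Function names and the
# paper-dict schema (title/abstract/citations/subtopic) are assumptions.
from typing import Dict, Iterable, List


def discover(corpus: Iterable[Dict], topic: str) -> List[Dict]:
    """Stage 1: information discovery -- retrieve candidate papers.
    (A toy keyword filter; real systems would use dense retrieval.)"""
    return [p for p in corpus if topic.lower() in p["abstract"].lower()]


def select(candidates: List[Dict], k: int = 50) -> List[Dict]:
    """Stage 2: information selection -- keep the k highest-impact papers,
    with citation count as a crude proxy for relevance and impact."""
    return sorted(candidates, key=lambda p: p.get("citations", 0),
                  reverse=True)[:k]


def organize(papers: List[Dict]) -> Dict[str, List[str]]:
    """Stage 3: information organization -- group titles into a flat outline."""
    outline: Dict[str, List[str]] = {}
    for p in papers:
        outline.setdefault(p.get("subtopic", "other"), []).append(p["title"])
    return outline


corpus = [
    {"title": "A", "abstract": "LLM survey writing", "citations": 12,
     "subtopic": "generation"},
    {"title": "B", "abstract": "graph databases", "citations": 90},
]
print(organize(select(discover(corpus, "survey"))))
# -> {'generation': ['A']}
```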