Related papers: SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models

URL: http://arxiv.org/abs/2508.17647v1
Date: Mon, 25 Aug 2025 04:22:23 GMT
Title: SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models
Authors: Tong Bao, Mir Tafseer Nayeem, Davood Rafiei, Chengzhi Zhang
Abstract summary: We present SurveyGen, a large-scale dataset comprising over 4,200 human-written surveys across diverse scientific domains. We build QUAL-SG, a novel quality-aware framework for survey generation.
Score: 14.855783196702191
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automatic survey generation has emerged as a key task in scientific document processing. While large language models (LLMs) have shown promise in generating survey texts, the lack of standardized evaluation datasets critically hampers rigorous assessment of their performance against human-written surveys. In this work, we present SurveyGen, a large-scale dataset comprising over 4,200 human-written surveys across diverse scientific domains, along with 242,143 cited references and extensive quality-related metadata for both the surveys and the cited papers. Leveraging this resource, we build QUAL-SG, a novel quality-aware framework for survey generation that enhances the standard Retrieval-Augmented Generation (RAG) pipeline by incorporating quality-aware indicators into literature retrieval to assess and select higher-quality source papers. Using this dataset and framework, we systematically evaluate state-of-the-art LLMs under varying levels of human involvement - from fully automatic generation to human-guided writing. Experimental results and human evaluations show that while semi-automatic pipelines can achieve partially competitive outcomes, fully automatic survey generation still suffers from low citation quality and limited critical analysis.

Related papers

DeepSurvey-Bench: Evaluating Academic Value of Automatically Generated Scientific Survey [53.85391477976017]
DeepSurvey-Bench is a novel benchmark designed to comprehensively evaluate the academic value of generated surveys. We construct a reliable dataset with academic value annotations, and evaluate the deep academic value of the generated surveys.
arXiv Detail & Related papers (2026-01-13T14:42:56Z)
SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys [25.85280799022144]
We introduce SurveyEval, a benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment.
arXiv Detail & Related papers (2025-12-02T13:42:09Z)
AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research [81.04845910798387]
Generating natural language explanations for threat detections remains an open problem in cybersecurity research. We present AutoMalDesc, an automated static analysis summarization framework that operates independently at scale. We publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) datasets, along with our methodology and evaluation framework.
arXiv Detail & Related papers (2025-11-17T13:05:25Z)
AutoSurvey2: Empowering Researchers with Next Level Automated Literature Surveys [10.50820843303237]
This paper introduces autosurvey2, a multi-stage pipeline that automates survey generation through retrieval-augmented synthesis and structured evaluation. The system integrates parallel section generation, iterative refinement, and real-time retrieval of recent publications to ensure both topical completeness and factual accuracy. Experimental results demonstrate that autosurvey2 consistently outperforms existing retrieval-based and automated baselines.
arXiv Detail & Related papers (2025-10-29T22:57:03Z)
Benchmarking Computer Science Survey Generation [18.844790013427282]
SurGE (Survey Generation Evaluation) is a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality.
arXiv Detail & Related papers (2025-08-21T15:45:10Z)
SciSage: A Multi-Agent Framework for High-Quality Scientific Survey Generation [2.985620880452744]
SciSage is a multi-agent framework employing a reflect-when-you-write paradigm. It critically evaluates drafts at outline, section, and document levels, collaborating with specialized agents for query interpretation, content retrieval, and refinement. We also release SurveyScope, a benchmark of 46 high-impact papers (2020-2025) across 11 computer science domains.
arXiv Detail & Related papers (2025-06-15T02:23:47Z)
SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing [13.101632066188532]
We introduce SurveyForge, which generates the outline by analyzing the logical structure of human-written outlines. To achieve a comprehensive evaluation, we construct SurveyBench, which includes 100 human-written survey papers for win-rate comparison. Experiments demonstrate that SurveyForge can outperform previous works such as AutoSurvey.
arXiv Detail & Related papers (2025-03-06T17:15:48Z)
SurveyX: Academic Survey Automation via Large Language Models [22.597703631935463]
SurveyX is an efficient and organized system for automated survey generation. It decomposes the survey composing process into two phases: Preparation and Generation. It significantly enhances the efficacy of survey composition.
arXiv Detail & Related papers (2025-02-20T17:59:45Z)
What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation [57.550045763103334]
Evaluating a story can be more challenging than other generation evaluation tasks. We first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual. We propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation.
arXiv Detail & Related papers (2024-08-26T20:35:42Z)
Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-text generation. It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers. It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation. We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z)
Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study [86.62171568318716]
Large generative language models such as GPT-2 are well-known for their ability to generate text. We show that unsupervised predictors of "page quality" emerge, able to detect low quality content without any training. We conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.
arXiv Detail & Related papers (2020-08-17T07:13:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.