SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys
- URL: http://arxiv.org/abs/2512.02763v1
- Date: Tue, 02 Dec 2025 13:42:09 GMT
- Title: SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys
- Authors: Jiahao Zhao, Shuaixing Zhang, Nan Xu, Lei Wang
- Abstract summary: We introduce SurveyEval, a benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems are able to deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed to understand and improve automatic survey systems across diverse subjects and evaluation criteria.
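As a rough, non-authoritative illustration of the reference-anchored, three-dimension judging described above, the Python sketch below scores a generated survey once per dimension. The `call_llm` stub, the prompt wording, and the 1-to-5 scale are assumptions made for the sketch, not SurveyEval's actual protocol.

```python
# Hypothetical sketch of an LLM-as-a-Judge pass in the spirit of SurveyEval.
# `call_llm` is a stand-in; swap in a real chat-completion client.

DIMENSIONS = ["overall quality", "outline coherence", "reference accuracy"]

def call_llm(prompt: str) -> str:
    """Stub judge model; always returns a mid-scale score so the sketch runs."""
    return "3"

def judge_survey(generated: str, human_reference: str) -> dict:
    """Score one generated survey per dimension, 1 (poor) to 5 (excellent).

    The human-written reference is placed in the prompt to anchor the judge,
    mirroring the paper's use of human references for evaluation alignment.
    """
    scores = {}
    for dim in DIMENSIONS:
        prompt = (
            "You are grading an automatically generated academic survey.\n"
            f"Dimension: {dim}. Use the human-written survey as a reference.\n"
            f"--- HUMAN REFERENCE ---\n{human_reference}\n"
            f"--- GENERATED SURVEY ---\n{generated}\n"
            "Reply with a single integer from 1 to 5."
        )
        scores[dim] = int(call_llm(prompt).strip())
    return scores

print(judge_survey("...generated survey text...", "...human-written survey text..."))
```

In a real run, `call_llm` would wrap an actual LLM client, and scores would typically be averaged over repeated judge calls to reduce variance.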
Related papers
- DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing [53.85037373860246]
We introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization). Our results demonstrate that agentic plan-and-write pipelines significantly outperform single-turn generation. A toy sketch of checklist-style scoring appears after this entry.
arXiv Detail & Related papers (2026-01-07T03:07:52Z)
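The sketch below is a minimal, assumption-laden rendering of checklist-based scoring: each item is marked pass/fail and the score is the pass rate per checklist kind. The naive keyword verifier stands in for the LLM judge DeepSynth-Eval would actually use.

```python
# Hedged sketch of checklist scoring in the spirit of DeepSynth-Eval.
# Item wording and the pass/fail verifier are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ChecklistItem:
    description: str   # what the survey must cover or satisfy
    kind: str          # "general" (factual coverage) or "constraint" (structure)

def passes(item: ChecklistItem, survey: str) -> bool:
    """Stand-in verifier; in practice an LLM judge would decide each item."""
    return item.description.lower() in survey.lower()  # naive keyword check

def checklist_score(items: list, survey: str) -> dict:
    """Fraction of items satisfied, reported per checklist kind."""
    results = {}
    for item in items:
        results.setdefault(item.kind, []).append(passes(item, survey))
    return {kind: sum(r) / len(r) for kind, r in results.items()}

items = [
    ChecklistItem("transformer", "general"),
    ChecklistItem("has a related-work section", "constraint"),
]
print(checklist_score(items, "Surveys of transformer models..."))
```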
- AutoSurvey2: Empowering Researchers with Next Level Automated Literature Surveys [10.50820843303237]
This paper introduces AutoSurvey2, a multi-stage pipeline that automates survey generation through retrieval-augmented synthesis and structured evaluation. The system integrates parallel section generation, iterative refinement, and real-time retrieval of recent publications to ensure both topical completeness and factual accuracy. Experimental results demonstrate that AutoSurvey2 consistently outperforms existing retrieval-based and automated baselines. A skeleton of such a pipeline is sketched after this entry.
arXiv Detail & Related papers (2025-10-29T22:57:03Z)
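Below is a hedged skeleton of a retrieve-then-write pipeline loosely following the stages AutoSurvey2 names (retrieval, parallel section generation, iterative refinement). Every function body is a stub; the real system's prompts, retrieval backend, and refinement loop are not reproduced here.

```python
# Illustrative skeleton of a retrieve-then-write survey pipeline; all stubs.

from concurrent.futures import ThreadPoolExecutor

def retrieve(query: str) -> list:
    return [f"paper about {query}"]          # stand-in for a real search API

def write_section(heading: str) -> str:
    evidence = retrieve(heading)             # ground each section in retrieval
    return f"## {heading}\n(drafted from {len(evidence)} sources)"

def refine(draft: str) -> str:
    return draft                             # iterative refinement would go here

def generate_survey(topic: str, outline: list) -> str:
    with ThreadPoolExecutor() as pool:       # sections drafted in parallel
        sections = list(pool.map(write_section, outline))
    return refine(f"# Survey: {topic}\n" + "\n\n".join(sections))

print(generate_survey("LLM evaluation", ["Background", "Methods", "Open Problems"]))
```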
- A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System [56.40989626804489]
This survey provides the first holistic analysis of software engineering powered by Large Language Models. We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair.
arXiv Detail & Related papers (2025-10-10T06:56:50Z)
- SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models [14.855783196702191]
We present SurveyGen, a large-scale dataset comprising over 4,200 human-written surveys across diverse scientific domains. We build QUAL-SG, a novel quality-aware framework for survey generation.
arXiv Detail & Related papers (2025-08-25T04:22:23Z)
- SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems [26.888698710786507]
SGSimEval is a comprehensive benchmark for Survey Generation with Similarity-Enhanced Evaluation. We introduce human preference metrics that emphasize both inherent quality and similarity to humans. Our experiments reveal that current ASG systems achieve performance comparable to humans in outline generation. A toy version of such a similarity-blended score is sketched after this entry.
arXiv Detail & Related papers (2025-08-15T08:27:58Z)
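A toy version of a similarity-enhanced score, blending an intrinsic quality judgment with similarity to a human-written survey. The linear blend, the `alpha` weight, and the Jaccard proxy are illustrative assumptions, not SGSimEval's actual metrics.

```python
# Toy similarity-enhanced score: blend intrinsic quality with similarity
# to a human-written reference survey. All weights are assumptions.

def jaccard(a: str, b: str) -> float:
    """Crude lexical similarity; a real benchmark would use stronger measures."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def similarity_enhanced_score(quality: float, generated: str,
                              human: str, alpha: float = 0.5) -> float:
    """`quality` in [0, 1] from any judge; `alpha` trades quality vs. similarity."""
    return alpha * quality + (1 - alpha) * jaccard(generated, human)

print(similarity_enhanced_score(0.8, "a survey of RAG systems", "a survey of RAG"))
```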
- Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets [0.0]
Retrieval-Augmented Generation (RAG) has advanced significantly in recent years. RAG complexity poses substantial challenges for systematic evaluation and quality enhancement. This study systematically reviews 63 academic articles to provide a comprehensive overview of state-of-the-art RAG evaluation methodologies.
arXiv Detail & Related papers (2025-04-28T08:22:19Z)
- SurveyX: Academic Survey Automation via Large Language Models [22.597703631935463]
SurveyX is an efficient and organized system for automated survey generation. It decomposes the survey composition process into two phases, Preparation and Generation, and significantly enhances the efficacy of survey composition.
arXiv Detail & Related papers (2025-02-20T17:59:45Z)
- Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [52.76508734756661]
Auto-PRE is an automatic evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluators based on three core traits. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance. A hypothetical evaluator-selection sketch follows this entry.
arXiv Detail & Related papers (2024-10-16T06:06:06Z)
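The abstract mentions three evaluator-selection traits without naming them, so the trait names below (consistency, expertise, calibration) are purely hypothetical placeholders; only the shape of the selection step is meant to match.

```python
# Rough sketch of peer-review-style evaluator selection. The trait names
# are hypothetical; Auto-PRE's actual three traits are not listed in the
# abstract and are NOT reproduced here.

def select_evaluators(candidates: dict, threshold: float = 0.7) -> list:
    """Keep candidate judge models whose average trait score clears a bar."""
    return [
        name for name, traits in candidates.items()
        if sum(traits.values()) / len(traits) >= threshold
    ]

candidates = {
    "judge-a": {"consistency": 0.9, "expertise": 0.8, "calibration": 0.7},
    "judge-b": {"consistency": 0.5, "expertise": 0.6, "calibration": 0.4},
}
print(select_evaluators(candidates))  # only judge-a passes the bar
```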
- Trustworthiness in Retrieval-Augmented Generation Systems: A Survey [59.26328612791924]
Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). We propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. A minimal scorecard for these dimensions is sketched after this entry.
arXiv Detail & Related papers (2024-09-16T09:06:44Z)
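A minimal scorecard over the six named dimensions; the plain-mean aggregation is an assumption, since the survey proposes the dimensions rather than a single formula.

```python
# Scorecard mirroring the six trustworthiness dimensions named in the survey;
# the mean aggregation is an illustrative assumption.

from dataclasses import dataclass, fields

@dataclass
class RagTrustScores:
    factuality: float
    robustness: float
    fairness: float
    transparency: float
    accountability: float
    privacy: float

    def overall(self) -> float:
        vals = [getattr(self, f.name) for f in fields(self)]
        return sum(vals) / len(vals)

print(RagTrustScores(0.9, 0.8, 0.7, 0.6, 0.8, 0.9).overall())
```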
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. One plausible keypoint-based reading of the three metrics is sketched after this entry.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
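One plausible keypoint-based reading of the three metric names, sketched with set arithmetic; these are not RAGEval's formal definitions.

```python
# Hedged, keypoint-based reading of Completeness / Hallucination / Irrelevance.
# Assumes ground-truth keypoints labeled as covered or contradicted by the
# generated answer; assumes a non-empty keypoint set.

def rag_metrics(truth_keypoints: set, covered: set, contradicted: set) -> dict:
    n = len(truth_keypoints)
    return {
        "completeness": len(covered & truth_keypoints) / n,
        "hallucination": len(contradicted & truth_keypoints) / n,
        "irrelevance": len(truth_keypoints - covered - contradicted) / n,
    }

truth = {"k1", "k2", "k3"}
print(rag_metrics(truth, covered={"k1"}, contradicted={"k2"}))
```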
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-text generation. It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers. It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions. A minimal version of this scoring loop is sketched after this entry.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
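A minimal rendering of the proxy-question scoring loop: an evaluator answers each pre-annotated proxy-question from the generated text alone, and the final score is its accuracy. The `answer_from_text` stub stands in for a real reading model.

```python
# Sketch of ProxyQA-style scoring; the keyword-matching evaluator is a stub
# for an LLM that answers questions using ONLY the generated long text.

def answer_from_text(question: str, text: str) -> str:
    """Stub evaluator; a real setup would prompt an LLM with `text` only."""
    key = question.lower().split()[-1].rstrip("?")
    return "yes" if key in text.lower() else "no"

def proxyqa_score(generated: str, proxy_qa: list) -> float:
    """Fraction of proxy-questions the evaluator answers correctly."""
    correct = sum(
        answer_from_text(q, generated).strip().lower() == gold.lower()
        for q, gold in proxy_qa
    )
    return correct / len(proxy_qa)

qa = [("Does the article mention transformers?", "yes")]
print(proxyqa_score("A survey covering transformers...", qa))
```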
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)