Related papers: Not too long do read: Evaluating LLM-generated extreme scientific summaries

Not too long do read: Evaluating LLM-generated extreme scientific summaries

URL: http://arxiv.org/abs/2512.23206v1
Date: Mon, 29 Dec 2025 05:03:02 GMT
Title: Not too long do read: Evaluating LLM-generated extreme scientific summaries
Authors: Zhuoqi Lyu, Qing Ke,
Abstract summary: We propose a novel dataset, BiomedTLDR, containing a large sample of researcher-authored summaries from scientific papers.<n>We then test popular open-weight LLMs for generating extreme summaries based on abstracts.<n>Our analysis reveals that, although some of them successfully produce humanoid summaries, LLMs generally exhibit a greater affinity for the original text's lexical choices and rhetorical structures.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: High-quality scientific extreme summary (TLDR) facilitates effective science communication. How do large language models (LLMs) perform in generating them? How are LLM-generated summaries different from those written by human experts? However, the lack of a comprehensive, high-quality scientific TLDR dataset hinders both the development and evaluation of LLMs' summarization ability. To address these, we propose a novel dataset, BiomedTLDR, containing a large sample of researcher-authored summaries from scientific papers, which leverages the common practice of including authors' comments alongside bibliography items. We then test popular open-weight LLMs for generating TLDRs based on abstracts. Our analysis reveals that, although some of them successfully produce humanoid summaries, LLMs generally exhibit a greater affinity for the original text's lexical choices and rhetorical structures, hence tend to be more extractive rather than abstractive in general, compared to humans. Our code and datasets are available at https://github.com/netknowledge/LLM_summarization (Lyu and Ke, 2025).

Related papers

RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis [78.32151470154422]
We introduce RAVEL, an agentic framework that enables the testers to autonomously plan and execute typical synthesis operations.<n>We present C3EBench, a benchmark comprising 1,258 samples derived from professional human writings.<n>By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM's reasoning capability.
arXiv Detail & Related papers (2026-02-28T14:47:34Z)
Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation [89.52571224447111]
Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization.<n>We provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization.
arXiv Detail & Related papers (2026-02-07T19:39:28Z)
How Do LLM-Generated Texts Impact Term-Based Retrieval Models? [76.92519309816008]
This paper investigates the influence of large language models (LLMs) on term-based retrieval models.<n>Our linguistic analysis reveals that LLM-generated texts exhibit smoother high-frequency and steeper low-frequency Zipf slopes.<n>Our study further explores whether term-based retrieval models demonstrate source bias, concluding that these models prioritize documents whose term distributions closely correspond to those of the queries.
arXiv Detail & Related papers (2025-08-25T06:43:27Z)
Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs)<n>We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy.<n>We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Large language models (LLMs) have led to their integration into peer review.<n>The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system.<n>We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z)
Calibration of Large Language Models on Code Summarization [4.4378250612684]
We study how closely AI-generated summaries resemble a summary a human might have produced.<n>Measures such as BERTScore and BLEU have been suggested and evaluated with human-subject studies.
arXiv Detail & Related papers (2024-04-30T07:38:08Z)
Mapping the Increasing Use of LLMs in Scientific Papers [99.67983375899719]
We conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals. Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers.
arXiv Detail & Related papers (2024-04-01T17:45:15Z)
Can Large Language Model Summarizers Adapt to Diverse Scientific Communication Goals? [19.814974042343028]
We investigate the controllability of large language models (LLMs) on scientific summarization tasks. We find that non-fine-tuned LLMs outperform humans in the MuP review generation task.
arXiv Detail & Related papers (2024-01-18T23:00:54Z)
Summarization is (Almost) Dead [49.360752383801305]
We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of large language models (LLMs) Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models.
arXiv Detail & Related papers (2023-09-18T08:13:01Z)
On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)
Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method [35.181659789684545]
Automatic summarization generates concise summaries that contain key ideas of source documents. References from CNN/DailyMail and BBC XSum are noisy, mainly in terms of factual hallucination and information redundancy. We propose a Summary Chain-of-Thought (SumCoT) technique to elicit LLMs to generate summaries step by step. Experimental results show our method outperforms state-of-the-art fine-tuned PLMs and zero-shot LLMs by +4.33/+4.77 in ROUGE-L.
arXiv Detail & Related papers (2023-05-22T18:54:35Z)
CiteSum: Citation Text-guided Scientific Extreme Summarization and Low-resource Domain Adaptation [41.494287785760534]
We create a new benchmark CiteSum without human annotation, which is around 30 times larger than the previous human-curated dataset SciTLDR. For scientific extreme summarization, CITES outperforms most fully-supervised methods on SciTLDR without any fine-tuning. For news extreme summarization, CITES achieves significant gains on XSum over its base model.
arXiv Detail & Related papers (2022-05-12T16:44:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.