Benchmarking Large Language Models for News Summarization
- URL: http://arxiv.org/abs/2301.13848v1
- Date: Tue, 31 Jan 2023 18:46:19 GMT
- Title: Benchmarking Large Language Models for News Summarization
- Authors: Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen
McKeown, Tatsunori B. Hashimoto
- Abstract summary: Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
- Score: 79.37850439866938
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown promise for automatic summarization
but the reasons behind their successes are poorly understood. By conducting a
human evaluation on ten LLMs across different pretraining methods, prompts, and
model scales, we make two important observations. First, we find instruction
tuning, and not model size, is the key to the LLM's zero-shot summarization
capability. Second, existing studies have been limited by low-quality
references, leading to underestimates of human performance and lower few-shot
and finetuning performance. To better evaluate LLMs, we perform human
evaluation over high-quality summaries we collect from freelance writers.
Despite major stylistic differences such as the amount of paraphrasing, we find
that LLM summaries are judged to be on par with human-written summaries.
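The abstract credits zero-shot summarization ability to instruction tuning rather than model scale. As a rough illustration of what such a zero-shot setup looks like, the sketch below prompts an instruction-tuned model with a plain-language instruction and the article text, with no in-context examples or fine-tuning. The model name (google/flan-t5-large), prompt wording, and decoding settings are illustrative assumptions, not the paper's exact configuration.
```python
# Minimal zero-shot news summarization sketch with an instruction-tuned model.
# Assumes the Hugging Face `transformers` library; the model, prompt, and
# decoding settings are illustrative and not taken from the paper.
from transformers import pipeline

summarizer = pipeline("text2text-generation", model="google/flan-t5-large")

article = "..."  # full news article text goes here

prompt = (
    "Summarize the following news article in two or three sentences.\n\n"
    + article
)

# Greedy decoding keeps the output deterministic for this illustration.
result = summarizer(prompt, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])
```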
Related papers
- PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs.
We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z)
- Zero-shot Conversational Summarization Evaluations with small Large Language Models [7.525771026977357]
Large Language Models (LLMs) exhibit powerful summarization abilities.
We evaluate LLMs on conversational summarization and showcase their performance on various prompts.
We also evaluate the models with human evaluations and discuss the limitations of the models on conversational summarization.
arXiv Detail & Related papers (2023-11-29T19:34:34Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- Summarization is (Almost) Dead [49.360752383801305]
We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of large language models (LLMs).
Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models.
arXiv Detail & Related papers (2023-09-18T08:13:01Z)
- On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets.
We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.