GUMSum: Multi-Genre Data and Evaluation for English Abstractive
Summarization
- URL: http://arxiv.org/abs/2306.11256v1
- Date: Tue, 20 Jun 2023 03:21:10 GMT
- Title: GUMSum: Multi-Genre Data and Evaluation for English Abstractive
Summarization
- Authors: Yang Janet Liu and Amir Zeldes
- Abstract summary: Automatic summarization with pre-trained language models has led to impressively fluent results, but is prone to 'hallucinations', low performance on non-news genres, and outputs which are not exactly summaries.
We present GUMSum, a dataset of English summaries in 12 written and spoken genres for evaluation of abstractive summarization.
- Score: 10.609715843964263
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic summarization with pre-trained language models has led to
impressively fluent results, but is prone to 'hallucinations', low performance
on non-news genres, and outputs which are not exactly summaries. Targeting ACL
2023's 'Reality Check' theme, we present GUMSum, a small but carefully crafted
dataset of English summaries in 12 written and spoken genres for evaluation of
abstractive summarization. Summaries are highly constrained, focusing on
substitutive potential, factuality, and faithfulness. We present guidelines and
evaluate human agreement as well as subjective judgments on recent system
outputs, comparing general-domain untuned approaches, a fine-tuned one, and a
prompt-based approach, to human performance. Results show that while GPT3
achieves impressive scores, it still underperforms humans, with varying quality
across genres. Human judgments reveal different types of errors in supervised,
prompted, and human-generated summaries, shedding light on the challenges of
producing a good summary.
Related papers
- GUMsley: Evaluating Entity Salience in Summarization for 12 English
Genres [14.37990666928991]
We present and evaluate GUMsley, the first entity salience dataset covering all named and non-named salient entities for 12 genres of English text.
We show that predicting or providing salient entities to several model architectures enhances performance and helps derive higher-quality summaries.
arXiv Detail & Related papers (2024-01-31T16:30:50Z)
- AugSumm: towards generalizable speech summarization using synthetic labels from large language model [61.73741195292997]
Abstractive speech summarization (SSUM) aims to generate human-like summaries from speech.
However, conventional SSUM models are mostly trained and evaluated with a single human-annotated, deterministic ground-truth (GT) summary.
We propose AugSumm, a method to leverage large language models (LLMs) as a proxy for human annotators to generate augmented summaries.
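The core recipe lends itself to a short sketch. The following is a minimal illustration of the idea, not the paper's actual pipeline; the model name, prompt wording, and OpenAI-style client are assumptions for demonstration:

```python
# Illustrative sketch only: use an LLM to paraphrase a ground-truth summary
# into additional "augmented" training labels, in the spirit of AugSumm.
# Model name and prompt are assumptions, not the paper's actual setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def augment_summary(gt_summary: str, n: int = 3) -> list[str]:
    """Ask the LLM for n alternative phrasings of one ground-truth summary."""
    prompt = (
        "Rewrite the following summary in different words while keeping "
        f"all of its facts unchanged:\n\n{gt_summary}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative stand-in
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=0.9,  # encourage diverse paraphrases
    )
    return [choice.message.content for choice in response.choices]
```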
arXiv Detail & Related papers (2024-01-10T18:39:46Z)
- Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks [45.550554287918885]
This paper focuses on evaluating the usefulness of text summaries with extrinsic methods.
We design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment.
We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks.
arXiv Detail & Related papers (2023-05-24T11:34:39Z)
- ChatGPT as a Factual Inconsistency Evaluator for Text Summarization [17.166794984161964]
We show that ChatGPT can evaluate factual inconsistency under a zero-shot setting.
It generally outperforms previous evaluation metrics on binary entailment inference, summary ranking, and consistency rating.
However, a closer inspection of ChatGPT's output reveals certain limitations including its preference for more lexically similar candidates, false reasoning, and inadequate understanding of instructions.
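A zero-shot consistency check of this kind can be sketched in a few lines; the prompt and model name below are illustrative assumptions, not the paper's exact setup:

```python
# Sketch of zero-shot factual-consistency judgment with a chat LLM.
# The paper evaluated ChatGPT; the model and prompt here are stand-ins.
from openai import OpenAI

client = OpenAI()

def is_consistent(document: str, summary: str) -> str:
    """Return the model's 'yes'/'no' judgment on binary entailment."""
    prompt = (
        "Decide if the summary is factually consistent with the article. "
        "Answer only 'yes' or 'no'.\n\n"
        f"Article:\n{document}\n\nSummary:\n{summary}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative stand-in for ChatGPT
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judgment
    )
    return response.choices[0].message.content.strip().lower()
```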
arXiv Detail & Related papers (2023-03-27T22:30:39Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based framework that comprehensively evaluates generated text against reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Human-in-the-loop Abstractive Dialogue Summarization [61.4108097664697]
We propose to incorporate different levels of human feedback into the training process.
This enables us to guide the models toward the behaviors humans care about in summaries.
arXiv Detail & Related papers (2022-12-19T19:11:27Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate summary quality without reference summaries via unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
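As a rough stand-in for the learned metric (which is trained contrastively rather than computed statically), a reference-free semantic score can be sketched from BERT embeddings; the mean pooling and model choice below are assumptions for illustration:

```python
# Simplified reference-free scoring sketch: cosine similarity between BERT
# embeddings of the document and its summary. The paper's actual metric is
# learned via contrastive training; this static score only shows the shape.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden layer into a single sentence vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

def reference_free_score(document: str, summary: str) -> float:
    """Higher = summary is semantically closer to the source document."""
    return torch.cosine_similarity(embed(document), embed(summary), dim=0).item()
```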
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
- On Faithfulness and Factuality in Abstractive Summarization [17.261247316769484]
We analyze the limitations of neural text generation models for abstractive document summarization.
We found that these models are highly prone to hallucinate content that is unfaithful to the input document.
We show that textual entailment measures better correlate with faithfulness than standard metrics.
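An entailment-based faithfulness check along these lines can be sketched with an off-the-shelf NLI model; the specific model (roberta-large-mnli) is an illustrative assumption, not the paper's exact setup:

```python
# Sketch of an entailment-based faithfulness check: score how strongly the
# source document (premise) entails the generated summary (hypothesis).
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def entailment_score(document: str, summary: str) -> float:
    """Probability that the document entails the summary."""
    scores = nli({"text": document, "text_pair": summary}, top_k=None, truncation=True)
    return next(s["score"] for s in scores if s["label"] == "ENTAILMENT")
```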
arXiv Detail & Related papers (2020-05-02T00:09:16Z)
- Unsupervised Opinion Summarization with Noising and Denoising [85.49169453434554]
We create a synthetic dataset from a corpus of user reviews by sampling a review, pretending it is a summary, and generating noisy versions thereof.
At test time, the model accepts genuine reviews and generates a summary containing salient opinions, treating those that do not reach consensus as noise.
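The synthetic-pair construction can be sketched as follows; the token-dropping noise below is a simplified assumption standing in for the paper's actual noising schemes:

```python
# Toy version of the synthetic-pair construction: sample one review as a
# pseudo-summary and corrupt it to produce noisy "inputs"; a denoising model
# would then be trained to reconstruct the sampled review from the noise.
import random

def make_noisy_versions(review: str, n_noisy: int = 8, drop_prob: float = 0.3) -> list[str]:
    """Generate noisy copies of a review by randomly dropping tokens."""
    tokens = review.split()
    noisy = []
    for _ in range(n_noisy):
        kept = [t for t in tokens if random.random() > drop_prob]
        noisy.append(" ".join(kept) if kept else review)
    return noisy

def synthetic_pair(reviews: list[str]) -> tuple[list[str], str]:
    """Pick one review as the pseudo-summary; its noisy versions act as inputs."""
    pseudo_summary = random.choice(reviews)
    return make_noisy_versions(pseudo_summary), pseudo_summary
```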
arXiv Detail & Related papers (2020-04-21T16:54:57Z)