Play the Shannon Game With Language Models: A Human-Free Approach to
Summary Evaluation
- URL: http://arxiv.org/abs/2103.10918v1
- Date: Fri, 19 Mar 2021 17:27:58 GMT
- Title: Play the Shannon Game With Language Models: A Human-Free Approach to
Summary Evaluation
- Authors: Nicholas Egan, Oleg Vasilyev, John Bohannon
- Abstract summary: We introduce new summary evaluation metrics that use a pretrained language model to estimate the information shared between a document and its summary.
These metrics are a modern take on the Shannon Game, a method for summary quality scoring proposed decades ago.
We empirically verify that the introduced metrics correlate with human judgement based on coverage, overall quality, and five summary dimensions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of a summary is to concisely state the most important information in
a document. With this principle in mind, we introduce new reference-free
summary evaluation metrics that use a pretrained language model to estimate the
information shared between a document and its summary. These metrics are a
modern take on the Shannon Game, a method for summary quality scoring proposed
decades ago, where we replace human annotators with language models. We also
view these metrics as an extension of BLANC, a recently proposed approach to
summary quality measurement based on the performance of a language model with
and without the help of a summary. Using GPT-2, we empirically verify that the
introduced metrics correlate with human judgement based on coverage, overall
quality, and five summary dimensions.
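Before the related work, a brief illustration may help. The following is a minimal sketch of a Shannon-Game-style score, assuming GPT-2 through the Hugging Face transformers library: it estimates the information a summary carries about its document as the gain in document log-likelihood when the summary is prepended as context. The function names, conditioning scheme, and lack of long-document handling are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of a Shannon-Game-style summary score (illustrative, not the
# authors' code). Idea: info(summary -> doc) ~ log p(doc | summary) - log p(doc).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def doc_log_likelihood(document: str, context: str = "") -> float:
    """Sum of log p(token | preceding tokens) over the document's tokens,
    optionally conditioned on a context string prepended to the document."""
    ctx_ids = tokenizer.encode(context) if context else []
    doc_ids = tokenizer.encode(document)
    # Note: GPT-2's window is 1024 tokens; long documents would need chunking.
    input_ids = torch.tensor([ctx_ids + doc_ids])
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # row k predicts token k+1
    total = 0.0
    for i, tok in enumerate(doc_ids):
        pos = len(ctx_ids) + i - 1  # logits row that predicts this document token
        if pos >= 0:                # the first token of an unconditioned sequence
            total += log_probs[pos, tok].item()  # has no prediction, so skip it
    return total

def shannon_score(document: str, summary: str) -> float:
    """How many nats of information the summary provides about the document."""
    return doc_log_likelihood(document, summary) - doc_log_likelihood(document)
```
A higher score means the summary helps the language model predict the document better, i.e. it covers more of the document's information; the paper also normalizes such quantities, which this sketch omits.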
Related papers
- Information-Theoretic Distillation for Reference-less Summarization [67.51150817011617]
We present a novel framework for distilling a powerful summarizer based on an information-theoretic objective for summarization.
We start off from Pythia-2.8B as the teacher model, which is not yet capable of summarization.
We arrive at a compact but powerful summarizer with only 568M parameters that performs competitively against ChatGPT.
arXiv Detail & Related papers (2024-03-20T17:42:08Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluating the faithfulness of machine-generated text by computing the longest noncontiguous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we fine-tune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
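The entry above fine-tunes a model to generate the LSS; as a purely lexical stand-in for the underlying notion, the longest (possibly noncontiguous) subsequence of claim tokens that appears in order in the context can be computed with a standard longest-common-subsequence dynamic program. This is a simplifying sketch of the concept, not the paper's learned metric.
```python
# Lexical illustration of a "longest supported subsequence" (the paper's LSS is
# generated by a fine-tuned model; this LCS dynamic program is only a stand-in).
def longest_supported_subsequence(claim: str, context: str) -> list[str]:
    """Longest subsequence of claim tokens that also occurs, in order, in the context."""
    a, b = claim.split(), context.split()
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # dp[i][j] = LCS of a[:i], b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    out, i, j = [], m, n  # backtrack to recover the subsequence itself
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

def lss_ratio(claim: str, context: str) -> float:
    """Fraction of claim tokens supported by the context, a crude faithfulness score."""
    return len(longest_supported_subsequence(claim, context)) / max(1, len(claim.split()))
```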
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation [68.9440575276396]
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention.
arXiv Detail & Related papers (2023-05-01T17:36:06Z)
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
- WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization [41.578594261746055]
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of crosslingual abstractive summarization systems.
We extract article and summary pairs in 18 languages from WikiHow, a high-quality, collaborative resource of how-to guides on a diverse set of topics written by human authors.
We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article.
arXiv Detail & Related papers (2020-10-07T00:28:05Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate summary quality without reference summaries via unsupervised contrastive learning.
Specifically, we design a new metric, based on BERT, that covers both linguistic quality and semantic informativeness.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
- Fill in the BLANC: Human-free quality estimation of document summaries [11.92436948211501]
We present BLANC, a new approach to the automatic estimation of document summary quality.
BLANC scores correlate with human evaluations as well as ROUGE-family summary quality measurements do.
arXiv Detail & Related papers (2020-02-23T06:21:43Z)
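BLANC, as the main abstract above notes, measures a language model's performance "with and without the help of a summary." A hedged sketch of that masked-reconstruction idea follows, assuming BERT via transformers; the masking scheme, filler prefix, and function names are illustrative choices, not the published implementation.
```python
# Sketch of a BLANC-style score (illustration only): how much does prepending the
# summary help a masked language model restore masked-out document tokens?
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def masked_accuracy(sentence: str, prefix: str) -> float:
    """Fraction of alternately masked sentence tokens the model restores correctly."""
    prefix_ids = tokenizer.encode(prefix, add_special_tokens=False)
    sent_ids = tokenizer.encode(sentence, add_special_tokens=False)
    correct, total = 0, 0
    for start in (0, 1):  # two passes, masking every other token each time
        masked = list(sent_ids)
        positions = list(range(start, len(sent_ids), 2))
        for p in positions:
            masked[p] = tokenizer.mask_token_id
        # Note: BERT's window is 512 tokens; long inputs would need truncation.
        ids = [tokenizer.cls_token_id] + prefix_ids + masked + [tokenizer.sep_token_id]
        with torch.no_grad():
            logits = model(torch.tensor([ids])).logits[0]
        offset = 1 + len(prefix_ids)  # [CLS] plus the prefix precede the sentence
        for p in positions:
            correct += int(logits[offset + p].argmax().item() == sent_ids[p])
            total += 1
    return correct / max(1, total)

def blanc_like(summary: str, sentence: str) -> float:
    """Accuracy gain from seeing the summary vs. a crude length-matched filler."""
    filler = " ".join("." for _ in summary.split())
    return masked_accuracy(sentence, summary) - masked_accuracy(sentence, filler)
```
Averaged over a document's sentences, a positive gain indicates the summary genuinely helps reconstruction, which is the signal BLANC correlates with human quality judgments.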
- Learning by Semantic Similarity Makes Abstractive Summarization Better [13.324006587838522]
We compare summaries generated by a recent language model, BART, with the reference summaries from a benchmark dataset, CNN/DM.
Interestingly, model-generated summaries receive higher scores relative to reference summaries.
arXiv Detail & Related papers (2020-02-18T17:59:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.