How to Find Strong Summary Coherence Measures? A Toolbox and a
Comparative Study for Summary Coherence Measure Evaluation
- URL: http://arxiv.org/abs/2209.06517v2
- Date: Thu, 15 Sep 2022 12:18:37 GMT
- Title: How to Find Strong Summary Coherence Measures? A Toolbox and a
Comparative Study for Summary Coherence Measure Evaluation
- Authors: Julius Steen and Katja Markert
- Abstract summary: We conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field.
We introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders.
While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.
- Score: 3.434197496862117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically evaluating the coherence of summaries is of great significance
both to enable cost-efficient summarizer evaluation and as a tool for improving
coherence by selecting high-scoring candidate summaries. While many different
approaches have been suggested to model summary coherence, they are often
evaluated using disparate datasets and metrics. This makes it difficult to
understand their relative performance and identify ways forward towards better
summary coherence modelling. In this work, we conduct a large-scale
investigation of various methods for summary coherence modelling on an even
playing field. Additionally, we introduce two novel analysis measures,
intra-system correlation and bias matrices, that help identify biases in
coherence measures and provide robustness against system-level confounders.
While none of the currently available automatic coherence measures are able to
assign reliable coherence scores to system summaries across all evaluation
metrics, large-scale language models fine-tuned on self-supervised tasks show
promising results, as long as fine-tuning takes into account that they need to
generalize across different summary lengths.
Related papers
- CREAM: Comparison-Based Reference-Free ELO-Ranked Automatic Evaluation for Meeting Summarization [37.44018461165065]
CREAM (Comparison-Based Reference-Free Elo-Ranked Automatic Evaluation for Meeting Summarization) is a novel framework that addresses the challenges of evaluating meeting summaries.
By employing an ELO ranking system, our approach provides a robust mechanism for comparing the quality of different models or prompt configurations.
arXiv Detail & Related papers (2024-09-17T04:39:20Z) - Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z) - Is Summary Useful or Not? An Extrinsic Human Evaluation of Text
Summaries on Downstream Tasks [45.550554287918885]
This paper focuses on evaluating the usefulness of text summaries with extrinsic methods.
We design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment.
We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks.
arXiv Detail & Related papers (2023-05-24T11:34:39Z) - Evaluating and Improving Factuality in Multimodal Abstractive
Summarization [91.46015013816083]
We propose CLIPBERTScore to leverage the robustness and strong factuality detection performance between image-summary and document-summary.
We show that this simple combination of two metrics in the zero-shot achieves higher correlations than existing factuality metrics for document summarization.
Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
arXiv Detail & Related papers (2022-11-04T16:50:40Z) - SNaC: Coherence Error Detection for Narrative Summarization [73.48220043216087]
We introduce SNaC, a narrative coherence evaluation framework rooted in fine-grained annotations for long summaries.
We develop a taxonomy of coherence errors in generated narrative summaries and collect span-level annotations for 6.6k sentences across 150 book and movie screenplay summaries.
Our work provides the first characterization of coherence errors generated by state-of-the-art summarization models and a protocol for eliciting coherence judgments from crowd annotators.
arXiv Detail & Related papers (2022-05-19T16:01:47Z) - TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z) - Improving Factual Consistency of Abstractive Summarization via Question
Answering [25.725873545789046]
We present an approach to address factual consistency in summarization.
We first propose an efficient automatic evaluation metric to measure factual consistency.
We then propose a novel learning algorithm that maximizes the proposed metric during model training.
arXiv Detail & Related papers (2021-05-10T19:07:21Z) - A Statistical Analysis of Summarization Evaluation Metrics using
Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z) - Understanding the Extent to which Summarization Evaluation Metrics
Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z) - SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.