$\texttt{COSMIC}$: Mutual Information for Task-Agnostic Summarization Evaluation
- URL: http://arxiv.org/abs/2402.19457v3
- Date: Wed, 14 Aug 2024 14:06:10 GMT
- Title: $\texttt{COSMIC}$: Mutual Information for Task-Agnostic Summarization Evaluation
- Authors: Maxime Darrin, Philippe Formont, Jackie Chi Kit Cheung, Pablo Piantanida
- Abstract summary: We propose a novel task-oriented evaluation approach that assesses summarizers based on their capacity to produce summaries that are useful for downstream tasks, while preserving task outcomes.
We introduce $\texttt{COSMIC}$ as a practical implementation of this metric, demonstrating its strong correlation with human judgment-based metrics and its effectiveness in predicting downstream task performance.
- Score: 39.287235598507294
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Assessing the quality of summarizers poses significant challenges. In response, we propose a novel task-oriented evaluation approach that assesses summarizers based on their capacity to produce summaries that are useful for downstream tasks, while preserving task outcomes. We theoretically establish a direct relationship between the resulting error probability of these tasks and the mutual information between source texts and generated summaries. We introduce $\texttt{COSMIC}$ as a practical implementation of this metric, demonstrating its strong correlation with human judgment-based metrics and its effectiveness in predicting downstream task performance. Comparative analyses against established metrics like $\texttt{BERTScore}$ and $\texttt{ROUGE}$ highlight the competitive performance of $\texttt{COSMIC}$.
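The core quantity behind $\texttt{COSMIC}$ is the mutual information between source texts and their generated summaries. As a rough, conceptual sketch (not the paper's actual estimator), one can embed sources and summaries with a sentence encoder and estimate the mutual information between the two sets of embeddings under a joint-Gaussian assumption; the encoder choice, the PCA projection, and the Gaussian estimator below are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of a COSMIC-like score: mutual information (MI) between
# source-text embeddings and summary embeddings. All modeling choices here
# (sentence-transformers encoder, PCA projection, Gaussian MI estimate) are
# assumptions for the sake of the example, not the paper's actual method.
import numpy as np
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer


def gaussian_mi(x: np.ndarray, y: np.ndarray) -> float:
    """MI in nats under a joint-Gaussian assumption:
    I(X;Y) = 0.5 * (logdet Cov(X) + logdet Cov(Y) - logdet Cov([X, Y]))."""
    _, ld_x = np.linalg.slogdet(np.cov(x, rowvar=False))
    _, ld_y = np.linalg.slogdet(np.cov(y, rowvar=False))
    _, ld_xy = np.linalg.slogdet(np.cov(np.hstack([x, y]), rowvar=False))
    return 0.5 * (ld_x + ld_y - ld_xy)


def cosmic_like_score(sources: list[str], summaries: list[str], dim: int = 8) -> float:
    """Higher values mean the summaries retain more information about the sources."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice
    src = encoder.encode(sources)
    summ = encoder.encode(summaries)
    # Project to a low dimension so the covariance estimates stay well-conditioned;
    # this needs substantially more (source, summary) pairs than 2 * dim.
    src = PCA(n_components=dim).fit_transform(src)
    summ = PCA(n_components=dim).fit_transform(summ)
    return gaussian_mi(src, summ)
```

A summarizer would then be ranked by this score over a corpus of (source, summary) pairs, mirroring how the abstract ties downstream task error to the mutual information retained by the summaries.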
Related papers
- Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks [45.550554287918885]
This paper focuses on evaluating the usefulness of text summaries with extrinsic methods.
We design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment.
We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks.
arXiv Detail & Related papers (2023-05-24T11:34:39Z) - USB: A Unified Summarization Benchmark Across Tasks and Domains [68.82726887802856]
We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks.
We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models.
arXiv Detail & Related papers (2023-05-23T17:39:54Z) - "It's a Match!" -- A Benchmark of Task Affinity Scores for Joint Learning [74.14961250042629]
While the promises of Multi-Task Learning (MTL) are attractive, characterizing the conditions of its success is still an open problem in Deep Learning.
Estimating task affinity for joint learning is a key endeavor.
Recent work suggests that the training conditions themselves have a significant impact on the outcomes of MTL.
Yet, the literature lacks a benchmark to assess the effectiveness of task affinity estimation techniques.
arXiv Detail & Related papers (2023-01-07T15:16:35Z) - UniSumm and SummZoo: Unified Model and Diverse Benchmark for Few-Shot Summarization [54.59104881168188]
\textsc{UniSumm} is a unified few-shot summarization model pre-trained with multiple summarization tasks.
\textsc{SummZoo} is a new benchmark to better evaluate few-shot summarizers.
arXiv Detail & Related papers (2022-11-17T18:54:47Z) - How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation [3.434197496862117]
We conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field.
We introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders.
While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.
arXiv Detail & Related papers (2022-09-14T09:42:19Z) - Truth Discovery in Sequence Labels from Crowds [12.181422057560201]
Crowdsourcing platforms, such as Amazon Mechanical Turk (AMT), have been deployed to collect annotations for sequential labeling tasks.
Existing literature in annotation aggregation assumes that annotations are independent and thus faces challenges when handling the sequential label aggregation tasks.
We propose an optimization-based method that infers the ground truth labels using annotations provided by workers for sequential labeling tasks.
arXiv Detail & Related papers (2021-09-09T19:12:13Z) - Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
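To make the last point concrete, the toy example below (not from the cited paper) shows how n-gram overlap can diverge from information overlap: a negated near-copy of the reference keeps almost every token and scores highly under ROUGE, while a faithful paraphrase with little lexical overlap scores poorly.

```python
# Toy illustration (an assumption-laden example, not an experiment from the paper):
# lexical overlap, as measured by ROUGE, is not the same as information overlap.
from rouge_score import rouge_scorer  # pip install rouge-score

reference = "The central bank raised interest rates by 50 basis points on Tuesday."
faithful_paraphrase = "Rates went up by half a percentage point early this week."
negated_copy = "The central bank did not raise interest rates by 50 basis points on Tuesday."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for name, candidate in [("paraphrase", faithful_paraphrase), ("negation", negated_copy)]:
    scores = scorer.score(reference, candidate)
    print(name, {k: round(v.fmeasure, 2) for k, v in scores.items()})
# The negated copy scores far higher despite conveying the opposite information,
# which is the kind of mismatch the paper above highlights.
```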