A Critical Look at Meta-evaluating Summarisation Evaluation Metrics
- URL: http://arxiv.org/abs/2409.19507v1
- Date: Sun, 29 Sep 2024 01:30:13 GMT
- Title: A Critical Look at Meta-evaluating Summarisation Evaluation Metrics
- Authors: Xiang Dai, Sarvnaz Karimi, Biaoyan Fang
- Abstract summary: We argue that the time is ripe to build more diverse benchmarks that enable the development of more robust evaluation metrics.
We call for research focusing on user-centric quality dimensions that consider the generated summary's communicative goal.
- Score: 11.541368732416506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective summarisation evaluation metrics enable researchers and practitioners to compare different summarisation systems efficiently. Estimating the effectiveness of an automatic evaluation metric, termed meta-evaluation, is a critically important research question. In this position paper, we review recent meta-evaluation practices for summarisation evaluation metrics and find that (1) evaluation metrics are primarily meta-evaluated on datasets consisting of examples from news summarisation datasets, and (2) there has been a noticeable shift in research focus towards evaluating the faithfulness of generated summaries. We argue that the time is ripe to build more diverse benchmarks that enable the development of more robust evaluation metrics and analyze the generalization ability of existing evaluation metrics. In addition, we call for research focusing on user-centric quality dimensions that consider the generated summary's communicative goal and the role of summarisation in the workflow.
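For readers unfamiliar with the term, meta-evaluation is usually operationalised by correlating an automatic metric's scores with human judgements, reported at the summary level and at the system level. The sketch below illustrates that calculation with Kendall's tau from scipy; the systems, metric scores, and human ratings are invented placeholders, not data from this paper.

```python
from collections import defaultdict
from scipy.stats import kendalltau

# (system, automatic metric score, human rating) for individual summaries;
# all values are invented placeholders.
examples = [
    ("sys_a", 0.42, 3.0), ("sys_a", 0.55, 4.0), ("sys_a", 0.38, 2.5),
    ("sys_b", 0.61, 4.5), ("sys_b", 0.47, 3.5), ("sys_b", 0.58, 4.0),
    ("sys_c", 0.33, 2.0), ("sys_c", 0.40, 3.0), ("sys_c", 0.36, 2.5),
]

# Summary-level: does the metric rank individual summaries as humans do?
tau_summary, _ = kendalltau([m for _, m, _ in examples],
                            [h for _, _, h in examples])

# System-level: average per system, then correlate the system averages.
per_system = defaultdict(lambda: ([], []))
for system, metric, human in examples:
    per_system[system][0].append(metric)
    per_system[system][1].append(human)
tau_system, _ = kendalltau(
    [sum(ms) / len(ms) for ms, _ in per_system.values()],
    [sum(hs) / len(hs) for _, hs in per_system.values()],
)

print(f"summary-level Kendall tau: {tau_summary:.3f}")
print(f"system-level Kendall tau:  {tau_system:.3f}")
```

Kendall's tau is only one common choice; Pearson and Spearman correlations are also widely reported, and conclusions can differ between summary-level and system-level results.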
Related papers
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks [45.550554287918885]
This paper focuses on evaluating the usefulness of text summaries with extrinsic methods.
We design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment.
We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks.
arXiv Detail & Related papers (2023-05-24T11:34:39Z)
- ED-FAITH: Evaluating Dialogue Summarization on Faithfulness [35.73012379398233]
We first present a systematic study of faithfulness metrics for dialogue summarization.
We observe that most metrics correlate poorly with human judgements despite performing well on news datasets.
We propose T0-Score -- a new metric for faithfulness evaluation.
arXiv Detail & Related papers (2022-11-15T19:33:50Z)
- How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation [3.434197496862117]
We conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field.
We introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders.
While none of the currently available automatic coherence measures can assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models show promising results, provided that fine-tuning accounts for their need to generalize across different summary lengths.
arXiv Detail & Related papers (2022-09-14T09:42:19Z)
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries (a minimal scoring sketch for both metrics appears after this list).
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
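To make the metrics discussed in the "Understanding the Extent..." entry above concrete, the sketch below scores one invented candidate/reference pair with ROUGE and BERTScore. It is a minimal illustration assuming the `rouge_score` and `bert_score` Python packages are installed; it does not reproduce that paper's token-alignment analysis.

```python
# Minimal scoring sketch for ROUGE and BERTScore on a single, invented
# candidate/reference pair. Assumes: pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bertscore

reference = "The committee approved the budget after a lengthy debate."
candidate = "After a long debate, the committee passed the budget."

# ROUGE: n-gram and longest-common-subsequence overlap with the reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, result in scorer.score(reference, candidate).items():
    print(f"{name}: precision={result.precision:.3f} "
          f"recall={result.recall:.3f} f1={result.fmeasure:.3f}")

# BERTScore: greedy token alignment in contextual embedding space.
P, R, F1 = bertscore([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```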
This list is automatically generated from the titles and abstracts of the papers on this site.