Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics
- URL: http://arxiv.org/abs/2204.10216v1
- Date: Thu, 21 Apr 2022 15:52:14 GMT
- Title: Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics
- Authors: Daniel Deutsch and Rotem Dror and Dan Roth
- Abstract summary: System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
- Score: 64.81682222169113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How reliably an automatic summarization evaluation metric replicates human
judgments of summary quality is quantified by system-level correlations. We
identify two ways in which the definition of the system-level correlation is
inconsistent with how metrics are used to evaluate systems in practice and
propose changes to rectify this disconnect. First, we calculate the system
score for an automatic metric using the full test set instead of the subset of
summaries judged by humans, which is currently standard practice. We
demonstrate how this small change leads to more precise estimates of
system-level correlations. Second, we propose to calculate correlations only on
pairs of systems that are separated by small differences in automatic scores
which are commonly observed in practice. This allows us to demonstrate that our
best estimate of the correlation of ROUGE to human judgments is near 0 in
realistic scenarios. The results from the analyses point to the need to collect
more high-quality human judgments and to improve automatic metrics when
differences in system scores are small.
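To make the two proposed changes concrete, the sketch below (not the authors' implementation) computes a system-level Kendall's tau twice: once over all system pairs, with each system's automatic score averaged over the full test set rather than the judged subset, and once restricted to pairs of systems whose automatic scores differ by at most a small threshold. The synthetic score arrays, the mean aggregation, and the `delta` threshold are illustrative assumptions; only SciPy's `kendalltau` is a real API.

```python
# Minimal sketch of the two changes described in the abstract, using
# hypothetical per-summary score arrays (not the authors' code or data).
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

num_systems, test_set_size, judged_size = 10, 1000, 100
# Hypothetical scores: rows are systems, columns are summaries.
metric_scores = rng.normal(size=(num_systems, test_set_size))  # e.g., ROUGE on the full test set
human_scores = rng.normal(size=(num_systems, judged_size))     # human judgments on the judged subset

# Change 1: compute each system's automatic score over the FULL test set,
# not just the subset of summaries that received human judgments.
metric_system_scores = metric_scores.mean(axis=1)
human_system_scores = human_scores.mean(axis=1)

# Standard system-level correlation over all system pairs.
tau_all, _ = kendalltau(metric_system_scores, human_system_scores)

# Change 2: restrict the correlation to pairs of systems whose automatic
# scores differ by at most a small threshold (the regime seen in practice).
delta = 0.05  # hypothetical threshold on the metric-score gap
concordant, discordant = 0, 0
for i, j in combinations(range(num_systems), 2):
    gap = metric_system_scores[i] - metric_system_scores[j]
    if abs(gap) > delta:
        continue  # skip pairs separated by large metric differences
    human_gap = human_system_scores[i] - human_system_scores[j]
    if gap * human_gap > 0:
        concordant += 1
    elif gap * human_gap < 0:
        discordant += 1

pairs = concordant + discordant
tau_close = (concordant - discordant) / pairs if pairs else float("nan")

print(f"system-level tau (all pairs):   {tau_all:.3f}")
print(f"system-level tau (close pairs): {tau_close:.3f}")
```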
Related papers
- Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge [51.93909886542317]
We show how *relying on a single aggregate correlation score* can obscure fundamental differences between human behavior and automatic evaluation methods.
We propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
arXiv Detail & Related papers (2024-10-03T03:08:29Z)
- Correction of Errors in Preference Ratings from Automated Metrics for Text Generation [4.661309379738428]
We propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics.
We show that our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics.
arXiv Detail & Related papers (2023-06-06T17:09:29Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- A Human Evaluation of AMR-to-English Generation Systems [13.10463139842285]
We present the results of a new human evaluation which collects fluency and adequacy scores, as well as categorization of error types.
We discuss the relative quality of these systems and how our results compare to those of automatic metrics.
arXiv Detail & Related papers (2020-04-14T21:41:30Z)
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Because they have not been systematically compared, it is not clear which kind of metric is most effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.