The Glass Ceiling of Automatic Evaluation in Natural Language Generation
- URL: http://arxiv.org/abs/2208.14585v1
- Date: Wed, 31 Aug 2022 01:13:46 GMT
- Title: The Glass Ceiling of Automatic Evaluation in Natural Language Generation
- Authors: Pierre Colombo, Maxime Peyrard, Nathan Noiry, Robert West, Pablo
Piantanida
- Abstract summary: We take a step back and analyze recent progress by comparing the existing body of automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
- Score: 60.59732704936083
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic evaluation metrics capable of replacing human judgments are
critical to enabling fast development of new methods. Thus, numerous research
efforts have focused on crafting such metrics. In this work, we take a step
back and analyze recent progress by comparing the existing body of automatic
metrics and human metrics as a whole. Since metrics are used according to how
they rank systems, we compare metrics in the space of system rankings. Our extensive
statistical analysis reveals surprising findings: automatic metrics -- old and
new -- are much more similar to each other than to humans. Automatic metrics
are not complementary and rank systems similarly. Strikingly, human metrics
predict one another much better than the combination of all automatic metrics
predicts any single human metric. This is surprising because human metrics are
often designed to be independent and to capture different aspects of quality,
e.g., content fidelity or readability. We provide a discussion of these findings and
recommendations for future work in the field of evaluation.
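As a concrete illustration of what "comparing metrics in the space of system rankings" means, here is a minimal sketch with purely synthetic scores (none of the metric names, systems, or numbers come from the paper): example-level scores are averaged into system-level scores, and each pair of metrics is compared via Kendall's tau between the system rankings they induce.
```python
# Minimal sketch with synthetic data: compare metrics by the system rankings
# they induce. All metrics and scores below are hypothetical placeholders.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_systems, n_examples = 10, 200

# Hypothetical latent quality of each system's outputs on a shared test set.
system_quality = rng.normal(size=(n_systems, 1))
example_quality = system_quality + rng.normal(size=(n_systems, n_examples))

# Two automatic metrics that are similar noisy views of quality, and one noisier
# human metric that captures a partly different aspect. System-level scores are
# the mean of example-level scores.
auto_a = (example_quality + 0.3 * rng.normal(size=example_quality.shape)).mean(axis=1)
auto_b = (example_quality + 0.3 * rng.normal(size=example_quality.shape)).mean(axis=1)
human = (example_quality + 2.0 * rng.normal(size=example_quality.shape)).mean(axis=1)

def ranking_similarity(x, y):
    """Kendall's tau between the system rankings induced by two score vectors."""
    tau, _ = kendalltau(x, y)
    return tau

print("auto_a vs auto_b:", ranking_similarity(auto_a, auto_b))
print("auto_a vs human :", ranking_similarity(auto_a, human))
print("auto_b vs human :", ranking_similarity(auto_b, human))
```
Kendall's tau is a natural choice here because only the induced ordering of systems matters, not the scale of the raw metric scores.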
Related papers
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the accuracy, robustness, and fairness of the meta-evaluation process.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Favi-Score: A Measure for Favoritism in Automated Preference Ratings for Generative AI Evaluation [10.776099974329647]
We introduce a formal definition of favoritism in preference metrics.
We show that favoritism is strongly related to errors in final system rankings.
We propose that preference-based metrics ought to be evaluated on both sign accuracy scores and favoritism.
arXiv Detail & Related papers (2024-06-03T09:20:46Z)
- Correction of Errors in Preference Ratings from Automated Metrics for Text Generation [4.661309379738428]
We propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics.
We show that our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics.
arXiv Detail & Related papers (2023-06-06T17:09:29Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards, bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation (a sketch of such an ensemble appears after this list).
arXiv Detail & Related papers (2021-12-08T06:34:58Z)
- To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation [5.972205906525993]
We investigate which metrics have the highest accuracy for making system-level quality rankings for pairs of systems.
We show that the sole use of BLEU negatively affected the past development of improved models.
arXiv Detail & Related papers (2021-07-22T17:22:22Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- A Human Evaluation of AMR-to-English Generation Systems [13.10463139842285]
We present the results of a new human evaluation which collects fluency and adequacy scores, as well as a categorization of error types.
We discuss the relative quality of these systems and how our results compare to those of automatic metrics.
arXiv Detail & Related papers (2020-04-14T21:41:30Z)
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
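The abstract's finding that even the combination of all automatic metrics predicts a human metric poorly, and the linear ensemble from the Billboards entry above, both amount to regressing human scores on automatic scores at the system level. The sketch below illustrates that idea under purely synthetic data; the leave-one-system-out loop and all names are illustrative assumptions, not the authors' implementation.
```python
# Hedged sketch: a linear ensemble of automatic metrics fitted to predict a
# human metric at the system level, scored by leave-one-system-out prediction.
# All scores are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(1)
n_systems, n_auto_metrics = 12, 4

quality = rng.normal(size=n_systems)
# Automatic metrics: noisy, highly correlated views of the same latent quality.
auto = quality[:, None] + 0.1 * rng.normal(size=(n_systems, n_auto_metrics))
# Human metric: a different, noisier view of quality.
human = quality + 0.8 * rng.normal(size=n_systems)

errors = []
for held_out in range(n_systems):
    train = np.delete(np.arange(n_systems), held_out)
    # Least-squares fit of human scores on automatic scores (plus an intercept).
    X = np.column_stack([auto[train], np.ones(len(train))])
    coef, *_ = np.linalg.lstsq(X, human[train], rcond=None)
    pred = np.append(auto[held_out], 1.0) @ coef
    errors.append((pred - human[held_out]) ** 2)

print("leave-one-out MSE of the linear ensemble:", np.mean(errors))
```
When the automatic metrics are nearly redundant, as the paper reports, adding more of them to the ensemble brings little extra information about the held-out human scores.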