Correction of Errors in Preference Ratings from Automated Metrics for
Text Generation
- URL: http://arxiv.org/abs/2306.03866v1
- Date: Tue, 6 Jun 2023 17:09:29 GMT
- Title: Correction of Errors in Preference Ratings from Automated Metrics for
Text Generation
- Authors: Jan Deriu, Pius von Däniken, Don Tuggener, Mark Cieliebak
- Abstract summary: We propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics.
We show that our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics.
- Score: 4.661309379738428
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A major challenge in the field of Text Generation is evaluation: Human
evaluations are cost-intensive, and automated metrics often display
considerable disagreement with human judgments. In this paper, we propose a
statistical model of Text Generation evaluation that accounts for the
error-proneness of automated metrics when used to generate preference rankings
between system outputs. We show that existing automated metrics are generally
over-confident in assigning significant differences between systems in this
setting. However, our model enables an efficient combination of human and
automated ratings to remedy the error-proneness of the automated metrics. We
show that using this combination, we only require about 50% of the human
annotations typically used in evaluations to arrive at robust and statistically
significant results while yielding the same evaluation outcome as the pure
human evaluation in 95% of cases. We showcase the benefits of our approach for
three text generation tasks: dialogue systems, machine translation, and text
summarization.
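As a rough illustration of this idea (not the paper's actual statistical model), the sketch below estimates an automated metric's agreement with human preferences on a small annotated subset, corrects the metric's raw win rate for its estimated error rate, and pools human and corrected automated preferences in a simple binomial test. All data are made up, and the binomial test merely stands in for the model described in the paper.

    # Illustrative sketch only; the numbers are synthetic and the plain binomial
    # test is a stand-in for the paper's statistical model.
    from scipy.stats import binomtest

    # Pairwise preferences between systems A and B: 1 = "A preferred", 0 = "B preferred".
    human_prefs = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]        # small human-annotated subset
    metric_prefs_same = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]  # metric judgments on the same pairs
    metric_prefs_rest = [1] * 60 + [0] * 40             # metric judgments on unannotated pairs

    # 1) Estimate how often the metric agrees with the human raters.
    acc = sum(h == m for h, m in zip(human_prefs, metric_prefs_same)) / len(human_prefs)

    # 2) Correct the metric's observed win rate for its error rate:
    #    observed = acc * true + (1 - acc) * (1 - true)  =>  true = (observed + acc - 1) / (2 * acc - 1)
    observed = sum(metric_prefs_rest) / len(metric_prefs_rest)
    corrected = (observed + acc - 1) / (2 * acc - 1) if acc > 0.5 else 0.5
    corrected = min(max(corrected, 0.0), 1.0)

    # 3) Pool human wins with error-corrected automated wins and test against chance.
    #    This ignores the extra variance introduced by the correction, which the
    #    paper's model is designed to account for.
    wins = sum(human_prefs) + round(corrected * len(metric_prefs_rest))
    total = len(human_prefs) + len(metric_prefs_rest)
    print(binomtest(wins, total, p=0.5, alternative="greater").pvalue)
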
Related papers
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by recent efforts toward fine-grained evaluation in several NLP tasks, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation [73.03318027164605]
We propose to use information that can be automatically extracted from the next user utterance as a proxy to measure the quality of the previous system response.
Our model generalizes across both spoken and written open-domain dialog corpora collected from real and paid users.
arXiv Detail & Related papers (2022-03-25T22:09:52Z)
- Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards, bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation (a toy sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-12-08T06:34:58Z)
- Automating Text Naturalness Evaluation of NLG Systems [0.0]
We present an attempt to automate the evaluation of text naturalness.
Instead of relying on human participants for scoring or labeling the text samples, we propose to automate the process.
We analyze the text probability fractions and observe how they are influenced by the size of the generative and discriminative models involved in the process.
arXiv Detail & Related papers (2020-06-23T18:48:33Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- Human or Machine: Automating Human Likeliness Evaluation of NLG Texts [0.0]
We propose to use a human likeliness score that shows the percentage of the output samples from a method that look as if they were written by a human.
As a follow-up, we plan to perform an empirical analysis of human-written and machine-generated texts to find the optimal setup of this evaluation approach.
arXiv Detail & Related papers (2020-06-05T00:57:52Z)
- A Human Evaluation of AMR-to-English Generation Systems [13.10463139842285]
We present the results of a new human evaluation which collects fluency and adequacy scores, as well as categorization of error types.
We discuss the relative quality of these systems and how our results compare to those of automatic metrics.
arXiv Detail & Related papers (2020-04-14T21:41:30Z)
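The linear-ensemble finding from the Bidimensional Leaderboards entry above can be pictured with a small synthetic sketch: three made-up metrics are combined by ordinary least squares against equally made-up human ratings, and the ensemble's correlation with the human ratings is compared to each metric's on its own. The data, the number of metrics, and the use of plain least squares are all illustrative assumptions, not the Billboards implementation.

    # Synthetic illustration of a linear metric ensemble; not the Billboards implementation.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    n = 200
    human = rng.normal(size=n)  # stand-in human quality ratings for n outputs
    # Three hypothetical automatic metrics, each a noisy view of the human rating.
    metrics = np.stack([human + rng.normal(scale=s, size=n) for s in (1.0, 1.5, 2.0)], axis=1)

    # Fit ensemble weights (plus an intercept) by least squares against the human ratings.
    X = np.column_stack([metrics, np.ones(n)])
    weights, *_ = np.linalg.lstsq(X, human, rcond=None)
    ensemble = X @ weights

    for i in range(metrics.shape[1]):
        r, _ = pearsonr(metrics[:, i], human)
        print(f"metric {i}: r = {r:.3f}")
    r, _ = pearsonr(ensemble, human)
    print(f"ensemble : r = {r:.3f}")  # fit and scored on the same data, so slightly optimistic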