Scientific Credibility of Machine Translation Research: A
Meta-Evaluation of 769 Papers
- URL: http://arxiv.org/abs/2106.15195v1
- Date: Tue, 29 Jun 2021 09:30:17 GMT
- Title: Scientific Credibility of Machine Translation Research: A
Meta-Evaluation of 769 Papers
- Authors: Benjamin Marie, Atsushi Fujita, Raphael Rubino
- Abstract summary: This paper presents the first large-scale meta-evaluation of machine translation (MT).
We annotated MT evaluations conducted in 769 research papers published from 2010 to 2020.
- Score: 21.802259336894068
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the first large-scale meta-evaluation of machine
translation (MT). We annotated MT evaluations conducted in 769 research papers
published from 2010 to 2020. Our study shows that practices for automatic MT
evaluation have dramatically changed during the past decade and follow
concerning trends. An increasing number of MT evaluations exclusively rely on
differences between BLEU scores to draw conclusions, without performing any
kind of statistical significance testing or human evaluation, while at least
108 metrics claiming to be better than BLEU have been proposed. MT evaluations
in recent papers tend to copy and compare automatic metric scores from previous
work to claim the superiority of a method or an algorithm, without confirming
that exactly the same training, validation, and test data have been used or
that the metric scores are comparable. Furthermore, tools for reporting
standardized metric scores are still far from being widely adopted by the MT
community. After showing how the accumulation of these pitfalls leads to
dubious evaluation, we propose a guideline to encourage better automatic MT
evaluation along with a simple meta-evaluation scoring method to assess its
credibility.
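Two of the practices the abstract finds missing, standardized metric reporting and significance testing on BLEU differences, can be illustrated with a minimal sketch. This is not the paper's own code: the example sentences, the helper names, and the 1000-sample setting are illustrative assumptions; only sacrebleu.corpus_bleu is a real library call.

```python
# Minimal sketch: standardized BLEU via sacreBLEU plus a paired bootstrap
# resampling test on the difference between two systems. Example data and
# helper names are illustrative assumptions, not the paper's method.
import random
import sacrebleu

refs  = ["The cat sat on the mat .", "It is raining today .", "He reads a book ."]
sys_a = ["The cat sat on the mat .", "It rains today .",      "He reads a book ."]
sys_b = ["A cat is on the mat .",    "It is raining now .",   "He is reading a book ."]

def bleu(hyps, ref_sents):
    # sacreBLEU computes a standardized, comparable corpus-level BLEU.
    return sacrebleu.corpus_bleu(hyps, [ref_sents]).score

def paired_bootstrap(a, b, ref_sents, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A scores above system B."""
    rng = random.Random(seed)
    ids = list(range(len(ref_sents)))
    wins_a = 0
    for _ in range(n_samples):
        sample = [rng.choice(ids) for _ in ids]  # resample sentence indices with replacement
        score_a = bleu([a[i] for i in sample], [ref_sents[i] for i in sample])
        score_b = bleu([b[i] for i in sample], [ref_sents[i] for i in sample])
        wins_a += score_a > score_b
    return wins_a / n_samples

print(f"BLEU A = {bleu(sys_a, refs):.1f}  BLEU B = {bleu(sys_b, refs):.1f}")
print(f"p(A > B) over bootstrap samples = {paired_bootstrap(sys_a, sys_b, refs):.3f}")
```

Reporting the sacreBLEU signature alongside the score, and only claiming superiority when the bootstrap (or an equivalent significance test) supports it, addresses two of the pitfalls counted in the meta-evaluation.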
Related papers
- A Comparative Study of Quality Evaluation Methods for Text Summarization [0.5512295869673147]
This paper proposes a novel method based on large language models (LLMs) for evaluating text summarization.
Our results show that LLM-based evaluation aligns closely with human evaluation, while widely used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency.
arXiv Detail & Related papers (2024-06-30T16:12:37Z) - MT-Ranker: Reference-free machine translation evaluation by inter-system
ranking [14.188948302661933]
We show that MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21.
MT-Ranker sets a new state of the art against both reference-free and reference-based baselines.
arXiv Detail & Related papers (2024-01-30T15:30:03Z) - Trained MT Metrics Learn to Cope with Machine-translated References [47.00411750716812]
We show that Prism+FT becomes more robust to machine-translated references.
This suggests that the effects of metric training go beyond the intended effect of improving overall correlation with human judgments.
arXiv Detail & Related papers (2023-12-01T12:15:58Z) - BLEURT Has Universal Translations: An Analysis of Automatic Metrics by
Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z) - Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z) - An Overview on Machine Translation Evaluation [6.85316573653194]
Machine translation (MT) has become one of the important tasks in AI research and development.
The evaluation task of MT is not only to evaluate the quality of machine translation, but also to give timely feedback to machine translation researchers.
This report mainly includes a brief history of machine translation evaluation (MTE), the classification of research methods on MTE, and the cutting-edge progress.
arXiv Detail & Related papers (2022-02-22T16:58:28Z) - Uncertainty-Aware Machine Translation Evaluation [0.716879432974126]
We introduce uncertainty-aware MT evaluation and analyze the trustworthiness of the predicted quality.
We compare the performance of our uncertainty-aware MT evaluation methods across multiple language pairs from the QT21 dataset and the WMT20 metrics task.
arXiv Detail & Related papers (2021-09-13T22:46:03Z) - Difficulty-Aware Machine Translation Evaluation [19.973201669851626]
We propose a novel difficulty-aware machine translation evaluation metric.
A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function (see the sketch after this list).
Our proposed method performs well even when all the MT systems are very competitive.
arXiv Detail & Related papers (2021-07-30T02:45:36Z) - A Statistical Analysis of Summarization Evaluation Metrics using
Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z) - Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine
Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z) - PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative
Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
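For the Difficulty-Aware Machine Translation Evaluation entry above, a hedged sketch of the weighting idea it describes, where sentences that most systems translate poorly receive larger weights, may help. Defining difficulty as one minus the mean sentence-level score across systems, the normalization, and the numbers below are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of difficulty-aware weighting (illustrative, not the paper's
# exact formulation): sentence-level metric scores are assumed to lie in [0, 1],
# and a sentence's difficulty is one minus its mean score over all systems.
from typing import Dict, List

def difficulty_weighted_score(scores: Dict[str, List[float]], system: str) -> float:
    """scores maps system name -> per-sentence metric scores (same sentence order)."""
    n_sents = len(next(iter(scores.values())))
    # Sentences nobody translates well get larger weights.
    difficulty = [1.0 - sum(s[i] for s in scores.values()) / len(scores)
                  for i in range(n_sents)]
    total = sum(difficulty) or 1.0
    return sum(d * scores[system][i] for i, d in enumerate(difficulty)) / total

# Illustrative numbers only: three systems scored on three sentences.
scores = {
    "sys_a": [0.9, 0.4, 0.8],
    "sys_b": [0.9, 0.5, 0.7],
    "sys_c": [0.8, 0.3, 0.8],
}
for name in scores:
    print(name, round(difficulty_weighted_score(scores, name), 3))
```

Because the easy first sentence contributes little, the ranking is driven by the hard second sentence, which is the behavior the abstract attributes to the metric when all systems are otherwise very competitive.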
This list is automatically generated from the titles and abstracts of the papers on this site.