Scientific Credibility of Machine Translation Research: A
Meta-Evaluation of 769 Papers
- URL: http://arxiv.org/abs/2106.15195v1
- Date: Tue, 29 Jun 2021 09:30:17 GMT
- Title: Scientific Credibility of Machine Translation Research: A
Meta-Evaluation of 769 Papers
- Authors: Benjamin Marie, Atsushi Fujita, Raphael Rubino
- Abstract summary: This paper presents the first large-scale meta-evaluation of machine translation (MT).
We annotated MT evaluations conducted in 769 research papers published from 2010 to 2020.
- Score: 21.802259336894068
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the first large-scale meta-evaluation of machine
translation (MT). We annotated MT evaluations conducted in 769 research papers
published from 2010 to 2020. Our study shows that practices for automatic MT
evaluation have dramatically changed during the past decade and follow
concerning trends. An increasing number of MT evaluations exclusively rely on
differences between BLEU scores to draw conclusions, without performing any
kind of statistical significance testing or human evaluation, while at least
108 metrics claiming to be better than BLEU have been proposed. MT evaluations
in recent papers tend to copy and compare automatic metric scores from previous
work to claim the superiority of a method or an algorithm, without confirming
that exactly the same training, validation, and test data have been used or
that the metric scores are comparable. Furthermore, tools for reporting
standardized metric scores are still far from being widely adopted by the MT
community. After showing how the accumulation of these pitfalls leads to
dubious evaluation, we propose a guideline to encourage better automatic MT
evaluation along with a simple meta-evaluation scoring method to assess its
credibility.
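Two of the practices the abstract finds missing, standardized metric reporting and significance testing on BLEU differences, can be illustrated with a minimal sketch. This is not the paper's own code: the example sentences, the helper names, and the 1000-sample setting are illustrative assumptions; only sacrebleu.corpus_bleu is a real library call.

```python
# Minimal sketch: standardized BLEU via sacreBLEU plus a paired bootstrap
# resampling test on the difference between two systems. Example data and
# helper names are illustrative assumptions, not the paper's method.
import random
import sacrebleu

refs  = ["The cat sat on the mat .", "It is raining today .", "He reads a book ."]
sys_a = ["The cat sat on the mat .", "It rains today .",      "He reads a book ."]
sys_b = ["A cat is on the mat .",    "It is raining now .",   "He is reading a book ."]

def bleu(hyps, ref_sents):
    # sacreBLEU computes a standardized, comparable corpus-level BLEU.
    return sacrebleu.corpus_bleu(hyps, [ref_sents]).score

def paired_bootstrap(a, b, ref_sents, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A scores above system B."""
    rng = random.Random(seed)
    ids = list(range(len(ref_sents)))
    wins_a = 0
    for _ in range(n_samples):
        sample = [rng.choice(ids) for _ in ids]  # resample sentence indices with replacement
        score_a = bleu([a[i] for i in sample], [ref_sents[i] for i in sample])
        score_b = bleu([b[i] for i in sample], [ref_sents[i] for i in sample])
        wins_a += score_a > score_b
    return wins_a / n_samples

print(f"BLEU A = {bleu(sys_a, refs):.1f}  BLEU B = {bleu(sys_b, refs):.1f}")
print(f"p(A > B) over bootstrap samples = {paired_bootstrap(sys_a, sys_b, refs):.3f}")
```

Reporting the sacreBLEU signature alongside the score, and only claiming superiority when the bootstrap (or an equivalent significance test) supports it, addresses two of the pitfalls counted in the meta-evaluation.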
Related papers
- A Comparative Study of Quality Evaluation Methods for Text Summarization [0.5512295869673147]
This paper proposes a novel method based on large language models (LLMs) for evaluating text summarization.
Our results show that LLM-based evaluation aligns closely with human evaluation, while widely used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency.
arXiv Detail & Related papers (2024-06-30T16:12:37Z) - MT-Ranker: Reference-free machine translation evaluation by inter-system
ranking [14.188948302661933]
We show that MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21.
MT-Ranker sets a new state of the art against both reference-free and reference-based baselines.
arXiv Detail & Related papers (2024-01-30T15:30:03Z) - Trained MT Metrics Learn to Cope with Machine-translated References [47.00411750716812]
We show that Prism+FT becomes more robust to machine-translated references.
This suggests that the effects of metric training go beyond the intended effect of improving overall correlation with human judgments.
arXiv Detail & Related papers (2023-12-01T12:15:58Z) - BLEURT Has Universal Translations: An Analysis of Automatic Metrics by
Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z) - Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z) - An Overview on Machine Translation Evaluation [6.85316573653194]
Machine translation (MT) has become one of the important tasks in AI research and development.
The evaluation task of MT is not only to evaluate the quality of machine translation, but also to give timely feedback to machine translation researchers.
This report mainly includes a brief history of machine translation evaluation (MTE), the classification of research methods on MTE, and the cutting-edge progress.
arXiv Detail & Related papers (2022-02-22T16:58:28Z) - Uncertainty-Aware Machine Translation Evaluation [0.716879432974126]
We introduce uncertainty-aware MT evaluation and analyze the trustworthiness of the predicted quality.
We compare the performance of our uncertainty-aware MT evaluation methods across multiple language pairs from the QT21 dataset and the WMT20 metrics task.
arXiv Detail & Related papers (2021-09-13T22:46:03Z) - Difficulty-Aware Machine Translation Evaluation [19.973201669851626]
We propose a novel difficulty-aware machine translation evaluation metric.
A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function (see the sketch after this list).
Our proposed method performs well even when all the MT systems are very competitive.
arXiv Detail & Related papers (2021-07-30T02:45:36Z) - A Statistical Analysis of Summarization Evaluation Metrics using
Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z) - Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine
Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z) - PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative
Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
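For the Difficulty-Aware Machine Translation Evaluation entry above, a hedged sketch of the weighting idea it describes, where sentences that most systems translate poorly receive larger weights, may help. Defining difficulty as one minus the mean sentence-level score across systems, the normalization, and the numbers below are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of difficulty-aware weighting (illustrative, not the paper's
# exact formulation): sentence-level metric scores are assumed to lie in [0, 1],
# and a sentence's difficulty is one minus its mean score over all systems.
from typing import Dict, List

def difficulty_weighted_score(scores: Dict[str, List[float]], system: str) -> float:
    """scores maps system name -> per-sentence metric scores (same sentence order)."""
    n_sents = len(next(iter(scores.values())))
    # Sentences nobody translates well get larger weights.
    difficulty = [1.0 - sum(s[i] for s in scores.values()) / len(scores)
                  for i in range(n_sents)]
    total = sum(difficulty) or 1.0
    return sum(d * scores[system][i] for i, d in enumerate(difficulty)) / total

# Illustrative numbers only: three systems scored on three sentences.
scores = {
    "sys_a": [0.9, 0.4, 0.8],
    "sys_b": [0.9, 0.5, 0.7],
    "sys_c": [0.8, 0.3, 0.8],
}
for name in scores:
    print(name, round(difficulty_weighted_score(scores, name), 3))
```

Because the easy first sentence contributes little, the ranking is driven by the hard second sentence, which is the behavior the abstract attributes to the metric when all systems are otherwise very competitive.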
This list is automatically generated from the titles and abstracts of the papers on this site.