Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress
- URL: http://arxiv.org/abs/2506.19571v1
- Date: Tue, 24 Jun 2025 12:35:00 GMT
- Title: Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress
- Authors: Lorenzo Proietti, Stefano Perrella, Roberto Navigli,
- Abstract summary: In Machine Translation (MT) evaluation, metric performance is assessed based on agreement with human judgments.<n>We incorporate human baselines in the MT meta-evaluation, that is, the assessment of MT metrics' capabilities.<n>Our results show that human annotators are not consistently superior to automatic metrics, with state-of-the-art metrics often ranking on par with or higher than human baselines.
- Score: 43.09028349076039
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In Machine Translation (MT) evaluation, metric performance is assessed based on agreement with human judgments. In recent years, automatic metrics have demonstrated increasingly high levels of agreement with humans. To gain a clearer understanding of metric performance and establish an upper bound, we incorporate human baselines in the MT meta-evaluation, that is, the assessment of MT metrics' capabilities. Our results show that human annotators are not consistently superior to automatic metrics, with state-of-the-art metrics often ranking on par with or higher than human baselines. Despite these findings suggesting human parity, we discuss several reasons for caution. Finally, we explore the broader implications of our results for the research field, asking: Can we still reliably measure improvements in MT evaluation? With this work, we aim to shed light on the limits of our ability to measure progress in the field, fostering discussion on an issue that we believe is crucial to the entire MT evaluation community.
Related papers
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z) - Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z) - Convergences and Divergences between Automatic Assessment and Human Evaluation: Insights from Comparing ChatGPT-Generated Translation and Neural Machine Translation [1.6982207802596105]
This study investigates the convergences and divergences between automated metrics and human evaluation.
To perform automatic assessment, four automated metrics are employed, while human evaluation incorporates the DQF-MQM error typology and six rubrics.
Results underscore the indispensable role of human judgment in evaluating the performance of advanced translation tools.
arXiv Detail & Related papers (2024-01-10T14:20:33Z) - Trained MT Metrics Learn to Cope with Machine-translated References [47.00411750716812]
We show that Prism+FT becomes more robust to machine-translated references.
This suggests that the effects of metric training go beyond the intended effect of improving overall correlation with human judgments.
arXiv Detail & Related papers (2023-12-01T12:15:58Z) - Text Style Transfer Evaluation Using Large Language Models [24.64611983641699]
Large Language Models (LLMs) have shown their capacity to match and even exceed average human performance.
We compare the results of different LLMs in TST using multiple input prompts.
Our findings highlight a strong correlation between (even zero-shot) prompting and human evaluation, showing that LLMs often outperform traditional automated metrics.
arXiv Detail & Related papers (2023-08-25T13:07:33Z) - INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained
Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with
Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z) - Experts, Errors, and Context: A Large-Scale Study of Human Evaluation
for Machine Translation [19.116396693370422]
We propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics framework.
We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs.
We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers.
arXiv Detail & Related papers (2021-04-29T16:42:09Z) - Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine
Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.