An Automatic Evaluation of the WMT22 General Machine Translation Task
- URL: http://arxiv.org/abs/2209.14172v1
- Date: Wed, 28 Sep 2022 15:31:57 GMT
- Title: An Automatic Evaluation of the WMT22 General Machine Translation Task
- Authors: Benjamin Marie
- Abstract summary: It evaluates a total of 185 systems for 21 translation directions.
It highlights some of the current limits of state-of-the-art machine translation systems.
- Score: 9.442139459221785
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This report presents an automatic evaluation of the general machine
translation task of the Seventh Conference on Machine Translation (WMT22). It
evaluates a total of 185 systems for 21 translation directions including
high-resource to low-resource language pairs and from closely related to
distant languages. This large-scale automatic evaluation highlights some of the
current limits of state-of-the-art machine translation systems. It also shows
how automatic metrics, namely chrF, BLEU, and COMET, can complement one another
to mitigate their respective limits in terms of interpretability and accuracy.
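As a concrete illustration of how these metrics can be combined in practice, here is a minimal sketch using the sacrebleu and unbabel-comet Python packages. The example sentences are placeholders, not data from the report, and the COMET checkpoint name is an assumption (a standard WMT22-era model).

```python
# Minimal sketch: scoring one system output with chrF, BLEU, and COMET.
# Requires: pip install sacrebleu unbabel-comet
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = ["Der Bericht wurde gestern veröffentlicht."]   # placeholder data
hypotheses = ["The report was published yesterday."]
references = ["The report came out yesterday."]

# Surface-level metrics: cheap and interpretable, reference-based.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")

# Neural metric: typically better correlated with human judgments but
# less interpretable. Checkpoint name and result attribute assume
# unbabel-comet 2.x.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
comet_out = model.predict(data, batch_size=8, gpus=0)
print(f"COMET: {comet_out.system_score:.4f}")
```

Reporting all three side by side is one way to hedge against the blind spots of any single metric.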
Related papers
- Preliminary Ranking of WMT25 General Machine Translation Systems [58.40564895086757]
We present the preliminary rankings of machine translation (MT) systems submitted to the WMT25 General Machine Translation Shared Task.
The official WMT25 ranking will be based on human evaluation, which is more reliable and will supersede these results.
arXiv Detail & Related papers (2025-08-11T17:22:31Z)
- MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation [1.7775825387442485]
MT-LENS is a framework designed to evaluate Machine Translation (MT) systems across a variety of tasks.
It offers a user-friendly platform to compare systems and analyze translations with interactive visualizations.
arXiv Detail & Related papers (2024-12-16T09:57:28Z)
- Modeling User Preferences with Automatic Metrics: Creating a High-Quality Preference Dataset for Machine Translation [18.077562738603792]
We propose an approach that leverages the best of both worlds: human quality assessments and automatic metrics.
We first collect sentence-level quality assessments from professional linguists on translations generated by multiple high-quality MT systems.
We then use this analysis to curate a new dataset, MT-Pref, which comprises 18k instances covering 18 language directions.
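For intuition, here is a hypothetical sketch of what one instance in such a preference dataset might look like. The field names and class are illustrative assumptions, not MT-Pref's actual schema.

```python
# Illustrative only: a possible shape for one MT preference instance
# derived from sentence-level quality assessments. Field names are
# hypothetical, not the actual MT-Pref schema.
from dataclasses import dataclass

@dataclass
class PreferenceInstance:
    source: str          # source sentence
    chosen: str          # translation rated higher by linguists
    rejected: str        # translation rated lower
    language_pair: str   # e.g. "de-en"

example = PreferenceInstance(
    source="Der Bericht wurde gestern veröffentlicht.",
    chosen="The report was published yesterday.",
    rejected="The report was publish yesterday.",
    language_pair="de-en",
)
```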
arXiv Detail & Related papers (2024-10-10T10:09:54Z)
- HW-TSC's Submission to the CCMT 2024 Machine Translation Tasks [12.841065384808733]
We participate in the bilingual machine translation task and multi-domain machine translation task.
For these two translation tasks, we use training strategies such as regularized dropout, bidirectional training, data diversification, forward translation, back translation, alternated training, curriculum learning, and transductive ensemble learning.
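Of these strategies, regularized dropout (R-Drop) is the most self-contained to illustrate: the same input is passed through the model twice with different dropout masks, and a symmetric KL term pulls the two output distributions together. A minimal PyTorch sketch of that standard formulation follows; it is not the team's actual code, and `model` is assumed to return logits.

```python
# Minimal R-Drop sketch (standard formulation; not HW-TSC's code).
# Assumes model(inputs) returns per-example logits.
import torch
import torch.nn.functional as F

def r_drop_loss(model, inputs, labels, alpha=5.0):
    # Two forward passes; dropout yields two different distributions
    # for identical inputs.
    logits1 = model(inputs)
    logits2 = model(inputs)

    # Standard cross-entropy on both passes.
    ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)

    # Symmetric KL divergence regularizes the two passes toward each other.
    kl = F.kl_div(F.log_softmax(logits1, dim=-1),
                  F.softmax(logits2, dim=-1), reduction="batchmean") + \
         F.kl_div(F.log_softmax(logits2, dim=-1),
                  F.softmax(logits1, dim=-1), reduction="batchmean")
    return ce + alpha * kl / 2
```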
arXiv Detail & Related papers (2024-09-23T09:20:19Z)
- Preliminary WMT24 Ranking of General MT Systems and LLMs [69.82909844246127]
This is the preliminary ranking of WMT24 General MT systems based on automatic metrics.
The official ranking will be based on human evaluation, which is more reliable than the automatic ranking and will supersede it.
arXiv Detail & Related papers (2024-07-29T11:01:17Z)
- Convergences and Divergences between Automatic Assessment and Human Evaluation: Insights from Comparing ChatGPT-Generated Translation and Neural Machine Translation [1.6982207802596105]
This study investigates the convergences and divergences between automated metrics and human evaluation.
To perform automatic assessment, four automated metrics are employed, while human evaluation incorporates the DQF-MQM error typology and six rubrics.
Results underscore the indispensable role of human judgment in evaluating the performance of advanced translation tools.
arXiv Detail & Related papers (2024-01-10T14:20:33Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
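A rough sketch of what such MQM-style error prompting might look like is below; the prompt wording and the `call_llm` helper are illustrative assumptions, not the paper's exact prompt or API.

```python
# Illustrative MQM-style error-annotation prompt, loosely in the spirit
# of AutoMQM. Wording and the call_llm helper are assumptions, not the
# paper's exact setup.
AUTOMQM_STYLE_PROMPT = """You are an expert translation evaluator.
Source ({src_lang}): {source}
Translation ({tgt_lang}): {translation}

List each translation error as: <span> | <category> | <severity>,
where category is an MQM category (accuracy, fluency, terminology,
style, locale) and severity is major or minor. If there are no
errors, answer "no errors"."""

def annotate(call_llm, source, translation, src_lang="de", tgt_lang="en"):
    """call_llm is any text-in/text-out LLM interface (hypothetical)."""
    prompt = AUTOMQM_STYLE_PROMPT.format(
        src_lang=src_lang, tgt_lang=tgt_lang,
        source=source, translation=translation)
    return call_llm(prompt)  # free-text error annotations to be parsed
```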
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- Evaluating the Efficacy of Length-Controllable Machine Translation [38.672519854291174]
This work is the first systematic attempt to evaluate automatic metrics for length-controllable machine translation.
We conduct a rigorous human evaluation on two translation directions and evaluate 18 summarization or translation evaluation metrics.
We find that BLEURT and COMET have the highest correlation with human evaluation and are most suitable as evaluation metrics for length-controllable machine translation.
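The core methodology here, measuring how well a metric's scores correlate with human judgments, can be sketched in a few lines with scipy. The score arrays below are placeholders, not data from the paper.

```python
# Sketch of metric-human correlation analysis, the standard way to
# compare metrics such as BLEURT and COMET against human evaluation.
# The scores below are placeholders.
from scipy.stats import pearsonr, spearmanr

human_scores  = [0.82, 0.45, 0.91, 0.30, 0.77]  # e.g. direct assessments
metric_scores = [0.79, 0.52, 0.88, 0.35, 0.70]  # one metric's outputs

pearson, _ = pearsonr(human_scores, metric_scores)
spearman, _ = spearmanr(human_scores, metric_scores)
print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```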
arXiv Detail & Related papers (2023-05-03T17:50:33Z)
- Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation [160.07938471250048]
Interpretability and efficiency are two important considerations for the adoption of neural automatic metrics.
We develop strong-performing automatic metrics for reference-based summarization evaluation.
arXiv Detail & Related papers (2023-03-07T02:49:50Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
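For reference, segment-level (per-sentence) metric scores of the kind evaluated here, as opposed to corpus-level scores, are directly available in sacrebleu; the sentences below are placeholders.

```python
# Segment-level metric scores with sacrebleu: one score per sentence,
# as used in segment-level metric evaluation. Sentences are placeholders.
import sacrebleu

hypothesis = "The report was published yesterday."
reference = "The report came out yesterday."

sent_bleu = sacrebleu.sentence_bleu(hypothesis, [reference])
sent_chrf = sacrebleu.sentence_chrf(hypothesis, [reference])
print(f"sentence BLEU: {sent_bleu.score:.2f}  sentence chrF: {sent_chrf.score:.2f}")
```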
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- QEMind: Alibaba's Submission to the WMT21 Quality Estimation Shared Task [24.668012925628968]
We present our submissions to the WMT 2021 QE shared task.
We propose several useful features that evaluate the uncertainty of the translations, and use them to build our QE system, named QEMind.
We show that our multilingual systems outperform the best system in the Direct Assessment QE task of WMT 2020.
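One standard way to obtain such translation-uncertainty features is Monte Carlo dropout over the MT model's output distributions. The generic sketch below illustrates the technique only; it is not necessarily QEMind's exact feature set, and `model` is assumed to return logits.

```python
# Generic Monte Carlo dropout sketch for translation uncertainty,
# a common source of QE features. A standard technique, not
# necessarily QEMind's exact features. Assumes model(inputs) -> logits.
import torch

def mc_dropout_uncertainty(model, inputs, n_samples=8):
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        # Token-level log-probabilities under different dropout masks;
        # variance across samples signals model uncertainty.
        samples = torch.stack([
            torch.log_softmax(model(inputs), dim=-1)
            for _ in range(n_samples)
        ])
    return samples.var(dim=0).mean().item()  # scalar uncertainty feature
```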
arXiv Detail & Related papers (2021-12-30T02:27:29Z)
- Multilingual Machine Translation Systems from Microsoft for WMT21 Shared Task [95.06453182273027]
This report describes Microsoft's machine translation systems for the WMT21 shared task on large-scale multilingual machine translation.
Our model submissions to the shared task were built with DeltaLM (https://aka.ms/deltalm), a generic pre-trained multilingual encoder-decoder model.
Our final submissions ranked first on three tracks in terms of the automatic evaluation metric.
arXiv Detail & Related papers (2021-11-03T09:16:17Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
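One way to read that thresholding idea: across many system pairs, find the smallest metric gain at which the metric's preferred system also wins under human judgment sufficiently often. The sketch below is a hypothetical operationalization with toy data, not the paper's exact procedure.

```python
# Hypothetical sketch of thresholding a metric delta against human
# judgments: find the smallest metric gain at which the metric's
# preferred system also wins under human evaluation at least 95% of
# the time. Not the paper's exact procedure; data is illustrative.
def metric_threshold(pairs, target_agreement=0.95):
    """pairs: list of (metric_delta, human_prefers_same_system: bool)."""
    for threshold in sorted(d for d, _ in pairs):
        above = [(d, ok) for d, ok in pairs if d >= threshold]
        if above and sum(ok for _, ok in above) / len(above) >= target_agreement:
            return threshold
    return None  # no metric delta is reliable enough

# Toy data: (BLEU delta between two systems, humans agreed with metric).
pairs = [(0.3, False), (0.8, False), (1.2, True), (2.5, True), (4.1, True)]
print(metric_threshold(pairs))  # -> 1.2 under this toy data
```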
arXiv Detail & Related papers (2020-06-11T09:12:53Z)