Preliminary WMT24 Ranking of General MT Systems and LLMs
- URL: http://arxiv.org/abs/2407.19884v1
- Date: Mon, 29 Jul 2024 11:01:17 GMT
- Title: Preliminary WMT24 Ranking of General MT Systems and LLMs
- Authors: Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondrej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popovic, Mariya Shmatova, Steinþór Steingrímsson, Vilém Zouhar
- Abstract summary: This is the preliminary ranking of WMT24 General MT systems based on automatic metrics.
The official ranking will be a human evaluation, which is superior to the automatic ranking and supersedes it.
- Score: 69.82909844246127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This is the preliminary ranking of WMT24 General MT systems based on automatic metrics. The official ranking will be a human evaluation, which is superior to the automatic ranking and supersedes it. The purpose of this report is not to interpret any findings but only provide preliminary results to the participants of the General MT task that may be useful during the writing of the system submission.
Related papers
- Preliminary Ranking of WMT25 General Machine Translation Systems [58.40564895086757]
We present the preliminary rankings of machine translation (MT) systems submitted to the WMT25 General Machine Translation Shared Task. The official WMT25 ranking will be based on human evaluation, which is more reliable and will supersede these results.
arXiv Detail & Related papers (2025-08-11T17:22:31Z) - TransEvalnia: Reasoning-based Evaluation and Ranking of Translations [10.036450974576745]
We present TransEvalnia, a prompting-based translation evaluation and ranking system that uses reasoning in performing its evaluations and ranking. We show that TransEvalnia performs as well as or better than the state-of-the-art MT-Ranker on our own English-Japanese data. We also note the sensitivity of our system, as well as MT-Ranker's, to the order in which the translations are presented, and we propose methods to address this position bias.
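The position bias noted above, where an LLM-based judge favors whichever translation is shown first, is commonly mitigated by querying the judge in both presentation orders and averaging. A minimal sketch of that idea, in which `rank_pair` and `biased_judge` are hypothetical stand-ins for the actual TransEvalnia or MT-Ranker comparator:

```python
# Debiasing a pairwise translation ranker by averaging over both
# presentation orders. `rank_pair(first, second)` is a hypothetical
# judge returning the probability that the FIRST shown translation
# is better.

def debiased_preference(rank_pair, trans_a, trans_b):
    """Return P(A better than B), averaged over both orders."""
    p_a_first = rank_pair(trans_a, trans_b)   # A shown first
    p_b_first = rank_pair(trans_b, trans_a)   # B shown first
    # When B is shown first, P(A better) = 1 - P(first shown better).
    return 0.5 * (p_a_first + (1.0 - p_b_first))

# Toy judge with a built-in bonus of +0.05 for the first position.
def biased_judge(first, second):
    true_score = {"good": 0.9, "bad": 0.1}
    base = true_score[first] / (true_score[first] + true_score[second])
    return base + 0.05  # position bias

# A constant first-position bias cancels exactly under averaging.
print(round(debiased_preference(biased_judge, "good", "bad"), 3))  # 0.9
```

Under this toy model the constant bias term cancels algebraically; in practice the bias is not constant, so averaging reduces rather than eliminates it.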
arXiv Detail & Related papers (2025-07-17T02:02:54Z) - Findings of the WMT 2024 Shared Task on Discourse-Level Literary Translation [75.03292732779059]
We focus on three language directions: Chinese-English, Chinese-German, and Chinese-Russian.
This year, we received a total of 10 submissions from 5 academic and industry teams.
The official ranking of the systems is based on the overall human judgments.
arXiv Detail & Related papers (2024-12-16T12:54:52Z) - Initial Nugget Evaluation Results for the TREC 2024 RAG Track with the AutoNuggetizer Framework [53.12387628636912]
This report provides an initial look at partial results from the TREC 2024 Retrieval-Augmented Generation (RAG) Track.
We have identified RAG evaluation as a barrier to continued progress in information access.
arXiv Detail & Related papers (2024-11-14T17:25:43Z) - MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task [21.490930342296256]
We present the MetricX-24 submissions to the WMT24 Metrics Shared Task.
Our primary submission is a hybrid reference-based/free metric.
We show a significant performance increase over MetricX-23 on the WMT23 MQM ratings, as well as our new synthetic challenge set.
arXiv Detail & Related papers (2024-10-04T23:52:28Z) - Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z) - MT-Ranker: Reference-free machine translation evaluation by inter-system ranking [14.188948302661933]
We show that MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21.
MT-Ranker marks state-of-the-art against reference-free as well as reference-based baselines.
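A pairwise, reference-free ranker such as MT-Ranker yields a full system ordering only after its pairwise judgments are aggregated. A minimal sketch using win counts (a Copeland-style aggregation; the aggregation used in the actual paper may differ, and `prefer` is a hypothetical stand-in for the learned comparator):

```python
from itertools import combinations

def rank_systems(systems, outputs, prefer):
    """Order systems by total pairwise wins.

    `prefer(out_a, out_b)` is a hypothetical segment-level judge
    returning True when the first output is better.
    """
    wins = {s: 0 for s in systems}
    for a, b in combinations(systems, 2):
        for seg_a, seg_b in zip(outputs[a], outputs[b]):
            if prefer(seg_a, seg_b):
                wins[a] += 1
            else:
                wins[b] += 1
    return sorted(systems, key=lambda s: wins[s], reverse=True)

# Toy judge: longer output wins (stand-in for a learned comparator).
outputs = {"sysA": ["a bb", "ccc"],
           "sysB": ["a", "cc"],
           "sysC": ["a bbb c", "cccc"]}
order = rank_systems(list(outputs), outputs, lambda x, y: len(x) > len(y))
print(order)  # ['sysC', 'sysA', 'sysB']
```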
arXiv Detail & Related papers (2024-01-30T15:30:03Z) - Findings of the WMT 2023 Shared Task on Discourse-Level Literary Translation: A Fresh Orb in the Cosmos of LLMs [80.05205710881789]
We release a copyrighted and document-level Chinese-English web novel corpus.
This year, we received a total of 14 submissions from 7 academic and industry teams.
The official ranking of the systems is based on the overall human judgments.
arXiv Detail & Related papers (2023-11-06T14:23:49Z) - Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called Error Analysis Prompting (EAPrompt).
This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM), and produces explainable and reliable MT evaluations at both the system and segment level.
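In MQM-style evaluation of the kind this prompting approach emulates, the judge lists errors with severities and the segment score is a negative weighted error count. A minimal sketch using the common WMT MQM convention (minor = 1, major = 5); the exact weights and error categories used in the paper above may differ, and the example error list is hypothetical:

```python
# MQM-style scoring: each identified error contributes a penalty
# weighted by its severity; a segment's score is the negated sum.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5}

def mqm_score(errors):
    """errors: list of (category, severity) pairs from the judge."""
    return -sum(SEVERITY_WEIGHTS[sev] for _, sev in errors)

def system_score(segment_errors):
    """Average segment-level MQM score over a test set."""
    return sum(mqm_score(e) for e in segment_errors) / len(segment_errors)

# Errors a judge might return for one segment (hypothetical example).
errors = [("mistranslation", "major"), ("punctuation", "minor")]
print(mqm_score(errors))  # -6
```

Averaging these segment scores per system gives the system-level evaluation mentioned in the abstract.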
arXiv Detail & Related papers (2023-03-24T05:05:03Z) - Alibaba-Translate China's Submission for WMT 2022 Quality Estimation Shared Task [80.22825549235556]
We present our submission, named UniTE, to the sentence-level MQM benchmark at the Quality Estimation Shared Task.
Specifically, our systems employ the UniTE framework, which combines three types of input formats during training with a pre-trained language model.
Results show that our models rank 1st overall in the Multilingual and English-Russian settings, and 2nd overall in the English-German and Chinese-English settings.
arXiv Detail & Related papers (2022-10-18T08:55:27Z) - An Automatic Evaluation of the WMT22 General Machine Translation Task [9.442139459221785]
The report evaluates a total of 185 systems for 21 translation directions.
It highlights some of the current limits of state-of-the-art machine translation systems.
arXiv Detail & Related papers (2022-09-28T15:31:57Z) - The JHU-Microsoft Submission for WMT21 Quality Estimation Shared Task [14.629380601429956]
This paper presents the JHU-Microsoft joint submission for WMT 2021 quality estimation shared task.
We participate only in Task 2 (post-editing effort estimation) of the shared task, focusing on target-side word-level quality estimation.
We demonstrate the competitiveness of our system compared to the widely adopted OpenKiwi-XLM baseline.
arXiv Detail & Related papers (2021-09-17T19:13:31Z) - Difficulty-Aware Machine Translation Evaluation [19.973201669851626]
We propose a novel difficulty-aware machine translation evaluation metric.
A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function.
Our proposed method performs well even when all the MT systems are very competitive.
arXiv Detail & Related papers (2021-07-30T02:45:36Z)
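The difficulty-aware idea above, where segments that most systems translate poorly receive larger weight in the final score, can be sketched as follows. This is a minimal illustration assuming metric scores in [0, 1] and a linear difficulty weight; the exact weighting function in the paper may differ:

```python
def difficulty_weighted_scores(quality):
    """quality[s][i]: metric score in [0, 1] for system s on segment i.

    Segment difficulty = 1 - average quality across systems, so the
    segments most systems fail on dominate the final score.
    """
    systems = list(quality)
    n_seg = len(quality[systems[0]])
    # Difficulty weight per segment.
    w = [1.0 - sum(quality[s][i] for s in systems) / len(systems)
         for i in range(n_seg)]
    total_w = sum(w)
    return {s: sum(w[i] * quality[s][i] for i in range(n_seg)) / total_w
            for s in systems}

# Both systems ace segment 1; only sysB handles the hard segment 2.
quality = {"sysA": [1.0, 0.2], "sysB": [1.0, 0.8]}
scores = difficulty_weighted_scores(quality)
# Segment 2 is harder (avg 0.5 vs 1.0), so sysB's edge there counts more.
print(scores["sysB"] > scores["sysA"])  # True
```

Note that when all systems are near-perfect on every segment the weights approach zero, so a practical implementation would need a floor on the total weight; the sketch omits this.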
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.