MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task
- URL: http://arxiv.org/abs/2410.03983v1
- Date: Fri, 4 Oct 2024 23:52:28 GMT
- Title: MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task
- Authors: Juraj Juraska, Daniel Deutsch, Mara Finkelstein, Markus Freitag
- Abstract summary: We present the MetricX-24 submissions to the WMT24 Metrics Shared Task.
Our primary submission is a hybrid reference-based/reference-free metric.
We show a significant performance increase over MetricX-23 on the WMT23 MQM ratings, as well as our new synthetic challenge set.
- Score: 21.490930342296256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present the MetricX-24 submissions to the WMT24 Metrics Shared Task and provide details on the improvements we made over the previous version of MetricX. Our primary submission is a hybrid reference-based/-free metric, which can score a translation irrespective of whether it is given the source segment, the reference, or both. The metric is trained on previous WMT data in a two-stage fashion, first on the DA ratings only, then on a mixture of MQM and DA ratings. The training set in both stages is augmented with synthetic examples that we created to make the metric more robust to several common failure modes, such as fluent but unrelated translation, or undertranslation. We demonstrate the benefits of the individual modifications via an ablation study, and show a significant performance increase over MetricX-23 on the WMT23 MQM ratings, as well as our new synthetic challenge set.
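As a rough illustration of the hybrid reference-based/-free setup described above, the sketch below assembles the three possible input variants (source only, reference only, or both) for a single scoring model. The function name and field tags are hypothetical, not the actual MetricX-24 input format.

```python
# Hypothetical sketch of how a hybrid reference-based/-free metric might
# assemble its input: the same model scores a candidate whether the source,
# the reference, or both are available. Field tags are illustrative only.
from typing import Optional

def build_metric_input(hypothesis: str,
                       source: Optional[str] = None,
                       reference: Optional[str] = None) -> str:
    if source is None and reference is None:
        raise ValueError("need at least a source or a reference")
    parts = [f"candidate: {hypothesis}"]
    if reference is not None:
        parts.append(f"reference: {reference}")
    if source is not None:
        parts.append(f"source: {source}")
    return " ".join(parts)

# The same model can then be trained on and queried with all three variants:
print(build_metric_input("Das Haus ist gross.", source="The house is big."))
print(build_metric_input("Das Haus ist gross.", reference="Das Haus ist groß."))
print(build_metric_input("Das Haus ist gross.",
                         source="The house is big.",
                         reference="Das Haus ist groß."))
```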
Related papers
- MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration [14.636927775315783]
We present MetaMetrics-MT, an innovative metric designed to evaluate machine translation (MT) tasks by aligning closely with human preferences.
Our experiments on the WMT24 metric shared task dataset demonstrate that MetaMetrics-MT outperforms all existing baselines.
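A minimal sketch of the meta-metric idea: combine several base metrics with weights chosen to correlate with human ratings. The exhaustive grid search below is illustrative only and is not the calibration procedure actually used by MetaMetrics-MT.

```python
# Combine per-segment scores from several base metrics with weights chosen
# to maximize Pearson correlation with human ratings (requires Python 3.10+
# for statistics.correlation).
from itertools import product
from statistics import correlation

def combine(weights, metric_scores):
    # metric_scores: one list of per-segment scores per base metric
    return [sum(w * s[i] for w, s in zip(weights, metric_scores))
            for i in range(len(metric_scores[0]))]

def calibrate(metric_scores, human_scores, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    best_w, best_r = None, float("-inf")
    for w in product(grid, repeat=len(metric_scores)):
        if sum(w) == 0:
            continue
        r = correlation(combine(w, metric_scores), human_scores)
        if r > best_r:
            best_w, best_r = w, r
    return best_w, best_r

# Toy example with two base metrics and human ratings on four segments.
m1 = [0.9, 0.4, 0.7, 0.2]
m2 = [0.8, 0.5, 0.6, 0.1]
human = [0.95, 0.35, 0.75, 0.15]
print(calibrate([m1, m2], human))
```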
arXiv Detail & Related papers (2024-11-01T06:34:30Z) - Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
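To illustrate the concept, the sketch below defines a few deliberately uninformative "sentinel" scorers that a sound meta-evaluation should not reward; these are hypothetical examples, not the sentinel metrics proposed in the paper.

```python
# Illustrative sentinels only: each ignores translation quality, so a fair
# meta-evaluation should rank them poorly.
import random

def sentinel_constant(source: str, hypothesis: str) -> float:
    # Ignores its input entirely.
    return 0.5

def sentinel_random(source: str, hypothesis: str, seed: int = 0) -> float:
    # Deterministic per segment, but unrelated to translation quality.
    return random.Random(f"{seed}:{hypothesis}").random()

def sentinel_length(source: str, hypothesis: str) -> float:
    # Rewards longer outputs regardless of accuracy.
    return min(1.0, len(hypothesis.split()) / 30)
```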
arXiv Detail & Related papers (2024-08-25T13:29:34Z) - Machine Translation Meta Evaluation through Translation Accuracy
Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify errors spanning 68 translation accuracy phenomena.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
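The basic contrastive evaluation can be sketched as a pairwise check: a metric passes an example if it scores the correct translation above the incorrect one. The code below shows only this core idea; the official ACES-Score aggregates results per phenomenon with its own weighting.

```python
# Pairwise contrastive accuracy over (source, good, bad) triples.
from typing import Callable, List, Tuple

Example = Tuple[str, str, str]  # (source, good_translation, bad_translation)

def pairwise_accuracy(metric: Callable[[str, str], float],
                      examples: List[Example]) -> float:
    wins = sum(1 for src, good, bad in examples
               if metric(src, good) > metric(src, bad))
    return wins / len(examples)

# Toy metric: word overlap with the source (a deliberately weak baseline).
def overlap_metric(src: str, hyp: str) -> float:
    s, h = set(src.lower().split()), set(hyp.lower().split())
    return len(s & h) / max(len(h), 1)
```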
arXiv Detail & Related papers (2024-01-29T17:17:42Z) - ACES: Translation Accuracy Challenge Sets at WMT 2023 [7.928752019133836]
We benchmark the performance of segment-level metrics submitted to WMT 2023 using the ACES Challenge Set.
The challenge set consists of 36K examples representing challenges from 68 phenomena and covering 146 language pairs.
For each metric, we provide a detailed profile of performance over a range of error categories as well as an overall ACES-Score for quick comparison.
arXiv Detail & Related papers (2023-11-02T11:29:09Z) - Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called Error Analysis Prompting (EAPrompt).
This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM), and produces explainable and reliable MT evaluations at both the system and segment level.
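A hedged sketch of how MQM-style error analysis can be turned into a score: prompt an LLM for major and minor errors, then weight the counts. The prompt text and weights below are placeholders, not the exact EAPrompt template or the official MQM weighting scheme.

```python
# Generic MQM-style scoring from error counts; the template and weights are
# illustrative stand-ins.
PROMPT_TEMPLATE = (
    "Source: {src}\n"
    "Translation: {hyp}\n"
    "List the translation errors, marking each as major or minor."
)

def mqm_style_score(num_major: int, num_minor: int,
                    major_weight: float = 5.0,
                    minor_weight: float = 1.0) -> float:
    # Lower (more negative) means worse; 0 means no identified errors.
    return -(major_weight * num_major + minor_weight * num_minor)

# E.g., an LLM response parsed into 1 major and 2 minor errors:
print(mqm_style_score(num_major=1, num_minor=2))
```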
arXiv Detail & Related papers (2023-03-24T05:05:03Z) - Alibaba-Translate China's Submission for WMT 2022 Quality Estimation Shared Task [80.22825549235556]
We present our submission, named UniTE, to the sentence-level MQM benchmark of the Quality Estimation Shared Task.
Specifically, our systems employ the UniTE framework, which combines three types of input formats during training with a pre-trained language model.
Results show that our models reach 1st overall ranking in the Multilingual and English-Russian settings, and 2nd overall ranking in English-German and Chinese-English settings.
arXiv Detail & Related papers (2022-10-18T08:55:27Z) - Alibaba-Translate China's Submission for WMT 2022 Metrics Shared Task [61.34108034582074]
We build our system based on the core idea of UNITE (Unified Translation Evaluation).
During the model pre-training phase, we first apply the pseudo-labeled data examples to continuously pre-train UNITE.
During the fine-tuning phase, we use both Direct Assessment (DA) and Multidimensional Quality Metrics (MQM) data from past years' WMT competitions.
arXiv Detail & Related papers (2022-10-18T08:51:25Z) - Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric [15.646714712131148]
We present a method for extending pretrained metrics to incorporate context at the document level.
We show that the extended metrics outperform their sentence-level counterparts in about 85% of the tested conditions.
Our experimental results support our initial hypothesis and show that a simple extension of the metrics permits them to take advantage of context.
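A generic sketch of the extension: wrap a sentence-level metric and prepend a few preceding sentences as context before scoring. The exact choice of context (how many sentences, and which inputs receive it) in the paper may differ from this simplified version.

```python
# Turn a sentence-level metric into a document-level one by prepending
# preceding sentences as shared context before scoring each segment.
from typing import Callable, List

def doc_level_scores(metric: Callable[[str, str], float],
                     references: List[str],
                     hypotheses: List[str],
                     context_size: int = 2,
                     sep: str = " ") -> List[float]:
    scores = []
    for i, (ref, hyp) in enumerate(zip(references, hypotheses)):
        start = max(0, i - context_size)
        # Prepend the same (reference) context to both sides so the metric
        # judges the current sentence in context rather than in isolation.
        ctx = sep.join(references[start:i])
        scores.append(metric(sep.join([ctx, ref]).strip(),
                             sep.join([ctx, hyp]).strip()))
    return scores
```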
arXiv Detail & Related papers (2022-09-27T19:42:22Z) - QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
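The QA-based recipe can be sketched as a four-stage pipeline: select answer spans from the summary, generate questions for them, answer those questions from the source, and compare the answers. In QAFactEval each stage is a learned component; the skeleton below uses caller-supplied stand-ins to show only the flow.

```python
# Skeleton of a QA-based factual consistency metric; every stage is passed
# in as a function, since the real components are learned models.
from typing import Callable, List

def qa_consistency(summary: str,
                   source: str,
                   select_answers: Callable[[str], List[str]],
                   generate_question: Callable[[str, str], str],
                   answer: Callable[[str, str], str],
                   overlap: Callable[[str, str], float]) -> float:
    spans = select_answers(summary)
    if not spans:
        return 0.0
    scores = []
    for span in spans:
        question = generate_question(summary, span)   # ask about the summary
        predicted = answer(source, question)           # answer from the source
        scores.append(overlap(span, predicted))        # compare the answers
    return sum(scores) / len(scores)
```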
arXiv Detail & Related papers (2021-12-16T00:38:35Z) - The JHU-Microsoft Submission for WMT21 Quality Estimation Shared Task [14.629380601429956]
This paper presents the JHU-Microsoft joint submission for WMT 2021 quality estimation shared task.
We only participate in Task 2 (post-editing effort estimation) of the shared task, focusing on the target-side word-level quality estimation.
We demonstrate the competitiveness of our system compared to the widely adopted OpenKiwi-XLM baseline.
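Word-level QE assigns an OK/BAD tag to every target token. Systems of this kind are commonly compared with Matthews correlation over the tags, sketched below; whether this exact metric was used for the WMT21 task is not stated above.

```python
# Matthews correlation coefficient over OK/BAD tag sequences.
import math
from typing import List

def mcc(gold: List[str], pred: List[str]) -> float:
    tp = sum(g == p == "BAD" for g, p in zip(gold, pred))
    tn = sum(g == p == "OK" for g, p in zip(gold, pred))
    fp = sum(g == "OK" and p == "BAD" for g, p in zip(gold, pred))
    fn = sum(g == "BAD" and p == "OK" for g, p in zip(gold, pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

gold = ["OK", "OK", "BAD", "OK", "BAD"]
pred = ["OK", "BAD", "BAD", "OK", "BAD"]
print(round(mcc(gold, pred), 3))
```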
arXiv Detail & Related papers (2021-09-17T19:13:31Z) - Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task [30.889496911261677]
This paper describes our contribution to the WMT 2020 Metrics Shared Task.
We make several submissions based on BLEURT, a metric based on transfer learning.
We show how to combine BLEURT's predictions with those of YiSi and use alternative reference translations to enhance the performance.
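In the simplest reading, combining metric predictions and using alternative references could look like the sketch below: z-normalize each metric's segment scores, average them, and take the best score over the available references. The actual combination scheme used in the submission is not specified here; this is only a plausible sketch.

```python
# Naive combination of several metrics' segment scores, plus a helper that
# takes the best score over alternative references.
from statistics import mean, pstdev
from typing import Callable, Dict, List

def zscore(xs: List[float]) -> List[float]:
    m, s = mean(xs), pstdev(xs) or 1.0
    return [(x - m) / s for x in xs]

def combine_metrics(per_metric_scores: Dict[str, List[float]]) -> List[float]:
    normalized = [zscore(v) for v in per_metric_scores.values()]
    return [mean(col) for col in zip(*normalized)]

def best_over_references(score_fn: Callable[[str, str], float],
                         hypothesis: str,
                         references: List[str]) -> float:
    return max(score_fn(hypothesis, ref) for ref in references)
```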
arXiv Detail & Related papers (2020-10-08T23:16:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.