Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation
- URL: http://arxiv.org/abs/2503.08327v1
- Date: Tue, 11 Mar 2025 11:40:10 GMT
- Title: Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation
- Authors: José Pombal, Nuno M. Guerreiro, Ricardo Rei, André F. T. Martins
- Abstract summary: Mint can misguide practitioners into being overoptimistic about the performance of their systems. We propose MintAdjust, a method for more reliable evaluation under Mint. On the WMT24 MT shared task test set, MintAdjust ranks translations and systems more accurately than state-of-the-art metrics.
- Score: 24.481028155002523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As automatic metrics become increasingly stronger and widely adopted, the risk of unintentionally "gaming the metric" during model development rises. This issue is caused by metric interference (Mint), i.e., the use of the same or related metrics for both model tuning and evaluation. Mint can misguide practitioners into being overoptimistic about the performance of their systems: as system outputs become a function of the interfering metric, their estimated quality loses correlation with human judgments. In this work, we analyze two common cases of Mint in machine translation-related tasks: filtering of training data, and decoding with quality signals. Importantly, we find that Mint strongly distorts instance-level metric scores, even when metrics are not directly optimized for -- questioning the common strategy of leveraging a different, yet related metric for evaluation that is not used for tuning. To address this problem, we propose MintAdjust, a method for more reliable evaluation under Mint. On the WMT24 MT shared task test set, MintAdjust ranks translations and systems more accurately than state-of-the-art metrics across a majority of language pairs, especially for high-quality systems. Furthermore, MintAdjust outperforms AutoRank, the ensembling method used by the organizers.
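To make the "decoding with quality signals" case concrete, below is a minimal, hypothetical sketch (not the paper's implementation; the toy scorer and sentences are purely illustrative): candidates are re-ranked with a quality-estimation signal, and evaluating the selected outputs with that same signal reports an inflated score that need not track true quality.

```python
# Hypothetical sketch of metric interference (Mint) in decoding: candidates
# are re-ranked with a quality signal and then "evaluated" with the same
# scorer, so the reported score inherits the scorer's biases.
from typing import Callable, List

def toy_qe(hyp: str) -> float:
    """Stand-in for a learned QE metric (an assumption): rewards longer outputs."""
    return float(len(hyp.split()))

def qe_rerank(candidates: List[str], qe_score: Callable[[str], float]) -> str:
    """Pick the candidate the quality-estimation signal likes best."""
    return max(candidates, key=qe_score)

def evaluate(outputs: List[str], metric: Callable[[str], float]) -> float:
    """Corpus score as the mean of instance-level metric scores."""
    return sum(metric(o) for o in outputs) / len(outputs)

candidate_lists = [
    ["the cat sat", "the cat sat on the mat quietly today"],
    ["hello world", "hello to the whole wide world out there"],
]
selected = [qe_rerank(cands, toy_qe) for cands in candidate_lists]

# Scoring the selected outputs with the interfering metric (or a closely
# related one) yields an optimistic number that no longer tracks quality.
print(evaluate(selected, toy_qe))
```

MintAdjust is the paper's proposed remedy for scoring under such interference; the sketch above only illustrates the failure mode it addresses.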
Related papers
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
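As a concrete illustration, here is a minimal sketch of multi-reference scoring with sacrebleu (the library choice and toy sentences are assumptions, not necessarily the paper's setup); each reference stream supplies one reference per hypothesis.

```python
# Sketch of multi-reference BLEU/chrF scoring with sacrebleu (assumed setup).
import sacrebleu

hypotheses = ["the cat is on the mat", "he plays the piano well"]
refs_a = ["the cat sits on the mat", "he plays piano very well"]
refs_b = ["a cat is on the mat", "he is good at playing the piano"]

# Passing several reference streams lets BLEU/chrF match against any of them,
# which is the "multiple references" idea argued to improve agreement with
# human judgments.
bleu = sacrebleu.corpus_bleu(hypotheses, [refs_a, refs_b])
chrf = sacrebleu.corpus_chrf(hypotheses, [refs_a, refs_b])
print(bleu.score, chrf.score)
```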
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration [31.082944145354293]
Kendall's tau is frequently used to meta-evaluate how well machine translation (MT) evaluation metrics score individual translations.
We show that existing variants have weaknesses arising from their handling of ties, and in some situations can even be gamed.
We propose instead to meta-evaluate metrics with a version of pairwise accuracy that gives metrics credit for correctly predicting ties, and a tie calibration procedure that automatically introduces ties into metric scores.
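A hedged sketch of such a pairwise accuracy follows (the epsilon-based tie rule is an illustrative assumption, not the authors' exact formulation; a tie-calibration procedure would tune epsilon on held-out data).

```python
# Illustrative pairwise accuracy that also credits correctly predicted ties.
from itertools import combinations
from typing import Sequence

def pairwise_accuracy_with_ties(human: Sequence[float],
                                metric: Sequence[float],
                                epsilon: float = 0.0) -> float:
    """Fraction of translation pairs whose ordering (or tie) the metric predicts.

    Two metric scores within `epsilon` of each other are treated as a tie.
    """
    correct, total = 0, 0
    for i, j in combinations(range(len(human)), 2):
        h_cmp = (human[i] > human[j]) - (human[i] < human[j])  # -1, 0, +1
        d = metric[i] - metric[j]
        m_cmp = 0 if abs(d) <= epsilon else (1 if d > 0 else -1)
        correct += int(h_cmp == m_cmp)
        total += 1
    return correct / total if total else 0.0

# Toy data: two human-tied translations and one clearly better one.
print(pairwise_accuracy_with_ties([3, 3, 5], [0.71, 0.70, 0.93], epsilon=0.02))
```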
arXiv Detail & Related papers (2023-05-23T17:54:57Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
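As an illustration of this kind of protocol (the numbers and library choice are made up here, not the paper's data or pipeline), segment-level metric scores can be correlated with a per-segment downstream outcome, e.g. whether a cross-lingual QA system answered correctly on the translated input.

```python
# Sketch: correlate segment-level metric scores with a downstream outcome.
from scipy.stats import kendalltau

metric_scores = [0.81, 0.64, 0.92, 0.55, 0.77]  # e.g. per-segment COMET scores
task_success  = [1,    0,    1,    1,    0]     # downstream outcome per segment

tau, p_value = kendalltau(metric_scores, task_success)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```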
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation [5.972205906525993]
We investigate which metrics have the highest accuracy to make system-level quality rankings for pairs of systems.
We show that the sole use of BLEU negatively affected the past development of improved models.
arXiv Detail & Related papers (2021-07-22T17:22:22Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)