COMET-poly: Machine Translation Metric Grounded in Other Candidates
- URL: http://arxiv.org/abs/2508.18549v1
- Date: Mon, 25 Aug 2025 22:55:22 GMT
- Title: COMET-poly: Machine Translation Metric Grounded in Other Candidates
- Authors: Maike Züfle, Vilém Zouhar, Tu Anh Dinh, Felipe Maia Polo, Jan Niehues, Mrinmaya Sachan
- Abstract summary: We propose two automated metrics that incorporate additional information beyond the single translation. COMET-polycand uses alternative translations of the same source sentence to compare and contrast with the translation at hand. We find that including a single additional translation in COMET-polycand improves the segment-level metric performance.
- Score: 63.82506348745169
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated metrics for machine translation attempt to replicate human judgment. Unlike humans, who often assess a translation in the context of multiple alternatives, these metrics typically consider only the source sentence and a single translation. This discrepancy in the evaluation setup may negatively impact the performance of automated metrics. We propose two automated metrics that incorporate additional information beyond the single translation. COMET-polycand uses alternative translations of the same source sentence to compare and contrast with the translation at hand, thereby providing a more informed assessment of its quality. COMET-polyic, inspired by retrieval-based in-context learning, takes in translations of similar source texts along with their human-labeled quality scores to guide the evaluation. We find that including a single additional translation in COMET-polycand improves the segment-level metric performance (0.079 to 0.118 Kendall's tau-b correlation), with further gains when more translations are added. Incorporating retrieved examples in COMET-polyic yields similar improvements (0.079 to 0.116 Kendall's tau-b correlation). We release our models publicly.
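To make the polycand idea concrete, here is a minimal sketch of the compare-and-contrast features such a metric consumes. A generic multilingual sentence encoder (sentence-transformers with LaBSE) stands in for the trained COMET regressor, so every model name and feature choice below is an illustrative assumption, not the released architecture.

```python
# Minimal sketch of the COMET-polycand setup: score a hypothesis while
# also looking at alternative candidate translations of the same source.
# A generic multilingual encoder stands in for the trained COMET regressor;
# treat model names and feature choices as illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # assumed encoder

def polycand_features(source: str, hypothesis: str, candidates: list[str]) -> dict:
    """Features a polycand-style regressor could see: the hypothesis's
    relation to the source and to other candidates for the same source."""
    src_emb = encoder.encode(source, convert_to_tensor=True)
    hyp_emb = encoder.encode(hypothesis, convert_to_tensor=True)
    cand_embs = encoder.encode(candidates, convert_to_tensor=True)
    return {
        "src_sim": util.cos_sim(hyp_emb, src_emb).item(),
        # Compare-and-contrast signal: where the hypothesis sits among peers.
        "cand_sims": util.cos_sim(hyp_emb, cand_embs).squeeze(0).tolist(),
    }

features = polycand_features(
    "Der Hund schläft.",
    "The dog is sleeping.",
    ["The dog sleeps.", "A dog is asleep.", "The cat is sleeping."],
)
```

A polyic-style variant would instead retrieve translations of similar source texts together with their human-labeled quality scores and feed those in as context.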
Related papers
- Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation [57.11989521509119]
We propose RATE, a novel agentic translation evaluation framework centered on a reflective Core Agent that invokes specialized sub-agents. Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 in meta score over current metrics.
arXiv Detail & Related papers (2026-01-12T09:03:42Z)
- Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering [68.3400058037817]
We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality. We show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations.
arXiv Detail & Related papers (2025-04-10T09:24:54Z)
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
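A minimal sketch of this protocol, where `score(source, hypothesis)` is a hypothetical stand-in for any segment-level MT metric rather than the paper's own code:

```python
# Sketch of the recency-preference protocol: if commercial systems improve
# over time, a good metric should prefer the later of two weekly snapshots.
# `score` is a hypothetical stand-in for any segment-level MT metric.

def recency_preference(pairs: list, score) -> float:
    """pairs: (source, older_translation, newer_translation) triples.
    Returns the fraction of pairs where the metric prefers the newer output."""
    wins = sum(score(src, new) > score(src, old) for src, old, new in pairs)
    return wins / len(pairs)
```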
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
- Textual Similarity as a Key Metric in Machine Translation Quality Estimation [27.152245569974678]
Machine Translation (MT) Quality Estimation (QE) assesses translation reliability without reference texts.
This study introduces "textual similarity" as a new metric for QE, using sentence transformers and cosine similarity to measure semantic closeness.
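The described signal is straightforward to sketch with the sentence-transformers library; the specific model below (LaBSE) is an assumption, chosen for its cross-lingual embeddings:

```python
# Reference-free QE sketch: semantic closeness of source and translation,
# measured with a multilingual sentence transformer and cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")  # assumed model

def textual_similarity_qe(source: str, translation: str) -> float:
    src_emb, hyp_emb = model.encode([source, translation], convert_to_tensor=True)
    return util.cos_sim(src_emb, hyp_emb).item()

print(textual_similarity_qe("Der Hund schläft.", "The dog is sleeping."))
```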
arXiv Detail & Related papers (2024-06-11T16:48:17Z)
- Robustness Tests for Automatic Machine Translation Metrics with Adversarial Attacks [39.86206454559138]
We experiment with word- and character-level attacks on three popular machine translation metrics: BERTScore, BLEURT, and COMET.
Our human experiments validate that automatic metrics tend to overpenalize adversarially-degraded translations.
We identify patterns of brittleness that motivate more robust metric development.
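For a flavor of the character-level attacks studied, here is a hedged sketch; `metric` is a hypothetical stand-in for BERTScore, BLEURT, or COMET, not any of their actual APIs:

```python
import random

def char_swap(text: str, n_swaps: int = 1, seed: int = 0) -> str:
    """Character-level perturbation: swap random adjacent character pairs."""
    rng = random.Random(seed)
    chars = list(text)
    if len(chars) < 2:
        return text
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def score_drop(metric, source: str, translation: str) -> float:
    """How heavily the metric penalizes the perturbed translation; comparing
    this drop against human judgments reveals overpenalization."""
    return metric(source, translation) - metric(source, char_swap(translation))
```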
arXiv Detail & Related papers (2023-11-01T13:14:23Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- NMTScore: A Multilingual Analysis of Translation-based Text Similarity Measures [42.46681912294797]
We analyze translation-based similarity measures in the common framework of multilingual NMT.
Compared to baselines such as sentence embeddings, translation-based measures prove competitive in paraphrase identification.
The measures also show a relatively high correlation with human judgments.
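NMTScore ships as a Python package; a minimal usage sketch follows, based on the project's README, so treat the exact calls as an assumption:

```python
# Translation-based similarity via the nmtscore package (pip install nmtscore).
# The default scorer wraps a multilingual NMT model; API per the project
# README, so treat the exact calls as an assumption.
from nmtscore import NMTScorer

scorer = NMTScorer()  # loads a default multilingual NMT model

similarity = scorer.score(
    "This is a sentence.",
    "This is another sentence.",
)
print(similarity)
```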
arXiv Detail & Related papers (2022-04-28T17:57:17Z)
- BLEU might be Guilty but References are not Innocent [34.817010352734]
We study different methods to collect references and compare their value in automated evaluation.
Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task.
Our method yields higher correlation with human judgment not only for the WMT 2019 English-to-German submissions, but also for back-translation and APE-augmented MT outputs.
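In the spirit of that finding, evaluating against an extra, more diverse reference stream is easy to sketch with sacrebleu; the example data below is illustrative toy data:

```python
# Multi-reference BLEU with sacrebleu: add a paraphrased reference stream
# alongside the original one. Each stream is parallel to the hypothesis list;
# the sentences here are illustrative toy data.
import sacrebleu

hypotheses = ["The cat sat on the mat."]
original_refs = ["The cat was sitting on the mat."]
paraphrased_refs = ["On the mat, a cat was sitting."]  # diverse paraphrase

bleu = sacrebleu.corpus_bleu(hypotheses, [original_refs, paraphrased_refs])
print(bleu.score)
```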
arXiv Detail & Related papers (2020-04-13T16:49:09Z)