Identifying Weaknesses in Machine Translation Metrics Through Minimum
Bayes Risk Decoding: A Case Study for COMET
- URL: http://arxiv.org/abs/2202.05148v1
- Date: Thu, 10 Feb 2022 17:07:32 GMT
- Title: Identifying Weaknesses in Machine Translation Metrics Through Minimum
Bayes Risk Decoding: A Case Study for COMET
- Authors: Chantal Amrhein and Rico Sennrich
- Abstract summary: We show that sample-based Minimum Bayes Risk decoding can be used to explore and quantify such weaknesses.
We further show that these biases cannot be fully removed by simply training on additional synthetic data.
- Score: 42.77140426679383
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural metrics have achieved impressive correlation with human judgements in
the evaluation of machine translation systems, but before we can safely
optimise towards such metrics, we should be aware of (and ideally eliminate)
biases towards bad translations that receive high scores. Our experiments show
that sample-based Minimum Bayes Risk decoding can be used to explore and
quantify such weaknesses. When applying this strategy to COMET for en-de and
de-en, we find that COMET models are not sensitive enough to discrepancies in
numbers and named entities. We further show that these biases cannot be fully
removed by simply training on additional synthetic data.
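To make the method concrete, below is a minimal sketch of sample-based MBR decoding: candidates sampled from the translation model are scored by their average utility against the other samples used as pseudo-references, and the highest-scoring candidate is returned. The utility function here is a toy token-overlap score standing in for COMET so the sketch runs on its own; names are illustrative and not taken from the authors' code.

```python
# Minimal sketch of sample-based Minimum Bayes Risk (MBR) decoding.
# `utility` is a toy token-overlap score standing in for a neural metric
# such as COMET; in the paper's setup the metric itself plays this role.

from typing import List


def utility(candidate: str, pseudo_reference: str) -> float:
    """Toy utility: Jaccard overlap of tokens (placeholder for COMET)."""
    cand, ref = set(candidate.split()), set(pseudo_reference.split())
    return len(cand & ref) / max(len(cand | ref), 1)


def mbr_decode(samples: List[str]) -> str:
    """Pick the sample with the highest expected utility, using all samples
    as pseudo-references (some variants exclude the candidate itself)."""
    def expected_utility(candidate: str) -> float:
        return sum(utility(candidate, ref) for ref in samples) / len(samples)

    return max(samples, key=expected_utility)


if __name__ == "__main__":
    samples = [
        "The meeting starts at 10 am.",
        "The meeting starts at 10 am .",
        "The meeting starts at 12 am.",  # number error a weak metric may under-penalise
    ]
    print(mbr_decode(samples))
```

Inspecting which candidates a metric prefers under MBR (for example, candidates with wrong numbers or named entities that still rank highly) is what exposes the metric's blind spots.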
Related papers
- Machine Translation Meta Evaluation through Translation Accuracy
Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 types of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation [12.407789866525079]
We show that by using additional information during training, such as sentence-level features and word-level tags, the trained metrics improve their capability to penalize translations with specific troublesome phenomena.
arXiv Detail & Related papers (2023-05-30T15:50:46Z)
- The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics [8.432864879027724]
We develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics.
Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors.
arXiv Detail & Related papers (2023-05-19T16:42:17Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
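As a rough illustration of such a segment-level extrinsic check, one can correlate per-segment metric scores with binary downstream success, for instance with Kendall's tau. The numbers below are invented purely for illustration; the study itself derives outcomes from three cross-lingual downstream tasks.

```python
# Hedged sketch: correlate segment-level metric scores with downstream outcomes.
# The data below are invented; they are not results from the paper.

from scipy.stats import kendalltau

metric_scores = [0.82, 0.45, 0.91, 0.30, 0.67]  # per-segment MT metric scores
task_success = [1, 0, 1, 0, 0]                   # 1 = downstream task succeeded

tau, p_value = kendalltau(metric_scores, task_success)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```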
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Competency-Aware Neural Machine Translation: Can Machine Translation Know its Own Translation Quality? [61.866103154161884]
Neural machine translation (NMT) is often criticized for failures of which the model itself is unaware.
We propose a novel competency-aware NMT by extending conventional NMT with a self-estimator.
We show that the proposed method delivers outstanding performance on quality estimation.
arXiv Detail & Related papers (2022-11-25T02:39:41Z)
- Better Uncertainty Quantification for Machine Translation Evaluation [17.36759906285316]
We train the COMET metric with new heteroscedastic regression, divergence minimization, and direct uncertainty prediction objectives.
Experiments show improved results on WMT20 and WMT21 metrics task datasets and a substantial reduction in computational costs.
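For context, heteroscedastic regression in this setting generally means predicting a per-segment variance alongside the quality score and training with a Gaussian negative log-likelihood. The sketch below shows that generic objective; the names (`mu`, `log_var`, `human_score`) are illustrative rather than taken from the paper.

```python
# Generic heteroscedastic (Gaussian NLL) objective: the model predicts a mean
# quality score and a log-variance per segment; larger predicted variance
# down-weights the squared error but pays a log-variance penalty.

import torch


def gaussian_nll(mu: torch.Tensor, log_var: torch.Tensor, human_score: torch.Tensor) -> torch.Tensor:
    var = log_var.exp()
    return 0.5 * (log_var + (human_score - mu) ** 2 / var).mean()


# Toy batch of three segments (invented values).
mu = torch.tensor([0.7, 0.4, 0.9])
log_var = torch.tensor([-1.0, 0.0, -2.0])
human_score = torch.tensor([0.8, 0.1, 0.85])
print(gaussian_nll(mu, log_var, human_score))
```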
arXiv Detail & Related papers (2022-04-13T17:49:25Z)
- Minimum Bayes Risk Decoding with Neural Metrics of Translation Quality [16.838064121696274]
This work applies Minimum Bayes Risk decoding to optimize diverse automated metrics of translation quality.
Experiments show that the combination of a neural translation model with a neural reference-based metric, BLEURT, results in significant improvement in automatic and human evaluations.
arXiv Detail & Related papers (2021-11-17T20:48:02Z)
- It's Easier to Translate out of English than into it: Measuring Neural Translation Difficulty by Cross-Mutual Information [90.35685796083563]
Cross-mutual information (XMI) is an asymmetric information-theoretic metric of machine translation difficulty.
XMI exploits the probabilistic nature of most neural machine translation models.
We present the first systematic and controlled study of cross-lingual translation difficulties using modern neural translation systems.
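For reference, cross-mutual information is usually written as the gap between a target-side language model's cross-entropy and the translation model's conditional cross-entropy; this is a sketch of that definition, with notation assumed rather than quoted from the paper.

```latex
% Sketch of the cross-mutual information (XMI) definition; notation assumed.
\[
  \mathrm{XMI}(X \rightarrow Y) \;=\; H_{q_{\mathrm{LM}}}(Y) \;-\; H_{q_{\mathrm{MT}}}(Y \mid X)
\]
% H_{q_LM}(Y): cross-entropy of a target-side language model on the target Y
% H_{q_MT}(Y | X): cross-entropy of the translation model on Y given the source X
% Larger XMI means the source provides more usable information about the target.
```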
arXiv Detail & Related papers (2020-05-05T17:38:48Z)