Trained MT Metrics Learn to Cope with Machine-translated References
- URL: http://arxiv.org/abs/2312.00536v1
- Date: Fri, 1 Dec 2023 12:15:58 GMT
- Title: Trained MT Metrics Learn to Cope with Machine-translated References
- Authors: Jannis Vamvas, Tobias Domhan, Sony Trenous, Rico Sennrich and Eva Hasler
- Abstract summary: We show that Prism+FT becomes more robust to machine-translated references.
This suggests that the effects of metric training go beyond the intended effect of improving overall correlation with human judgments.
- Score: 47.00411750716812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural metrics trained on human evaluations of MT tend to correlate well with
human judgments, but their behavior is not fully understood. In this paper, we
perform a controlled experiment and compare a baseline metric that has not been
trained on human evaluations (Prism) to a trained version of the same metric
(Prism+FT). Surprisingly, we find that Prism+FT becomes more robust to
machine-translated references, which are a notorious problem in MT evaluation.
This suggests that the effects of metric training go beyond the intended effect
of improving overall correlation with human judgments.
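The comparison described in the abstract can be illustrated with a minimal sketch: score the same hypotheses against human references and against machine-translated references, then check how well each set of scores correlates with human judgments. Everything below is an assumption made for illustration. `overlap_f1` is a toy unigram-overlap stand-in rather than Prism or Prism+FT, Pearson correlation is only one of several possible agreement measures, and the sentences and human scores are invented rather than taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def overlap_f1(hypothesis, reference):
    """Toy stand-in metric (NOT Prism): unigram F1 over whitespace tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    common = sum(min(hyp.count(tok), ref.count(tok)) for tok in set(hyp))
    if common == 0:
        return 0.0
    precision, recall = common / len(hyp), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def correlation_with_humans(metric, hypotheses, references, human_scores):
    """Score each hypothesis against its reference, then correlate with humans."""
    metric_scores = [metric(h, r) for h, r in zip(hypotheses, references)]
    return pearson(metric_scores, human_scores)


# Invented toy data: three MT outputs with hypothetical human quality scores.
hypotheses = [
    "the cat sat on the mat",
    "he go to school yesterday",
    "rain is falling heavy",
]
human_scores = [1.0, 0.6, 0.3]

# Human-written references versus references that are themselves MT output
# and therefore share some of the hypotheses' errors.
human_refs = [
    "the cat sat on the mat",
    "he went to school yesterday",
    "it is raining heavily",
]
mt_refs = [
    "the cat sits on the mat",
    "he go to the school yesterday",
    "rain is falling heavy",
]

r_human = correlation_with_humans(overlap_f1, hypotheses, human_refs, human_scores)
r_mt = correlation_with_humans(overlap_f1, hypotheses, mt_refs, human_scores)
print(f"correlation with human references: {r_human:.3f}")
print(f"correlation with MT references:    {r_mt:.3f}")
```

In this toy setup the machine-translated references reward hypotheses that repeat the same errors, so the correlation with human scores drops. A metric that is robust in the sense the abstract describes would show a smaller gap between the two conditions.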
Related papers
- MT-Ranker: Reference-free machine translation evaluation by inter-system
ranking [14.188948302661933]
We show that MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21.
MT-Ranker achieves state-of-the-art results against both reference-free and reference-based baselines.
arXiv Detail & Related papers (2024-01-30T15:30:03Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by
Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- MT Metrics Correlate with Human Ratings of Simultaneous Speech
Translation [10.132491257235024]
We conduct an extensive correlation analysis of Continuous Ratings (CR) and offline machine translation evaluation metrics.
Our study reveals that the offline metrics are well correlated with CR and can be reliably used for evaluating machine translation in simultaneous mode.
We conclude that given the current quality levels of SST, these metrics can be used as proxies for CR, alleviating the need for large scale human evaluation.
arXiv Detail & Related papers (2022-11-16T03:03:56Z)
- Not All Errors are Equal: Learning Text Generation Metrics using
Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- Non-Parametric Online Learning from Human Feedback for Neural Machine
Translation [54.96594148572804]
We study the problem of online learning with human feedback in human-in-the-loop machine translation.
Previous methods require online model updating or additional translation memory networks to achieve high-quality performance.
We propose a novel non-parametric online learning method without changing the model structure.
arXiv Detail & Related papers (2021-09-23T04:26:15Z)
- Difficulty-Aware Machine Translation Evaluation [19.973201669851626]
We propose a novel difficulty-aware machine translation evaluation metric.
A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function.
Our proposed method performs well even when all the MT systems are very competitive.
arXiv Detail & Related papers (2021-07-30T02:45:36Z)
- Scientific Credibility of Machine Translation Research: A
Meta-Evaluation of 769 Papers [21.802259336894068]
This paper presents the first large-scale meta-evaluation of machine translation (MT).
We annotated MT evaluations conducted in 769 research papers published from 2010 to 2020.
arXiv Detail & Related papers (2021-06-29T09:30:17Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine
Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.