Trained MT Metrics Learn to Cope with Machine-translated References
- URL: http://arxiv.org/abs/2312.00536v1
- Date: Fri, 1 Dec 2023 12:15:58 GMT
- Title: Trained MT Metrics Learn to Cope with Machine-translated References
- Authors: Jannis Vamvas, Tobias Domhan, Sony Trenous, Rico Sennrich and Eva Hasler
- Abstract summary: We show that Prism+FT becomes more robust to machine-translated references.
This suggests that the effects of metric training go beyond the intended effect of improving overall correlation with human judgments.
- Score: 47.00411750716812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural metrics trained on human evaluations of MT tend to correlate well with
human judgments, but their behavior is not fully understood. In this paper, we
perform a controlled experiment and compare a baseline metric that has not been
trained on human evaluations (Prism) to a trained version of the same metric
(Prism+FT). Surprisingly, we find that Prism+FT becomes more robust to
machine-translated references, which are a notorious problem in MT evaluation.
This suggests that the effects of metric training go beyond the intended effect
of improving overall correlation with human judgments.
Related papers
- Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress [43.09028349076039]
In Machine Translation (MT) evaluation, metric performance is assessed based on agreement with human judgments.
We incorporate human baselines in the MT meta-evaluation, that is, the assessment of MT metrics' capabilities.
Our results show that human annotators are not consistently superior to automatic metrics, with state-of-the-art metrics often ranking on par with or higher than human baselines.
arXiv Detail & Related papers (2025-06-24T12:35:00Z)
- An Analysis on Automated Metrics for Evaluating Japanese-English Chat Translation [0.0]
We show that for ranking NMT models in chat translations, all metrics seem consistent in deciding which model outperforms the others.
On the other hand, neural-based metrics outperform traditional metrics, with COMET achieving the highest correlation with the human-annotated score on a chat translation.
arXiv Detail & Related papers (2024-12-24T05:54:40Z)
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
- MT-Ranker: Reference-free machine translation evaluation by inter-system ranking [14.188948302661933]
We show that MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21.
MT-Ranker achieves state-of-the-art results against reference-free as well as reference-based baselines.
arXiv Detail & Related papers (2024-01-30T15:30:03Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation [10.132491257235024]
We conduct an extensive correlation analysis of Continuous Ratings (CR) and offline machine translation evaluation metrics.
Our study reveals that the offline metrics are well correlated with CR and can be reliably used for evaluating machine translation in simultaneous mode.
We conclude that given the current quality levels of SST, these metrics can be used as proxies for CR, alleviating the need for large scale human evaluation.
arXiv Detail & Related papers (2022-11-16T03:03:56Z)
- Difficulty-Aware Machine Translation Evaluation [19.973201669851626]
We propose a novel difficulty-aware machine translation evaluation metric.
A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function.
Our proposed method performs well even when all the MT systems are very competitive.
arXiv Detail & Related papers (2021-07-30T02:45:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.