Related papers: MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation

MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation

URL: http://arxiv.org/abs/2211.08633v2
Date: Thu, 1 Jun 2023 15:00:49 GMT
Title: MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation
Authors: Dominik Mach\'a\v{c}ek, Ond\v{r}ej Bojar, Raj Dabre
Abstract summary: We conduct an extensive correlation analysis of Continuous Ratings (CR) and offline machine translation evaluation metrics. Our study reveals that the offline metrics are well correlated with CR and can be reliably used for evaluating machine translation in simultaneous mode. We conclude that given the current quality levels of SST, these metrics can be used as proxies for CR, alleviating the need for large scale human evaluation.
Score: 10.132491257235024
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: There have been several meta-evaluation studies on the correlation between human ratings and offline machine translation (MT) evaluation metrics such as BLEU, chrF2, BertScore and COMET. These metrics have been used to evaluate simultaneous speech translation (SST) but their correlations with human ratings of SST, which has been recently collected as Continuous Ratings (CR), are unclear. In this paper, we leverage the evaluations of candidate systems submitted to the English-German SST task at IWSLT 2022 and conduct an extensive correlation analysis of CR and the aforementioned metrics. Our study reveals that the offline metrics are well correlated with CR and can be reliably used for evaluating machine translation in simultaneous mode, with some limitations on the test set size. We conclude that given the current quality levels of SST, these metrics can be used as proxies for CR, alleviating the need for large scale human evaluation. Additionally, we observe that correlations of the metrics with translation as a reference is significantly higher than with simultaneous interpreting, and thus we recommend the former for reliable evaluation.

Related papers

Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics? [9.234136424254261]
Text style transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Using human evaluation is ideal but costly, as is common in other natural language processing (NLP) tasks. In this paper, we examine both set of existing and novel metrics from broader NLP tasks for TST evaluation.
arXiv Detail & Related papers (2025-02-07T07:39:17Z)
Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics. Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases. We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics. This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings. We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
MT-Ranker: Reference-free machine translation evaluation by inter-system ranking [14.188948302661933]
We show that MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21. MT-Ranker marks state-of-the-art against reference-free as well as reference-based baselines.
arXiv Detail & Related papers (2024-01-30T15:30:03Z)
BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems. We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore. In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work carries out motivated research to correctly estimate the confidence intervals citeBrown_etal2001Interval depending on the sample size of the translated text. The methodology we applied for this work is from Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA)
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity. This paper concerns ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
BLEU might be Guilty but References are not Innocent [34.817010352734]
We study different methods to collect references and compare their value in automated evaluation. Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task. Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output.
arXiv Detail & Related papers (2020-04-13T16:49:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.