MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation
- URL: http://arxiv.org/abs/2211.08633v2
- Date: Thu, 1 Jun 2023 15:00:49 GMT
- Title: MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation
- Authors: Dominik Macháček, Ondřej Bojar, Raj Dabre
- Abstract summary: We conduct an extensive correlation analysis of Continuous Ratings (CR) and offline machine translation evaluation metrics.
Our study reveals that the offline metrics are well correlated with CR and can be reliably used for evaluating machine translation in simultaneous mode.
We conclude that given the current quality levels of SST, these metrics can be used as proxies for CR, alleviating the need for large scale human evaluation.
- Score: 10.132491257235024
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: There have been several meta-evaluation studies on the correlation between
human ratings and offline machine translation (MT) evaluation metrics such as
BLEU, chrF2, BertScore and COMET. These metrics have been used to evaluate
simultaneous speech translation (SST) but their correlations with human ratings
of SST, which have been recently collected as Continuous Ratings (CR), are
unclear. In this paper, we leverage the evaluations of candidate systems
submitted to the English-German SST task at IWSLT 2022 and conduct an extensive
correlation analysis of CR and the aforementioned metrics. Our study reveals
that the offline metrics are well correlated with CR and can be reliably used
for evaluating machine translation in simultaneous mode, with some limitations
on the test set size. We conclude that given the current quality levels of SST,
these metrics can be used as proxies for CR, alleviating the need for large
scale human evaluation. Additionally, we observe that correlations of the
metrics with translation as a reference are significantly higher than with
simultaneous interpreting, and thus we recommend the former for reliable
evaluation.
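As a minimal sketch of the kind of correlation analysis the abstract describes, the snippet below computes a Pearson correlation between automatic metric scores and human ratings. The choice of Pearson is an assumption (the abstract does not name the coefficient), the function and variable names are hypothetical, and the numbers are invented for illustration only; they are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares (the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-document scores, invented for illustration only:
# an automatic metric (e.g. a BLEU-like score) and averaged human ratings.
metric_scores = [18.2, 22.5, 25.1, 30.4, 27.8]
continuous_ratings = [2.1, 2.6, 3.0, 3.7, 3.2]

r = pearson(metric_scores, continuous_ratings)
print(f"Pearson r = {r:.3f}")
```

A coefficient near 1 on real data would support using the metric as a proxy for human ratings. With few documents the estimate is noisy, which is in line with the abstract's caveat about test set size.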
Related papers
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference of Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- MT-Ranker: Reference-free machine translation evaluation by inter-system ranking [14.188948302661933]
We show that MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21.
MT-Ranker sets a new state of the art against reference-free as well as reference-based baselines.
arXiv Detail & Related papers (2024-01-30T15:30:03Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work carries out motivated research to correctly estimate the confidence intervals (Brown et al., 2001) depending on the sample size of the translated text.
The methodology we applied for this work is from Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
- BLEU might be Guilty but References are not Innocent [34.817010352734]
We study different methods to collect references and compare their value in automated evaluation.
Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task.
Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output.
arXiv Detail & Related papers (2020-04-13T16:49:09Z)