Uncertainty-Aware Machine Translation Evaluation
- URL: http://arxiv.org/abs/2109.06352v1
- Date: Mon, 13 Sep 2021 22:46:03 GMT
- Title: Uncertainty-Aware Machine Translation Evaluation
- Authors: Taisiya Glushkova, Chrysoula Zerva, Ricardo Rei, André F. T. Martins
- Abstract summary: We introduce uncertainty-aware MT evaluation and analyze the trustworthiness of the predicted quality.
We compare the performance of our uncertainty-aware MT evaluation methods across multiple language pairs from the QT21 dataset and the WMT20 metrics task.
- Score: 0.716879432974126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several neural-based metrics have been recently proposed to evaluate machine translation quality. However, all of them resort to point estimates, which provide limited information at segment level. This is made worse as they are trained on noisy, biased and scarce human judgements, often resulting in unreliable quality predictions. In this paper, we introduce uncertainty-aware MT evaluation and analyze the trustworthiness of the predicted quality. We combine the COMET framework with two uncertainty estimation methods, Monte Carlo dropout and deep ensembles, to obtain quality scores along with confidence intervals. We compare the performance of our uncertainty-aware MT evaluation methods across multiple language pairs from the QT21 dataset and the WMT20 metrics task, augmented with MQM annotations. We experiment with varying numbers of references and further discuss the usefulness of uncertainty-aware quality estimation (without references) to flag possibly critical translation mistakes.
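Below is a minimal sketch of the general idea behind the two uncertainty estimation methods named in the abstract, Monte Carlo dropout and deep ensembles, applied to a COMET-style neural quality regressor. The `QualityRegressor` module, its layer sizes, and the 1.96 z-value for an approximate 95% interval are illustrative assumptions, not the paper's or COMET's actual implementation:

```python
# Hedged sketch: segment-level quality scores with uncertainty via
# MC dropout and deep ensembles. QualityRegressor is a toy stand-in
# for a COMET-style neural metric; the real COMET API differs.
import torch
import torch.nn as nn

class QualityRegressor(nn.Module):
    """Toy stand-in for a sentence-level MT quality estimator."""
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(), nn.Dropout(p=0.1),
            nn.Linear(256, 1),
        )

    def forward(self, feats):
        # feats: [batch, dim] pooled representation of (source, hypothesis, reference)
        return self.net(feats).squeeze(-1)

def mc_dropout_scores(model, feats, n_samples=100):
    """Keep dropout active at inference time and draw stochastic score samples."""
    model.train()  # leaves dropout layers on; gradients are not needed
    with torch.no_grad():
        return torch.stack([model(feats) for _ in range(n_samples)])  # [n_samples, batch]

def ensemble_scores(models, feats):
    """Deep ensemble: one deterministic score per independently trained model."""
    with torch.no_grad():
        return torch.stack([m.eval()(feats) for m in models])  # [n_models, batch]

def confidence_interval(samples, z=1.96):
    """Approximate 95% interval from the sample mean and standard deviation."""
    mean, std = samples.mean(dim=0), samples.std(dim=0)
    return mean, mean - z * std, mean + z * std

# Example usage with random features standing in for encoder outputs.
torch.manual_seed(0)
feats = torch.randn(4, 768)
mean, low, high = confidence_interval(mc_dropout_scores(QualityRegressor(), feats))
ensemble = [QualityRegressor() for _ in range(5)]
e_mean, e_low, e_high = confidence_interval(ensemble_scores(ensemble, feats))
```

MC dropout needs only one trained model and extra forward passes, whereas deep ensembles require training several models independently, trading additional compute for typically better-calibrated intervals.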
Related papers
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- Can Automatic Metrics Assess High-Quality Translations? [28.407966066693334]
We show that current metrics are insensitive to nuanced differences in translation quality.
This effect is most pronounced when the quality is high and the variance among alternatives is low.
Using the MQM framework as the gold standard, we systematically stress-test the ability of current metrics to identify translations with no errors as marked by humans.
arXiv Detail & Related papers (2024-05-28T16:44:02Z)
- Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean [7.843029855730508]
We develop a 1200-sentence MQM evaluation benchmark for the language pair English-Korean.
We find that the reference-free setup outperforms its reference-based counterpart in the style dimension.
Overall, RemBERT emerges as the most promising model.
arXiv Detail & Related papers (2024-03-19T12:02:38Z)
- Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called Error Analysis Prompting (EAPrompt).
This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM), and produces explainable and reliable MT evaluations at both the system and segment level.
arXiv Detail & Related papers (2023-03-24T05:05:03Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Competency-Aware Neural Machine Translation: Can Machine Translation Know its Own Translation Quality? [61.866103154161884]
Neural machine translation (NMT) is often criticized for failures that happen without awareness.
We propose a novel competency-aware NMT by extending conventional NMT with a self-estimator.
We show that the proposed method delivers outstanding performance on quality estimation.
arXiv Detail & Related papers (2022-11-25T02:39:41Z)
- Better Uncertainty Quantification for Machine Translation Evaluation [17.36759906285316]
We train the COMET metric with new heteroscedastic regression, divergence minimization, and direct uncertainty prediction objectives; a sketch of a heteroscedastic objective appears after this list.
Experiments show improved results on WMT20 and WMT21 metrics task datasets and a substantial reduction in computational costs.
arXiv Detail & Related papers (2022-04-13T17:49:25Z)
- Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work carries out research to correctly estimate confidence intervals (Brown et al., 2001) depending on the sample size of the translated text.
The methodology applied in this work draws on Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
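Regarding the "Better Uncertainty Quantification for Machine Translation Evaluation" entry above, the following is a minimal sketch of a heteroscedastic regression objective, in which the model predicts both a quality score and its variance and is trained with a Gaussian negative log-likelihood. The module and layer names here are illustrative assumptions, not that paper's implementation:

```python
# Hedged sketch of a heteroscedastic regression objective: the model outputs
# a mean quality score and a log-variance, trained with a Gaussian NLL.
import torch
import torch.nn as nn

class HeteroscedasticHead(nn.Module):
    """Predicts a quality score and a log-variance on top of pooled features."""
    def __init__(self, dim=768):
        super().__init__()
        self.mean = nn.Linear(dim, 1)
        self.log_var = nn.Linear(dim, 1)  # predicting log(sigma^2) keeps the variance positive

    def forward(self, feats):
        return self.mean(feats).squeeze(-1), self.log_var(feats).squeeze(-1)

def gaussian_nll(mu, log_var, target):
    """Batch mean of 0.5 * (log sigma^2 + (y - mu)^2 / sigma^2)."""
    return (0.5 * (log_var + (target - mu) ** 2 * torch.exp(-log_var))).mean()

# Example usage with random features standing in for encoder outputs and DA scores.
head = HeteroscedasticHead()
feats, target = torch.randn(8, 768), torch.rand(8)
mu, log_var = head(feats)
loss = gaussian_nll(mu, log_var, target)
loss.backward()
```

Predicting the log-variance rather than the variance itself keeps the predicted uncertainty positive and the optimization numerically stable.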