Extrinsic Evaluation of Machine Translation Metrics
- URL: http://arxiv.org/abs/2212.10297v2
- Date: Mon, 19 Jun 2023 03:59:55 GMT
- Title: Extrinsic Evaluation of Machine Translation Metrics
- Authors: Nikita Moghe and Tom Sherborne and Mark Steedman and Alexandra Birch
- Abstract summary: It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
- Score: 78.75776477562087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic machine translation (MT) metrics are widely used to distinguish the
translation qualities of machine translation systems across relatively large
test sets (system-level evaluation). However, it is unclear if automatic
metrics are reliable at distinguishing good translations from bad translations
at the sentence level (segment-level evaluation). In this paper, we investigate
how useful MT metrics are at detecting the success of a machine translation
component when placed in a larger platform with a downstream task. We evaluate
the segment-level performance of the most widely used MT metrics (chrF, COMET,
BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state
tracking, question answering, and semantic parsing). For each task, we only
have access to a monolingual task-specific model. We calculate the correlation
between the metric's ability to predict a good/bad translation and the
success/failure on the final task for the Translate-Test setup. Our experiments
demonstrate that all metrics exhibit negligible correlation with the extrinsic
evaluation of the downstream outcomes. We also find that the scores provided by
neural metrics are not interpretable, mostly because of their undefined ranges. We
synthesise our analysis into recommendations for future MT metrics to produce
labels rather than scores for more informative interaction between machine
translation and multilingual language understanding.
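As a rough illustration of the segment-level analysis described in the abstract (not the authors' exact protocol), the sketch below correlates a metric's segment scores with binary downstream success in a Translate-Test setting. The scores, labels, and their sizes are hypothetical stand-ins; in practice the scores would come from chrF/COMET/BERTScore and the labels from the monolingual task model.

```python
# Minimal sketch (assumed setup, not the paper's exact protocol):
# correlate segment-level MT metric scores with binary downstream success.
from scipy.stats import kendalltau, pointbiserialr

# Hypothetical segment-level metric scores for translated test inputs.
metric_scores = [0.81, 0.42, 0.95, 0.33, 0.77, 0.58, 0.90, 0.25]

# Hypothetical downstream outcomes (1 = task model succeeded on the
# translated input, 0 = it failed), e.g. exact match for QA.
task_success = [1, 0, 1, 0, 1, 1, 1, 0]

# Point-biserial correlation: continuous metric score vs. binary outcome.
r_pb, p_pb = pointbiserialr(task_success, metric_scores)

# Kendall's tau is another common choice for segment-level meta-evaluation.
tau, p_tau = kendalltau(metric_scores, task_success)

print(f"point-biserial r = {r_pb:.3f} (p = {p_pb:.3f})")
print(f"Kendall tau      = {tau:.3f} (p = {p_tau:.3f})")
```

A near-zero value from either statistic would mirror the paper's finding that segment-level metric scores are weak predictors of downstream success.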
Related papers
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 phenomena representing translation accuracy errors.
We conduct a large-scale study by benchmarking 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks on ACES; a minimal sketch of this style of pairwise contrastive check appears after the list below.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- MuLER: Detailed and Scalable Reference-based Evaluation [24.80921931416632]
We propose a novel methodology that transforms any reference-based evaluation metric for text generation into a fine-grained analysis tool.
Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types.
We perform experiments in both synthetic and naturalistic settings to support MuLER's validity and showcase its usability.
arXiv Detail & Related papers (2023-05-24T10:26:13Z)
- Difficulty-Aware Machine Translation Evaluation [19.973201669851626]
We propose a novel difficulty-aware machine translation evaluation metric.
A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function.
Our proposed method performs well even when all the MT systems are very competitive.
arXiv Detail & Related papers (2021-07-30T02:45:36Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where source texts are compared directly to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
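For the contrastive challenge-set evaluation mentioned in the ACES entry above, a common protocol is to count how often a metric scores the correct translation above the error-containing one for the same source and reference. The sketch below illustrates that pairwise accuracy with a toy overlap metric and hypothetical data; it is not the ACES implementation.

```python
# Hypothetical sketch of a contrastive challenge-set check (not the ACES code):
# a metric "passes" an example if it scores the good translation above the
# incorrect (error-containing) one for the same source/reference.
from typing import Callable, List, Tuple

def pairwise_accuracy(
    examples: List[Tuple[str, str, str, str]],
    metric: Callable[[str, str, str], float],
) -> float:
    """examples: (source, reference, good_translation, incorrect_translation).
    metric: returns a quality score given (source, reference, hypothesis)."""
    wins = 0
    for src, ref, good, bad in examples:
        if metric(src, ref, good) > metric(src, ref, bad):
            wins += 1
    return wins / len(examples)

# Toy stand-in metric: token overlap with the reference (illustration only).
def overlap_metric(src: str, ref: str, hyp: str) -> float:
    ref_tokens, hyp_tokens = set(ref.split()), set(hyp.split())
    return len(ref_tokens & hyp_tokens) / max(len(hyp_tokens), 1)

examples = [
    ("Der Hund schläft.", "The dog is sleeping.",
     "The dog is sleeping.", "The cat is sleeping."),
]
print(f"pairwise accuracy: {pairwise_accuracy(examples, overlap_metric):.2f}")
```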