ACES: Translation Accuracy Challenge Sets for Evaluating Machine
Translation Metrics
- URL: http://arxiv.org/abs/2210.15615v1
- Date: Thu, 27 Oct 2022 16:59:02 GMT
- Title: ACES: Translation Accuracy Challenge Sets for Evaluating Machine
Translation Metrics
- Authors: Chantal Amrhein and Nikita Moghe and Liane Guillou
- Abstract summary: Machine translation (MT) metrics improve their correlation with human judgement every year.
It is important to investigate metric behaviour when facing accuracy errors in MT.
We curate ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge.
- Score: 2.48769664485308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As machine translation (MT) metrics improve their correlation with human
judgement every year, it is crucial to understand the limitations of such
metrics at the segment level. Specifically, it is important to investigate
metric behaviour when facing accuracy errors in MT because these can have
dangerous consequences in certain contexts (e.g., legal, medical). We curate
ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging
from simple perturbations at the word/character level to more complex errors
based on discourse and real-world knowledge. We use ACES to evaluate a wide
range of MT metrics including the submissions to the WMT 2022 metrics shared
task and perform several analyses leading to general recommendations for metric
developers. We recommend: a) combining metrics with different strengths, b)
developing metrics that give more weight to the source and less to
surface-level overlap with the reference and c) explicitly modelling additional
language-specific information beyond what is available via multilingual
embeddings.
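ACES is a contrastive challenge set: each example pairs a correct translation with an incorrect one exhibiting a specific accuracy error, and a metric "passes" the example if it scores the correct translation higher. The sketch below illustrates that pass/fail logic and a simple Kendall's tau-like statistic per phenomenon; the field names and the `score` callable are placeholders rather than the actual ACES data format or any particular metric's API, and the published ACES-Score additionally weights error categories, which is not reproduced here.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable

def evaluate_challenge_set(
    examples: Iterable[dict],
    score: Callable[[str, str, str], float],
) -> Dict[str, float]:
    """Kendall's tau-like statistic per phenomenon: (wins - losses) / total.

    Each example is assumed to hold a source, a reference, a good translation,
    and an incorrect translation exhibiting one phenomenon; `score(source,
    hypothesis, reference)` is any segment-level MT metric. Ties count as losses.
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        good = score(ex["source"], ex["good_translation"], ex["reference"])
        bad = score(ex["source"], ex["incorrect_translation"], ex["reference"])
        totals[ex["phenomenon"]] += 1
        if good > bad:  # the metric prefers the correct translation
            wins[ex["phenomenon"]] += 1
    return {
        ph: (wins[ph] - (totals[ph] - wins[ph])) / totals[ph]  # in [-1, 1]
        for ph in totals
    }

# Illustrative usage with a toy surface-overlap "metric".
if __name__ == "__main__":
    toy = lambda src, hyp, ref: len(set(hyp.split()) & set(ref.split()))
    data = [{
        "source": "Der Patient erhielt 5 mg.",
        "reference": "The patient received 5 mg.",
        "good_translation": "The patient received 5 mg.",
        "incorrect_translation": "The patient received 50 mg.",
        "phenomenon": "number-error",
    }]
    print(evaluate_challenge_set(data, toy))  # {'number-error': 1.0}
```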
Related papers
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
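The "Beyond Correlation" entry above evaluates metrics through data-filtering and translation re-ranking proxies; below is a minimal sketch of both scenarios, with placeholder `score` and `qe_score` callables rather than any metric or data from that paper.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

def rerank(
    source: str,
    candidates: Sequence[str],
    reference: str,
    score: Callable[[str, str, str], float],
) -> str:
    """Translation re-ranking proxy: return the candidate the metric scores highest.

    `score(source, hypothesis, reference)` stands in for any segment-level metric.
    """
    return max(candidates, key=lambda hyp: score(source, hyp, reference))

def filter_bitext(
    pairs: Iterable[Tuple[str, str]],
    qe_score: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Data-filtering proxy: keep (source, target) pairs that a reference-free
    quality-estimation scorer rates at or above a threshold."""
    return [(src, tgt) for src, tgt in pairs if qe_score(src, tgt) >= threshold]
```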
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- ACES: Translation Accuracy Challenge Sets at WMT 2023 [7.928752019133836]
We benchmark the performance of segment-level metrics submitted to WMT 2023 using the ACES Challenge Set.
The challenge set consists of 36K examples representing challenges from 68 phenomena and covering 146 language pairs.
For each metric, we provide a detailed profile of performance over a range of error categories as well as an overall ACES-Score for quick comparison.
arXiv Detail & Related papers (2023-11-02T11:29:09Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called Error Analysis Prompting (EAPrompt).
This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM), and produces explainable and reliable MT evaluations at both the system and segment level.
arXiv Detail & Related papers (2023-03-24T05:05:03Z)
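EAPrompt's actual prompt is not reproduced in this summary; the snippet below is only a hypothetical illustration of an MQM-style error-analysis prompt and scoring rule (ask for errors by severity, then convert the counts into a penalty score using illustrative weights).

```python
# Hypothetical illustration of an MQM-style error-analysis prompt and scoring
# rule; this is NOT the actual EAPrompt template from the paper.
PROMPT_TEMPLATE = """You are an expert translation evaluator.
Source ({src_lang}): {source}
Translation ({tgt_lang}): {hypothesis}

List the translation errors you find, one per line, in the form
<severity>: <description>, where severity is "major" or "minor".
If there are no errors, answer "no errors"."""

# Illustrative severity penalties loosely following MQM conventions.
SEVERITY_WEIGHTS = {"major": 5, "minor": 1}

def mqm_style_score(llm_reply: str) -> int:
    """Convert the listed errors into a negative penalty (0 = no detected errors)."""
    penalty = 0
    for line in llm_reply.strip().splitlines():
        severity = line.split(":", 1)[0].strip().lower()
        penalty += SEVERITY_WEIGHTS.get(severity, 0)
    return -penalty

# `prompt` would be sent to an LLM of choice; scoring then parses the reply.
prompt = PROMPT_TEMPLATE.format(
    src_lang="German",
    tgt_lang="English",
    source="Die Sitzung beginnt am Montag.",
    hypothesis="The meeting starts on Tuesday.",
)
# A reply listing one major and one minor error scores -6.
print(mqm_style_score("major: 'Montag' mistranslated as 'Tuesday'\nminor: slightly literal phrasing"))
```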
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
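The extrinsic-evaluation entry above compares segment-level metric scores with downstream task outcomes; here is a minimal sketch of that comparison using the point-biserial correlation (Pearson correlation between continuous scores and binary task success), with hypothetical inputs rather than the paper's actual tasks or data.

```python
import numpy as np

def metric_vs_downstream_correlation(metric_scores, task_success):
    """Pearson/point-biserial correlation between segment-level metric scores
    and binary downstream outcomes (1 = the downstream task succeeded on this
    translated segment, 0 = it failed)."""
    scores = np.asarray(metric_scores, dtype=float)
    success = np.asarray(task_success, dtype=float)
    return float(np.corrcoef(scores, success)[0, 1])

# Hypothetical example: uniformly high metric scores that barely track task
# success; the correlation comes out close to zero in this toy case.
print(metric_vs_downstream_correlation(
    [0.82, 0.79, 0.91, 0.85, 0.88, 0.80],
    [1, 0, 1, 0, 0, 1],
))
```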
- DEMETR: Diagnosing Evaluation Metrics for Translation [21.25704103403547]
We release DEMETR, a diagnostic dataset with 31K English examples.
We find that learned metrics perform substantially better than string-based metrics on DEMETR.
arXiv Detail & Related papers (2022-10-25T03:25:44Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.