Machine Translation Meta Evaluation through Translation Accuracy
Challenge Sets
- URL: http://arxiv.org/abs/2401.16313v1
- Date: Mon, 29 Jan 2024 17:17:42 GMT
- Title: Machine Translation Meta Evaluation through Translation Accuracy
Challenge Sets
- Authors: Nikita Moghe, Arnisa Fazla, Chantal Amrhein, Tom Kocmi, Mark Steedman,
Alexandra Birch, Rico Sennrich, Liane Guillou
- Abstract summary: We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
- Score: 92.38654521870444
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent machine translation (MT) metrics calibrate their effectiveness by
correlating with human judgement but without any insights about their behaviour
across different error types. Challenge sets are used to probe specific
dimensions of metric behaviour but there are very few such datasets and they
either focus on a limited number of phenomena or a limited number of language
pairs. We introduce ACES, a contrastive challenge set spanning 146 language
pairs, aimed at discovering whether metrics can identify 68 translation
accuracy errors. These phenomena range from simple alterations at the
word/character level to more complex errors based on discourse and real-world
knowledge. We conduct a large-scale study by benchmarking ACES on 50 metrics
submitted to the WMT 2022 and 2023 metrics shared tasks. We benchmark metric
performance, assess their incremental performance over successive campaigns,
and measure their sensitivity to a range of linguistic phenomena. We also
investigate claims that Large Language Models (LLMs) are effective as MT
evaluators by evaluating on ACES. Our results demonstrate that different metric
families struggle with different phenomena and that LLM-based methods fail to
demonstrate reliable performance. Our analyses indicate that most metrics
ignore the source sentence, tend to prefer surface-level overlap and end up
incorporating properties of base models which are not always beneficial. We
expand ACES to include error span annotations, denoted as SPAN-ACES, and use
this dataset to evaluate span-based error metrics, showing that these metrics
also need considerable improvement. Finally, we provide a set of recommendations for
building better MT metrics, including focusing on error labels instead of
scores, ensembling, designing strategies to explicitly focus on the source
sentence, focusing on semantic content and choosing the right base model for
representations.
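Because ACES is contrastive, a metric is typically scored by checking, for each example, whether it assigns a higher score to the good translation than to the incorrect one, and summarising the outcomes per phenomenon with a Kendall's tau-like statistic. The following is a minimal sketch of that evaluation loop; the field names, the tie handling, and the toy surface-overlap metric are illustrative assumptions, not the actual ACES schema or any submitted metric.

```python
# Minimal sketch of scoring a metric on a contrastive challenge set such as ACES.
# Field names and the toy data/metric are illustrative, not the actual ACES schema.
from collections import defaultdict

def kendall_tau_like(metric, examples):
    """Per phenomenon: tau = (concordant - discordant) / (concordant + discordant),
    where an example is concordant if the metric prefers the good translation
    (ties count as discordant here)."""
    concordant, discordant = defaultdict(int), defaultdict(int)
    for ex in examples:
        good = metric(ex["source"], ex["good_translation"], ex["reference"])
        bad = metric(ex["source"], ex["incorrect_translation"], ex["reference"])
        bucket = concordant if good > bad else discordant
        bucket[ex["phenomenon"]] += 1
    phenomena = set(concordant) | set(discordant)
    return {ph: (concordant[ph] - discordant[ph]) / (concordant[ph] + discordant[ph])
            for ph in phenomena}

def toy_overlap_metric(source, hypothesis, reference):
    # Deliberately surface-level: fraction of reference tokens found in the hypothesis.
    ref_tokens = reference.split()
    hyp_tokens = set(hypothesis.split())
    return sum(tok in hyp_tokens for tok in ref_tokens) / max(len(ref_tokens), 1)

examples = [{
    "phenomenon": "mistranslation",
    "source": "Der Hund schläft.",
    "reference": "The dog is sleeping.",
    "good_translation": "The dog is sleeping.",
    "incorrect_translation": "The cat is sleeping.",
}]
print(kendall_tau_like(toy_overlap_metric, examples))  # {'mistranslation': 1.0}
```

Per-phenomenon values like these are what the per-category profiles and the overall ACES-Score mentioned in the papers below summarise.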
Related papers
- ACES: Translation Accuracy Challenge Sets at WMT 2023 [7.928752019133836]
We benchmark the performance of segment-level metrics submitted to WMT 2023 using the ACES Challenge Set.
The challenge set consists of 36K examples representing challenges from 68 phenomena and covering 146 language pairs.
For each metric, we provide a detailed profile of performance over a range of error categories as well as an overall ACES-Score for quick comparison.
arXiv Detail & Related papers (2023-11-02T11:29:09Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
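AutoMQM (above) and EAPrompt (further down) both ask an LLM for MQM-style error annotations rather than a single number; a segment score is then derived from the severity of the identified errors. Below is a hedged sketch of that final conversion step, assuming a simple list of labeled spans; the ErrorSpan structure and the severity weights are illustrative placeholders, not the values used in either paper.

```python
# Hedged sketch: turning labeled error spans into an MQM-style segment score.
# The ErrorSpan fields and severity weights are illustrative, not a paper's exact scheme.
from dataclasses import dataclass

@dataclass
class ErrorSpan:
    start: int       # character offset of the error in the hypothesis
    end: int
    category: str    # e.g. "accuracy/mistranslation"
    severity: str    # "minor", "major", or "critical"

SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}  # placeholder weights

def mqm_style_score(errors):
    """Closer to 0 is better: negate the summed severity weights of all errors."""
    return -sum(SEVERITY_WEIGHTS[e.severity] for e in errors)

errors = [
    ErrorSpan(10, 18, "accuracy/mistranslation", "major"),
    ErrorSpan(30, 34, "fluency/grammar", "minor"),
]
print(mqm_style_score(errors))  # -6.0
```

Span-level outputs of this kind are also what SPAN-ACES targets: instead of (or in addition to) a derived score, the predicted spans themselves can be compared against the annotated error spans.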
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
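Minimum Risk Training, which the study above uses to probe metrics, trains an MT model to minimise the expected risk (here, one minus the metric score) over sampled candidate translations, so any blind spot in the metric becomes something the model can exploit. A toy sketch of the objective, with hand-picked candidate log-probabilities and metric scores standing in for a real NMT model and a real metric such as BLEURT:

```python
# Toy sketch of the Minimum Risk Training (MRT) objective: the expected risk of the
# model's candidate translations under a metric. Numbers below are made up.
import numpy as np

def expected_risk(log_probs, metric_scores, alpha=1.0):
    """Renormalise candidate probabilities over the sample (sharpness alpha) and
    return E[risk], where risk = 1 - metric score."""
    p = np.exp(alpha * np.asarray(log_probs, dtype=float))
    p /= p.sum()
    risk = 1.0 - np.asarray(metric_scores, dtype=float)
    return float(np.dot(p, risk))

# Three sampled candidates for one source sentence (illustrative values).
log_probs = [-0.8, -1.2, -2.5]        # model log-probabilities
metric_scores = [0.40, 0.85, 0.90]    # metric scores in [0, 1]
print(round(expected_risk(log_probs, metric_scores), 3))
```

Because training pushes the model toward whatever the metric rewards, robustness defects such as the universal adversarial translations reported above matter directly for metric-guided training.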
- MuLER: Detailed and Scalable Reference-based Evaluation [24.80921931416632]
We propose a novel methodology that transforms any reference-based evaluation metric for text generation into a fine-grained analysis tool.
Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types.
We perform experiments in both synthetic and naturalistic settings to support MuLER's validity and showcase its usability.
arXiv Detail & Related papers (2023-05-24T10:26:13Z)
- Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called Error Analysis Prompting (EAPrompt).
This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM), and produces explainable and reliable MT evaluations at both the system and segment level.
arXiv Detail & Related papers (2023-03-24T05:05:03Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
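The extrinsic evaluation above amounts to asking whether segment-level metric scores predict downstream success. A minimal sketch of such a check, assuming one binary outcome per segment (e.g. whether the downstream cross-lingual task produced a correct output); the numbers are illustrative only.

```python
# Hedged sketch: correlating segment-level metric scores with downstream task outcomes.
# Illustrative data only; the paper above evaluates real metrics on three downstream tasks.
from scipy.stats import pearsonr

metric_scores = [0.91, 0.45, 0.78, 0.62, 0.88, 0.30]  # e.g. per-segment COMET scores
task_success  = [1, 0, 1, 0, 0, 1]                    # downstream outcome per segment

r, p_value = pearsonr(metric_scores, task_success)
print(f"Pearson r = {r:.2f} (p = {p_value:.2f})")
# An r near zero at the segment level would mirror the finding above that
# metric scores say little about downstream usefulness.
```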
- ACES: Translation Accuracy Challenge Sets for Evaluating Machine Translation Metrics [2.48769664485308]
Machine translation (MT) metrics improve their correlation with human judgement every year.
It is important to investigate metric behaviour when facing accuracy errors in MT.
We curate ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge.
arXiv Detail & Related papers (2022-10-27T16:59:02Z)
- DEMETR: Diagnosing Evaluation Metrics for Translation [21.25704103403547]
We release DEMETR, a diagnostic dataset with 31K English examples.
We find that learned metrics perform substantially better than string-based metrics on DEMETR.
arXiv Detail & Related papers (2022-10-25T03:25:44Z)
- A global analysis of metrics used for measuring performance in natural language processing [9.433496814327086]
We provide the first large-scale cross-sectional analysis of metrics used for measuring performance in natural language processing.
Results suggest that the large majority of natural language processing metrics currently used have properties that may result in an inadequate reflection of a model's performance.
arXiv Detail & Related papers (2022-04-25T11:41:50Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)