ACES: Translation Accuracy Challenge Sets at WMT 2023
- URL: http://arxiv.org/abs/2311.01153v1
- Date: Thu, 2 Nov 2023 11:29:09 GMT
- Title: ACES: Translation Accuracy Challenge Sets at WMT 2023
- Authors: Chantal Amrhein and Nikita Moghe and Liane Guillou
- Abstract summary: We benchmark the performance of segment-level metrics submitted to WMT 2023 using the ACES Challenge Set.
The challenge set consists of 36K examples representing challenges from 68 phenomena and covering 146 language pairs.
For each metric, we provide a detailed profile of performance over a range of error categories as well as an overall ACES-Score for quick comparison.
- Score: 7.928752019133836
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We benchmark the performance of segment-level metrics submitted to WMT 2023
using the ACES Challenge Set (Amrhein et al., 2022). The challenge set consists
of 36K examples representing challenges from 68 phenomena and covering 146
language pairs. The phenomena range from simple perturbations at the
word/character level to more complex errors based on discourse and real-world
knowledge. For each metric, we provide a detailed profile of performance over a
range of error categories as well as an overall ACES-Score for quick
comparison. We also measure the incremental performance of the metrics
submitted to both WMT 2023 and 2022. We find that 1) there is no clear winner
among the metrics submitted to WMT 2023, and 2) performance change between the
2023 and 2022 versions of the metrics is highly variable. Our recommendations
are similar to those from WMT 2022. Metric developers should focus on: building
ensembles of metrics from different design families, developing metrics that
pay more attention to the source and rely less on surface-level overlap, and
carefully determining the influence of multilingual embeddings on MT
evaluation.
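As a rough sketch of how a contrastive challenge set of this kind can be used to profile a metric, the example below scores each phenomenon with a Kendall's tau-like statistic (the metric is rewarded when it prefers the good translation over the incorrect one) and then combines the per-phenomenon results into a single weighted score. The toy examples, the `metric_score` function, and the category weights are illustrative assumptions; they do not reproduce the official ACES implementation or its published ACES-Score weights.

```python
from collections import defaultdict

# Each challenge example pairs a "good" translation with an "incorrect" one
# that contains a specific accuracy error (the phenomenon).
examples = [
    # (source, good_translation, incorrect_translation, reference, phenomenon)
    ("Er hat zwei Katzen.", "He has two cats.", "He has two dogs.",
     "He has two cats.", "mistranslation"),
    ("Sie kam gestern an.", "She arrived yesterday.", "She arrived.",
     "She arrived yesterday.", "omission"),
]

def metric_score(source, hypothesis, reference):
    """Placeholder metric: naive token overlap with the reference.
    In a real run this would be a submitted segment-level metric."""
    hyp, ref = set(hypothesis.lower().split()), set(reference.lower().split())
    return len(hyp & ref) / max(len(ref), 1)

def tau_like_per_phenomenon(examples):
    """Kendall's tau-like statistic per phenomenon:
    (concordant - discordant) / total, where 'concordant' means the metric
    scores the good translation higher than the incorrect one."""
    concordant = defaultdict(int)
    total = defaultdict(int)
    for src, good, bad, ref, phenomenon in examples:
        total[phenomenon] += 1
        if metric_score(src, good, ref) > metric_score(src, bad, ref):
            concordant[phenomenon] += 1
    return {p: (2 * concordant[p] - total[p]) / total[p] for p in total}

# Hypothetical category weights: the published ACES-Score uses its own fixed
# weighting over top-level error categories, which is not reproduced here.
weights = {"mistranslation": 5, "omission": 5}

taus = tau_like_per_phenomenon(examples)
overall = sum(weights.get(p, 1) * tau for p, tau in taus.items())
print(taus, overall)
```

In practice, the placeholder metric would be replaced by one of the submitted segment-level metrics, and the per-phenomenon tau-like scores would be reported alongside the weighted total to give both the detailed profile and the quick-comparison number described above.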
Related papers
- MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task [21.490930342296256]
We present the MetricX-24 submissions to the WMT24 Metrics Shared Task.
Our primary submission is a hybrid reference-based/free metric.
We show a significant performance increase over MetricX-23 on the WMT23 MQM ratings, as well as our new synthetic challenge set.
arXiv Detail & Related papers (2024-10-04T23:52:28Z)
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- The MuSe 2024 Multimodal Sentiment Analysis Challenge: Social Perception and Humor Recognition [64.5207572897806]
The Multimodal Sentiment Analysis Challenge (MuSe) 2024 addresses two contemporary multimodal affect and sentiment analysis problems.
In the Social Perception Sub-Challenge (MuSe-Perception), participants will predict 16 different social attributes of individuals.
The Cross-Cultural Humor Detection Sub-Challenge (MuSe-Humor) dataset expands upon the Passau Spontaneous Football Coach Humor dataset.
arXiv Detail & Related papers (2024-06-11T22:26:20Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Unify word-level and span-level tasks: NJUNLP's Participation for the WMT2023 Quality Estimation Shared Task [59.46906545506715]
We introduce the submissions of the NJUNLP team to the WMT 2023 Quality Estimation (QE) shared task.
Our team submitted predictions for the English-German language pair on both sub-tasks.
Our models achieved the best results in English-German for both word-level and fine-grained error span detection sub-tasks.
arXiv Detail & Related papers (2023-09-23T01:52:14Z)
- Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called Error Analysis Prompting (EAPrompt).
This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM), and produces explainable and reliable MT evaluations at both the system and segment level.
arXiv Detail & Related papers (2023-03-24T05:05:03Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- ACES: Translation Accuracy Challenge Sets for Evaluating Machine Translation Metrics [2.48769664485308]
Machine translation (MT) metrics improve their correlation with human judgement every year.
It is important to investigate metric behaviour when facing accuracy errors in MT.
We curate ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge.
arXiv Detail & Related papers (2022-10-27T16:59:02Z)
- Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task [30.889496911261677]
This paper describes our contribution to the WMT 2020 Metrics Shared Task.
We make several submissions based on BLEURT, a metric based on transfer learning.
We show how to combine BLEURT's predictions with those of YiSi and use alternative reference translations to enhance the performance.
arXiv Detail & Related papers (2020-10-08T23:16:26Z)
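The BLEURT entry above mentions combining BLEURT's predictions with those of YiSi, and the ACES recommendations likewise point toward ensembles of metrics from different design families. Below is a minimal sketch of one way such a combination could look, assuming each metric produces a list of segment-level scores; the z-score normalization and equal weighting are illustrative choices, not the actual method used in any submission.

```python
import statistics

def zscore(values):
    """Normalize a list of segment-level scores so metrics on different
    scales (e.g. BLEURT vs. YiSi) can be averaged meaningfully."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values) or 1.0  # avoid division by zero
    return [(v - mean) / stdev for v in values]

def ensemble(metric_scores, weights=None):
    """Combine per-metric segment scores into one ensemble score per segment.
    metric_scores: dict mapping metric name -> list of scores (same segment order)."""
    names = list(metric_scores)
    weights = weights or {name: 1.0 for name in names}
    normalized = {name: zscore(scores) for name, scores in metric_scores.items()}
    num_segments = len(next(iter(normalized.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in names) / sum(weights.values())
        for i in range(num_segments)
    ]

# Illustrative scores for three segments from two hypothetical metric runs.
scores = {"bleurt": [0.62, 0.41, 0.77], "yisi": [0.80, 0.55, 0.91]}
print(ensemble(scores))
```

A weighted average of normalized scores is only one simple ensembling strategy; learned combinations or rank-based fusion are equally plausible under the same recommendation.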
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.