DEMETR: Diagnosing Evaluation Metrics for Translation
- URL: http://arxiv.org/abs/2210.13746v1
- Date: Tue, 25 Oct 2022 03:25:44 GMT
- Title: DEMETR: Diagnosing Evaluation Metrics for Translation
- Authors: Marzena Karpinska and Nishant Raj and Katherine Thai and Yixiao Song
and Ankita Gupta and Mohit Iyyer
- Abstract summary: We release DEMETR, a diagnostic dataset with 31K English examples.
We find that learned metrics perform substantially better than string-based metrics on DEMETR.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While machine translation evaluation metrics based on string overlap (e.g.,
BLEU) have their limitations, their computations are transparent: the BLEU
score assigned to a particular candidate translation can be traced back to the
presence or absence of certain words. The operations of newer learned metrics
(e.g., BLEURT, COMET), which leverage pretrained language models to achieve
higher correlations with human quality judgments than BLEU, are opaque in
comparison. In this paper, we shed light on the behavior of these learned
metrics by creating DEMETR, a diagnostic dataset with 31K English examples
(translated from 10 source languages) for evaluating the sensitivity of MT
evaluation metrics to 35 different linguistic perturbations spanning semantic,
syntactic, and morphological error categories. All perturbations were carefully
designed to form minimal pairs with the actual translation (i.e., differ in
only one aspect). We find that learned metrics perform substantially better
than string-based metrics on DEMETR. Additionally, learned metrics differ in
their sensitivity to various phenomena (e.g., BERTScore is sensitive to
untranslated words but relatively insensitive to gender manipulation, while
COMET is much more sensitive to word repetition than to aspectual changes). We
publicly release DEMETR to spur more informed future development of machine
translation evaluation metrics.
Related papers
- Syntactic Language Change in English and German: Metrics, Parsers, and Convergences [56.47832275431858]
The current paper looks at diachronic trends in syntactic language change in both English and German, using corpora of parliamentary debates from the last c. 160 years.
We base our observations on five dependency parsers, including the widely used Stanford CoreNLP as well as four newer alternatives.
We show that changes in syntactic measures seem to be more frequent at the tails of sentence length distributions.
arXiv Detail & Related papers (2024-02-18T11:46:16Z) - Machine Translation Meta Evaluation through Translation Accuracy
Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify translation accuracy errors spanning 68 phenomena.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z) - BLEURT Has Universal Translations: An Analysis of Automatic Metrics by
Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z) - BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust
Machine Translation Evaluation [12.407789866525079]
We show that by using additional information during training, such as sentence-level features and word-level tags, the trained metrics improve their capability to penalize translations with specific troublesome phenomena.
arXiv Detail & Related papers (2023-05-30T15:50:46Z) - The Inside Story: Towards Better Understanding of Machine Translation
Neural Evaluation Metrics [8.432864879027724]
We develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics.
Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors.
arXiv Detail & Related papers (2023-05-19T16:42:17Z) - BMX: Boosting Natural Language Generation Metrics with Explainability [23.8476163398993]
BMX explicitly leverages explanations to boost the metrics' performance.
Our tests show improvements for multiple metrics across MT and summarization datasets.
arXiv Detail & Related papers (2022-12-20T17:41:18Z) - Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z) - ACES: Translation Accuracy Challenge Sets for Evaluating Machine
Translation Metrics [2.48769664485308]
Machine translation (MT) metrics improve their correlation with human judgement every year.
It is important to investigate metric behaviour when facing accuracy errors in MT.
We curate ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge.
arXiv Detail & Related papers (2022-10-27T16:59:02Z) - Not All Errors are Equal: Learning Text Generation Metrics using
Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z) - On the Limitations of Cross-lingual Encoders as Exposed by
Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)