BLEU might be Guilty but References are not Innocent
- URL: http://arxiv.org/abs/2004.06063v2
- Date: Tue, 20 Oct 2020 13:02:12 GMT
- Title: BLEU might be Guilty but References are not Innocent
- Authors: Markus Freitag, David Grangier, Isaac Caswell
- Abstract summary: We study different methods to collect references and compare their value in automated evaluation.
Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task.
Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output.
- Score: 34.817010352734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quality of automatic metrics for machine translation has been
increasingly called into question, especially for high-quality systems. This
paper demonstrates that, while choice of metric is important, the nature of the
references is also critical. We study different methods to collect references
and compare their value in automated evaluation by reporting correlation with
human evaluation for a variety of systems and metrics. Motivated by the finding
that typical references exhibit poor diversity, concentrating around
translationese language, we develop a paraphrasing task for linguists to
perform on existing reference translations, which counteracts this bias. Our
method yields higher correlation with human judgment not only for the
submissions of WMT 2019 English to German, but also for Back-translation and
APE augmented MT output, which have been shown to have low correlation with
automatic metrics using standard references. We demonstrate that our
methodology improves correlation with all modern evaluation metrics we look at,
including embedding-based methods. To complete this picture, we reveal that
multi-reference BLEU does not improve the correlation for high quality output,
and present an alternative multi-reference formulation that is more effective.
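As a concrete illustration of the evaluation setup the abstract describes, the sketch below scores toy system outputs with BLEU against a standard reference stream, then again with a human-paraphrased stream added as a second reference, and correlates the resulting system-level scores with hypothetical human judgments. This is a minimal sketch with invented data, assuming sacrebleu and scipy are available; it is not the paper's protocol and does not reproduce its specific alternative multi-reference formulation.

```python
# Minimal sketch: system-level metric scores vs. human judgments.
# Toy data only; real WMT-style evaluation uses full test sets and
# professionally collected human ratings.
import sacrebleu
from scipy.stats import pearsonr

# Hypothetical outputs of three MT systems on a two-sentence test set.
systems = {
    "sys_a": ["the cat sat on the mat", "he went to the store yesterday"],
    "sys_b": ["a cat is sitting on the mat", "yesterday he walked to the shop"],
    "sys_c": ["cat on mat", "he go store yesterday"],
}

# Standard reference stream plus a human paraphrase of it (second stream).
refs_standard = ["the cat sat on the mat", "he went to the store yesterday"]
refs_paraphrased = ["a cat was sitting on the mat", "yesterday he headed to the shop"]

# Hypothetical system-level human scores (e.g., averaged adequacy ratings).
human_scores = {"sys_a": 0.92, "sys_b": 0.88, "sys_c": 0.35}

def system_bleu(hypotheses, reference_streams):
    """Corpus BLEU against one or more reference streams (multi-reference BLEU)."""
    return sacrebleu.corpus_bleu(hypotheses, reference_streams).score

names = list(systems)
single_ref = [system_bleu(systems[n], [refs_standard]) for n in names]
multi_ref = [system_bleu(systems[n], [refs_standard, refs_paraphrased]) for n in names]
human = [human_scores[n] for n in names]

# Pearson correlation between metric scores and human judgments at the system level.
print("single-reference BLEU vs. human:", pearsonr(single_ref, human)[0])
print("multi-reference  BLEU vs. human:", pearsonr(multi_ref, human)[0])
```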
Related papers
- Can Automatic Metrics Assess High-Quality Translations? [28.407966066693334]
We show that current metrics are insensitive to nuanced differences in translation quality.
This effect is most pronounced when the quality is high and the variance among alternatives is low.
Using the MQM framework as the gold standard, we systematically stress-test the ability of current metrics to identify translations with no errors as marked by humans.
arXiv Detail & Related papers (2024-05-28T16:44:02Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
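Minimum risk training, which the entry above uses to probe metric robustness, turns a metric into a training signal: the model is pushed toward sampled candidates the metric scores highly, so a metric with "universal" high-scoring translations can be exploited. The sketch below is a minimal illustration of the standard MRT expected-risk objective for a single source sentence, with hypothetical candidate log-probabilities and metric scores; it is not taken from the paper, and the scoring values stand in for BLEURT or BARTScore outputs.

```python
# Minimal sketch of the minimum-risk-training (MRT) objective for one source
# sentence: expected cost under a renormalized model distribution over a
# sampled candidate set. Gradients and updates are omitted; this only shows
# how a metric enters training as a reward signal.
import math

def mrt_expected_risk(candidate_logprobs, metric_scores, alpha=0.005):
    """Expected risk = sum_i q(y_i) * (1 - metric(y_i, ref)).

    candidate_logprobs: model log-probabilities of sampled candidates y_1..y_k
    metric_scores: metric(y_i, ref) in [0, 1], higher is better
    alpha: sharpness of the renormalized distribution q(y_i) ~ exp(alpha * logp_i)
    """
    weights = [math.exp(alpha * lp) for lp in candidate_logprobs]
    z = sum(weights)
    q = [w / z for w in weights]
    costs = [1.0 - s for s in metric_scores]
    return sum(qi * ci for qi, ci in zip(q, costs))

# Toy example: three sampled candidates with hypothetical log-probs and
# hypothetical metric scores; a degenerate candidate that the metric
# over-rewards (here the third) lowers the risk and attracts the model.
logprobs = [-12.3, -15.0, -20.5]
scores = [0.80, 0.75, 0.95]
print(mrt_expected_risk(logprobs, scores))
```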
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Towards Explainable Evaluation Metrics for Natural Language Generation [36.594817754285984]
We identify key properties and propose key goals of explainable machine translation evaluation metrics.
We conduct our own novel experiments, which find that current adversarial NLP techniques are unsuitable for automatically identifying limitations of high-quality black-box evaluation metrics.
arXiv Detail & Related papers (2022-03-21T17:05:54Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
- Decoding and Diversity in Machine Translation [90.33636694717954]
We characterize the cost in diversity paid for the BLEU scores enjoyed by NMT.
Our study implicates search as a salient source of known bias when translating gender pronouns.
arXiv Detail & Related papers (2020-11-26T21:09:38Z)
- Human-Paraphrased References Improve Neural Machine Translation [33.86920777067357]
We show that tuning to paraphrased references produces a system that is significantly better according to human judgment.
Our work confirms the finding that paraphrased references yield metric scores that correlate better with human judgment.
arXiv Detail & Related papers (2020-10-20T13:14:57Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
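The last entry above evaluates translations without references by comparing source texts directly to system outputs through cross-lingual sentence representations. The sketch below illustrates that setup with cosine similarity between source and hypothesis embeddings; LaBSE via sentence-transformers is used here only as a convenient multilingual encoder and is an assumption, since the paper itself probes M-BERT and LASER.

```python
# Minimal sketch of reference-free MT evaluation: score each hypothesis by the
# cosine similarity between cross-lingual embeddings of the source sentence and
# the system translation. LaBSE is a stand-in encoder here, not the paper's
# exact setup (which studied M-BERT and LASER representations).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def reference_free_scores(sources, hypotheses):
    """One similarity score per (source, hypothesis) pair; no references used."""
    src_emb = encoder.encode(sources, normalize_embeddings=True)
    hyp_emb = encoder.encode(hypotheses, normalize_embeddings=True)
    return np.sum(src_emb * hyp_emb, axis=1)  # row-wise cosine (vectors already normalized)

# Toy German->English example with one adequate and one degraded translation.
sources = ["Der Hund schläft auf dem Sofa.", "Der Hund schläft auf dem Sofa."]
hypotheses = ["The dog is sleeping on the sofa.", "The cat reads a newspaper."]
print(reference_free_scores(sources, hypotheses))
```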