On the Limitations of Cross-lingual Encoders as Exposed by
Reference-Free Machine Translation Evaluation
- URL: http://arxiv.org/abs/2005.01196v3
- Date: Mon, 8 Jun 2020 11:27:25 GMT
- Title: On the Limitations of Cross-lingual Encoders as Exposed by
Reference-Free Machine Translation Evaluation
- Authors: Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West,
Steffen Eger
- Abstract summary: Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
In this paper, we concern ourselves with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
- Score: 55.02832094101173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluation of cross-lingual encoders is usually performed either via
zero-shot cross-lingual transfer in supervised downstream tasks or via
unsupervised cross-lingual textual similarity. In this paper, we concern
ourselves with reference-free machine translation (MT) evaluation where we
directly compare source texts to (sometimes low-quality) system translations,
which represents a natural adversarial setup for multilingual encoders.
Reference-free evaluation holds the promise of web-scale comparison of MT
systems. We systematically investigate a range of metrics based on
state-of-the-art cross-lingual semantic representations obtained with
pretrained M-BERT and LASER. We find that they perform poorly as semantic
encoders for reference-free MT evaluation and identify their two key
limitations, namely, (a) a semantic mismatch between representations of mutual
translations and, more prominently, (b) the inability to punish
"translationese", i.e., low-quality literal translations. We propose two
partial remedies: (1) post-hoc re-alignment of the vector spaces and (2)
coupling of semantic-similarity based metrics with target-side language
modeling. In segment-level MT evaluation, our best metric surpasses
reference-based BLEU by 5.7 correlation points.
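The abstract's recipe and its two remedies lend themselves to a compact illustration. The sketch below is not the authors' implementation: `embed` and `lm_logprob` are hypothetical placeholders standing in for M-BERT/LASER sentence embeddings and a target-side language model, and the re-alignment step is modeled as an orthogonal Procrustes fit on a seed set of mutual translations, one standard way to realize "post-hoc re-alignment of the vector spaces".

```python
# Hedged sketch (not the authors' code): reference-free MT scoring with a
# cross-lingual sentence encoder, plus the paper's two partial remedies.
# `embed` and `lm_logprob` are hypothetical placeholders; a real setup would
# substitute M-BERT/LASER embeddings and a trained target-side language model.
import numpy as np

rng = np.random.default_rng(0)

def embed(sentences, dim=1024):
    # Placeholder encoder: random vectors so the sketch runs standalone.
    return rng.standard_normal((len(sentences), dim))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def realignment_map(src_seed, tgt_seed):
    """Remedy (1): post-hoc re-alignment of the vector spaces.
    Solves the orthogonal Procrustes problem on a seed set of known mutual
    translations (row i of src_seed translates row i of tgt_seed)."""
    u, _, vt = np.linalg.svd(src_seed.T @ tgt_seed)
    return u @ vt  # apply as: src_vec @ W

def score(src, hyp, W=None, lm_logprob=None, alpha=0.5):
    """Cross-lingual cosine similarity, optionally re-aligned via W, and
    optionally coupled with a target-side LM score (remedy (2))."""
    s, h = embed([src])[0], embed([hyp])[0]
    if W is not None:
        s = s @ W  # map the source vector into the target space
    sim = cosine(s, h)
    if lm_logprob is None:
        return sim
    # Linear interpolation; alpha would be tuned on held-out data.
    return alpha * sim + (1.0 - alpha) * lm_logprob(hyp)

# Example: fit the re-alignment on a toy seed set, then score one hypothesis
# with a dummy length-based LM stand-in.
src_seed, tgt_seed = embed(["guten Tag"] * 32), embed(["good day"] * 32)
W = realignment_map(src_seed, tgt_seed)
print(score("Der Hund bellt.", "The dog barks.", W=W,
            lm_logprob=lambda s: -0.1 * len(s.split())))
```

The LM term is what gives the metric a handle on "translationese": a literal, disfluent translation can sit close to the source in embedding space yet still receive a low target-side LM score.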
Related papers
- BiVert: Bidirectional Vocabulary Evaluation using Relations for Machine
Translation [4.651581292181871]
We propose a bidirectional semantic-based evaluation method designed to assess the sense distance of the translation from the source text.
This approach employs the comprehensive multilingual encyclopedic dictionary BabelNet.
Factual analysis shows a strong correlation between the average evaluation scores generated by our method and the human assessments across various machine translation systems for the English-German language pair.
arXiv Detail & Related papers (2024-03-06T08:02:21Z)
- MT-Ranker: Reference-free machine translation evaluation by inter-system
ranking [14.188948302661933]
We show that MT-Ranker, trained without any human annotations, achieves state-of-the-art results on the WMT Shared Metrics Task benchmarks DARR20, MQM20, and MQM21.
MT-Ranker sets a new state of the art against both reference-free and reference-based baselines.
arXiv Detail & Related papers (2024-01-30T15:30:03Z)
- Towards Effective Disambiguation for Machine Translation with Large
Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by
Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness defects: distribution biases in the training datasets and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of downstream outcomes (a sketch of such segment-level correlation follows this list).
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., learning to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- Modelling Latent Translations for Cross-Lingual Transfer [47.61502999819699]
We propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model.
We evaluate our novel latent translation-based model on a series of multilingual NLU tasks.
We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average.
arXiv Detail & Related papers (2021-07-23T17:11:27Z)
- BLEU might be Guilty but References are not Innocent [34.817010352734]
We study different methods to collect references and compare their value in automated evaluation.
Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task.
Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for back-translation and APE-augmented MT outputs.
arXiv Detail & Related papers (2020-04-13T16:49:09Z)
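Several results above, like the abstract's "5.7 correlation points" over BLEU, rest on segment-level correlation between metric scores and a quality signal (human judgments, or downstream outcomes in the extrinsic-evaluation paper). A minimal sketch of that computation with dummy scores, assuming scipy is available:

```python
# Minimal sketch of segment-level meta-evaluation: correlate a metric's
# per-segment scores against a quality signal. Dummy data, illustration only.
from scipy.stats import kendalltau, pearsonr

human  = [0.2, 0.9, 0.4, 0.7, 0.1]   # e.g. human direct-assessment scores
metric = [0.3, 0.8, 0.5, 0.6, 0.2]   # candidate metric, same five segments

tau, _ = kendalltau(human, metric)   # rank agreement, WMT segment-level style
r, _ = pearsonr(human, metric)       # linear correlation
print(f"Kendall tau = {tau:.3f}, Pearson r = {r:.3f}")
```

On this scale, "surpassing BLEU by 5.7 correlation points" corresponds to a correlation coefficient roughly 0.057 higher than BLEU's, computed over the same segments.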