Decoding and Diversity in Machine Translation
- URL: http://arxiv.org/abs/2011.13477v1
- Date: Thu, 26 Nov 2020 21:09:38 GMT
- Title: Decoding and Diversity in Machine Translation
- Authors: Nicholas Roberts, Davis Liang, Graham Neubig, Zachary C. Lipton
- Abstract summary: We characterize distributional differences between generated and real translations, examining the cost in diversity paid for the BLEU scores enjoyed by NMT.
Our study implicates search as a salient source of known bias when translating gender pronouns.
- Score: 90.33636694717954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural Machine Translation (NMT) systems are typically evaluated using
automated metrics that assess the agreement between generated translations and
ground truth candidates. To improve systems with respect to these metrics, NLP
researchers employ a variety of heuristic techniques, including searching for
the conditional mode (vs. sampling) and incorporating various training
heuristics (e.g., label smoothing). While search strategies significantly
improve BLEU score, they yield deterministic outputs that lack the diversity of
human translations. Moreover, search tends to bias the distribution of
translated gender pronouns. This makes human-level BLEU a misleading benchmark
in that modern MT systems cannot approach human-level BLEU while simultaneously
maintaining human-level translation diversity. In this paper, we characterize
distributional differences between generated and real translations, examining
the cost in diversity paid for the BLEU scores enjoyed by NMT. Moreover, our
study implicates search as a salient source of known bias when translating
gender pronouns.
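To make the decoding contrast concrete, the sketch below compares mode-seeking beam search with ancestral sampling via the Hugging Face transformers generate API. The Helsinki-NLP/opus-mt-en-de checkpoint and the example sentence are illustrative assumptions, not the systems or data studied in the paper.

```python
# Minimal sketch: deterministic beam search vs. ancestral sampling.
# Assumes the public Helsinki-NLP/opus-mt-en-de checkpoint (illustrative only).
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

inputs = tokenizer("The doctor finished her shift.", return_tensors="pt")

# Beam search approximates the conditional mode: higher BLEU, but the
# same output on every run.
beam_ids = model.generate(**inputs, num_beams=5, do_sample=False)

# Ancestral sampling draws from the model's full distribution: diverse
# outputs across runs, typically at some cost in BLEU.
sample_ids = model.generate(**inputs, do_sample=True, top_k=0, temperature=1.0)

print("beam:  ", tokenizer.decode(beam_ids[0], skip_special_tokens=True))
print("sample:", tokenizer.decode(sample_ids[0], skip_special_tokens=True))
```

Rerunning the sampling call produces different translations from run to run, while the beam output never changes; the gap between those two regimes is the diversity cost the paper quantifies.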
Related papers
- BiVert: Bidirectional Vocabulary Evaluation using Relations for Machine Translation [4.651581292181871]
We propose a bidirectional semantic-based evaluation method designed to assess the sense distance of the translation from the source text.
This approach employs the comprehensive multilingual encyclopedic dictionary BabelNet.
Factual analysis shows a strong correlation between the average evaluation scores generated by our method and the human assessments across various machine translation systems for the English-German language pair.
arXiv Detail & Related papers (2024-03-06T08:02:21Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and tendencies inherent to the metric paradigm itself.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Revisiting Machine Translation for Cross-lingual Classification [91.43729067874503]
Most research in the area focuses on multilingual models rather than on the Machine Translation component.
We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine translated text, translate-test can do substantially better than previously assumed.
arXiv Detail & Related papers (2023-05-23T16:56:10Z)
- Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
- HilMeMe: A Human-in-the-Loop Machine Translation Evaluation Metric Looking into Multi-Word Expressions [6.85316573653194]
We describe the design and implementation of a linguistically motivated, human-in-the-loop evaluation metric that looks into idiomatic and terminological Multi-word Expressions (MWEs).
MWEs can serve as one of the main factors for distinguishing MT systems, based on their ability to recognise and translate MWEs accurately and in a meaning-equivalent manner.
arXiv Detail & Related papers (2022-11-09T21:15:40Z)
- Exploring Diversity in Back Translation for Low-Resource Machine Translation [85.03257601325183]
Back translation is one of the most widely used methods for improving the performance of neural machine translation systems.
Recent research has sought to enhance the effectiveness of this method by increasing the 'diversity' of the generated translations.
This work puts forward a more nuanced framework for understanding diversity in training data, splitting it into lexical diversity and syntactic diversity.
arXiv Detail & Related papers (2022-06-01T15:21:16Z)
- Sentiment-based Candidate Selection for NMT [2.580271290008534]
We propose a decoder-side approach that incorporates automatic sentiment scoring into the machine translation (MT) candidate selection process.
We train separate English and Spanish sentiment classifiers, then, using n-best candidates generated by a baseline MT model with beam search, select the candidate that minimizes the absolute difference between the sentiment score of the source sentence and that of the translation.
The results of human evaluations show that, in comparison to the open-source MT model on top of which our pipeline is built, our approach produces more accurate translations of colloquial, sentiment-heavy source texts.
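The selection rule described above reduces to an argmin over the n-best list; here is a minimal sketch, with hypothetical scorers score_src and score_tgt standing in for the paper's trained English and Spanish sentiment classifiers.

```python
from typing import Callable, List

def select_candidate(source: str,
                     candidates: List[str],
                     score_src: Callable[[str], float],
                     score_tgt: Callable[[str], float]) -> str:
    # Hypothetical sentiment scorers stand in for the trained classifiers.
    # Pick the beam-search candidate whose sentiment score is closest to
    # the source sentence's.
    src_sentiment = score_src(source)
    return min(candidates, key=lambda c: abs(src_sentiment - score_tgt(c)))
```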
arXiv Detail & Related papers (2021-04-10T19:01:52Z)
- Machine Translationese: Effects of Algorithmic Bias on Linguistic Complexity in Machine Translation [2.0625936401496237]
We go beyond the study of gender in Machine Translation and investigate how bias amplification might affect language in a broader sense.
We assess the linguistic richness (on a lexical and morphological level) of translations created by different data-driven MT paradigms.
arXiv Detail & Related papers (2021-01-30T18:49:11Z)
- On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [55.02832094101173]
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual similarity.
This paper concerns itself with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations.
We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER.
We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations.
arXiv Detail & Related papers (2020-05-03T22:10:23Z)
- BLEU might be Guilty but References are not Innocent [34.817010352734]
We study different methods to collect references and compare their value in automated evaluation.
Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task.
Our method yields higher correlation with human judgment not only for the WMT 2019 English-to-German submissions, but also for back-translation and APE-augmented MT output.
arXiv Detail & Related papers (2020-04-13T16:49:09Z)