Exploring Diversity in Back Translation for Low-Resource Machine
Translation
- URL: http://arxiv.org/abs/2206.00564v1
- Date: Wed, 1 Jun 2022 15:21:16 GMT
- Authors: Laurie Burchell, Alexandra Birch, Kenneth Heafield
- Score: 85.03257601325183
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Back translation is one of the most widely used methods for improving the
performance of neural machine translation systems. Recent research has sought
to enhance the effectiveness of this method by increasing the 'diversity' of
the generated translations. We argue that the definitions and metrics used to
quantify 'diversity' in previous work have been insufficient. This work puts
forward a more nuanced framework for understanding diversity in training data,
splitting it into lexical diversity and syntactic diversity. We present novel
metrics for measuring these different aspects of diversity and carry out
empirical analysis into the effect of these types of diversity on final neural
machine translation model performance for low-resource
English$\leftrightarrow$Turkish and mid-resource
English$\leftrightarrow$Icelandic. Our findings show that generating back
translation using nucleus sampling results in higher final model performance,
and that this method of generation has high levels of both lexical and
syntactic diversity. We also find evidence that lexical diversity is more
important than syntactic for back translation performance.
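The abstract credits nucleus (top-p) sampling with producing the most diverse and best-performing back translations. As a rough, framework-free sketch of the decoding idea (not the authors' implementation), top-p sampling keeps only the smallest set of highest-probability tokens whose cumulative mass reaches p, then samples the next token from that renormalised set:

```python
import random

def nucleus_sample(probs, p=0.9, rng=random):
    """Sample an index from `probs`, restricted to the smallest set of
    top-ranked tokens whose cumulative probability reaches p."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for idx, prob in ranked:
        nucleus.append((idx, prob))
        total += prob
        if total >= p:
            break  # the 'nucleus' is complete
    # Sample within the nucleus, implicitly renormalising by `total`.
    r = rng.random() * total
    acc = 0.0
    for idx, prob in nucleus:
        acc += prob
        if r <= acc:
            return idx
    return nucleus[-1][0]
```

In a real NMT decoder this step would run over the model's softmax output at every timestep; the truncation is what suppresses the low-probability tail while still allowing varied, high-quality outputs.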
Related papers
- Towards Tailored Recovery of Lexical Diversity in Literary Machine Translation [11.875491080062233]
Machine translations are found to be lexically poorer than human translations.
We propose a novel approach that consists of reranking translation candidates with a classifier that distinguishes between original and translated text.
We evaluate our approach on 31 English-to-Dutch book translations, and find that, for certain books, our approach retrieves lexical diversity scores that are close to human translation.
arXiv Detail & Related papers (2024-08-30T14:12:04Z)
- A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation [8.30255326875704]
Subword regularisation boosts synergy in multilingual modelling, whereas BPE more effectively facilitates transfer during cross-lingual fine-tuning.
Our study confirms that decisions around subword modelling can be key to optimising the benefits of multilingual modelling.
arXiv Detail & Related papers (2024-03-29T13:09:23Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- Modelling Latent Translations for Cross-Lingual Transfer [47.61502999819699]
We propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model.
We evaluate our novel latent translation-based model on a series of multilingual NLU tasks.
We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average.
arXiv Detail & Related papers (2021-07-23T17:11:27Z)
- Decoding and Diversity in Machine Translation [90.33636694717954]
We characterize the cost in diversity paid for the BLEU scores enjoyed by NMT systems.
Our study implicates search as a salient source of known bias when translating gender pronouns.
arXiv Detail & Related papers (2020-11-26T21:09:38Z)
- Uncertainty-Aware Semantic Augmentation for Neural Machine Translation [37.555675157198145]
We propose uncertainty-aware semantic augmentation, which explicitly captures the universal semantic information among multiple semantically-equivalent source sentences.
Our approach significantly outperforms the strong baselines and the existing methods.
arXiv Detail & Related papers (2020-10-09T07:48:09Z)
- Informed Sampling for Diversity in Concept-to-Text NLG [8.883733362171034]
We propose an Imitation Learning approach to explore the level of diversity that a language generation model can reliably produce.
Specifically, we augment the decoding process with a meta-classifier trained to distinguish which words at any given timestep will lead to high-quality output.
arXiv Detail & Related papers (2020-04-29T17:43:24Z)
- Translation Artifacts in Cross-lingual Transfer Learning [51.66536640084888]
We show that machine translation can introduce subtle artifacts that have a notable impact in existing cross-lingual models.
In natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them.
We also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
arXiv Detail & Related papers (2020-04-09T17:54:30Z)
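The artifact in that last paper hinges on lexical overlap between premise and hypothesis. As a simple illustration (this is a generic overlap measure, not the paper's own metric), the fraction of hypothesis word types also present in the premise can be computed as:

```python
def lexical_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis word types that also appear in the premise.
    Translating the two sentences independently tends to lower this value,
    which is the artifact the paper describes."""
    premise_types = set(premise.lower().split())
    hypothesis_types = set(hypothesis.lower().split())
    if not hypothesis_types:
        return 0.0
    return len(premise_types & hypothesis_types) / len(hypothesis_types)
```

For example, `lexical_overlap("the cat sat on the mat", "the cat sat down")` shares three of the hypothesis's four word types with the premise, giving 0.75.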
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.