Non-Fluent Synthetic Target-Language Data Improve Neural Machine
Translation
- URL: http://arxiv.org/abs/2401.16086v1
- Date: Mon, 29 Jan 2024 11:52:45 GMT
- Title: Non-Fluent Synthetic Target-Language Data Improve Neural Machine
Translation
- Authors: Víctor M. Sánchez-Cartagena, Miquel Esplà-Gomis, Juan Antonio
Pérez-Ortiz, Felipe Sánchez-Martínez
- Abstract summary: We show that synthetic training samples with non-fluent target sentences can improve translation performance.
This improvement is independent of the size of the original training corpus.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When the amount of parallel sentences available to train a neural machine
translation system is scarce, a common practice is to generate new synthetic training
samples from them. A number of approaches have been proposed to produce
synthetic parallel sentences that are similar to those in the parallel data
available. These approaches work under the assumption that non-fluent
target-side synthetic training samples are harmful and may degrade
translation performance. In this paper, however, we demonstrate that synthetic
training samples with non-fluent target sentences can improve translation
performance if they are used in a multilingual machine translation framework as
if they were sentences in another language. We conducted experiments on ten
low-resource and four high-resource translation tasks and found that this
simple approach consistently improves translation performance as compared to
state-of-the-art methods for generating synthetic training samples similar to
those found in corpora. Furthermore, this improvement is independent of the
size of the original training corpus, and the resulting systems are much more
robust against domain shift and produce fewer hallucinations.
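The core recipe is simple enough to sketch: in a multilingual NMT setup, every training pair is prefixed with a token naming its target language, and the synthetic pairs with non-fluent targets simply get their own pseudo-language token. A minimal sketch, where the tag names and the make_nonfluent corruption are illustrative assumptions rather than the authors' exact setup:

```python
# Minimal sketch: treat non-fluent synthetic targets as a separate "language"
# in a multilingual NMT corpus via target-language tags. The tag names and
# make_nonfluent() are illustrative assumptions, not the paper's exact setup.
import random

def make_nonfluent(tokens):
    # Toy corruption: shuffle a copy of the target tokens. Any generator of
    # non-fluent target-side text could be substituted here.
    out = tokens[:]
    random.shuffle(out)
    return out

def build_corpus(parallel_pairs):
    """parallel_pairs: list of (src_tokens, tgt_tokens)."""
    corpus = []
    for src, tgt in parallel_pairs:
        # Authentic pair, tagged with the real target language.
        corpus.append((["<2tgt>"] + src, tgt))
        # Synthetic pair with a non-fluent target, tagged as a pseudo-language
        # so it is not confused with fluent target-language text.
        corpus.append((["<2syn>"] + src, make_nonfluent(tgt)))
    return corpus

pairs = [("we like cats".split(), "nos gustan los gatos".split())]
for s, t in build_corpus(pairs):
    print(s, "->", t)
```

Because the pseudo-language tag keeps the non-fluent targets apart from fluent target-language text, the model can exploit the extra training signal without learning to emit non-fluent output.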
Related papers
- Synthetic Pre-Training Tasks for Neural Machine Translation [16.6378815054841]
Our goal is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources.
We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge.
Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data.
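One of the high-obfuscation settings can be illustrated by consistently replacing every word type with an opaque ID, so the parallel structure survives pre-training while the lexical content does not; the mapping scheme below is an illustrative assumption, not the authors' exact recipe.

```python
# Hedged sketch of a "high obfuscation" pre-training corpus: each word type
# is consistently mapped to an opaque ID, preserving parallel structure while
# removing lexical content. The mapping scheme is an illustrative assumption.
def obfuscate_corpus(pairs):
    vocab = {}
    def opaque(tok):
        if tok not in vocab:
            vocab[tok] = f"w{len(vocab)}"   # stable, meaningless ID per type
        return vocab[tok]
    return [([opaque(t) for t in src], [opaque(t) for t in tgt])
            for src, tgt in pairs]

pairs = [("the cat sleeps".split(), "le chat dort".split())]
print(obfuscate_corpus(pairs))
# [(['w0', 'w1', 'w2'], ['w3', 'w4', 'w5'])]
```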
arXiv Detail & Related papers (2022-12-19T21:34:00Z)
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
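As a rough illustration of the objective: an adversary shifts weight toward the languages with the highest loss, and the model then minimizes the weighted loss. The exponentiated weighting and its temperature below are illustrative assumptions, not the paper's exact scheme.

```python
# Illustrative sketch of a distributionally robust objective over languages,
# alternated in best-response style: the adversary re-weights languages toward
# high loss, then the model would minimize the weighted loss. The weighting
# rule and temperature are assumptions for illustration.
import math

def best_response_weights(losses, temperature=1.0):
    # Adversary's response: exponentiated weights on per-language losses.
    expw = [math.exp(l / temperature) for l in losses]
    z = sum(expw)
    return [w / z for w in expw]

def robust_loss(per_lang_losses, weights):
    # The model's training step would minimize this weighted loss next.
    return sum(w * l for w, l in zip(weights, per_lang_losses))

losses = [0.9, 2.3, 1.1]            # e.g., per-language dev losses
w = best_response_weights(losses)
print(w, robust_loss(losses, w))    # weight mass shifts to language 2
```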
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
- Modelling Latent Translations for Cross-Lingual Transfer [47.61502999819699]
We propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model.
We evaluate our novel latent translation-based model on a series of multilingual NLU tasks.
We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average.
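The pipeline-in-one-model idea can be sketched as Monte Carlo marginalization over sampled translations; translator.sample and classifier.predict_proba are assumed interfaces for illustration, not the paper's actual API.

```python
# Sketch of translate-then-classify with the translation as a latent variable:
# classify several sampled translations and average the class probabilities.
# translator.sample() and classifier.predict_proba() are assumed interfaces.
def latent_translation_predict(translator, classifier, src_sentence, k=5):
    probs = None
    for _ in range(k):
        hyp = translator.sample(src_sentence)       # one latent translation
        p = classifier.predict_proba(hyp)           # class distribution
        probs = p if probs is None else [a + b for a, b in zip(probs, p)]
    return [p / k for p in probs]                   # Monte Carlo marginal
```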
arXiv Detail & Related papers (2021-07-23T17:11:27Z)
- Alternated Training with Synthetic and Authentic Data for Neural Machine Translation [49.35605028467887]
We propose alternated training with synthetic and authentic data for neural machine translation (NMT).
Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data.
Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over several strong baselines.
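A minimal sketch of the alternation, with the phase lengths and the train_on update step as illustrative assumptions:

```python
# Minimal sketch of alternated training: phases of synthetic (e.g.,
# back-translated) batches alternate with phases of authentic batches, so the
# authentic data periodically re-anchors the model. Phase lengths and the
# train_on() single-step update API are illustrative assumptions.
from itertools import cycle, islice

def alternated_training(model, synthetic_batches, authentic_batches,
                        phases=4, steps_per_phase=1000):
    syn, auth = cycle(synthetic_batches), cycle(authentic_batches)
    for phase in range(phases):
        source = syn if phase % 2 == 0 else auth
        for batch in islice(source, steps_per_phase):
            model.train_on(batch)   # assumed single-step update API
```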
arXiv Detail & Related papers (2021-06-16T07:13:16Z)
- Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation [98.83925811122795]
We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data.
We compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data.
Experimental results on large-scale WMT English⇒German and English⇒Chinese datasets demonstrate the effectiveness of the proposed approach.
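The uncertainty criterion can be sketched as the average translation entropy of a sentence's words under a probabilistic bilingual dictionary; the dictionary format below is an assumption for illustration.

```python
# Sketch of the uncertainty criterion: score each monolingual sentence by the
# average translation entropy of its words under a probabilistic bilingual
# dictionary extracted from parallel data. The dictionary format is assumed.
import math

def word_entropy(trans_probs):
    return -sum(p * math.log(p) for p in trans_probs.values() if p > 0)

def sentence_uncertainty(tokens, dictionary):
    ents = [word_entropy(dictionary[t]) for t in tokens if t in dictionary]
    return sum(ents) / len(ents) if ents else 0.0

dictionary = {"bank": {"Bank": 0.5, "Ufer": 0.5},   # ambiguous -> high entropy
              "cat":  {"Katze": 1.0}}               # unambiguous -> zero
for sent in ["the cat".split(), "the bank".split()]:
    print(sent, sentence_uncertainty(sent, dictionary))
```

Under this criterion, higher-entropy sentences are treated as the more informative candidates for self-training.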
arXiv Detail & Related papers (2021-06-02T05:01:36Z)
- Meta Back-translation [111.87397401837286]
We propose a novel method to generate pseudo-parallel data from a pre-trained back-translation model.
Our method is a meta-learning algorithm which adapts a pre-trained back-translation model so that the pseudo-parallel data it generates would train a forward-translation model to do well on a validation set.
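A heavily simplified sketch of the loop follows: the paper differentiates through the forward model's update, whereas here a REINFORCE-style surrogate stands in for that meta-gradient, and all model interfaces (sample_with_logprob, loss, the optimizers) are illustrative assumptions.

```python
# Simplified sketch of meta back-translation: the back-translator is updated
# so that the pseudo-parallel data it generates improves the forward model on
# a held-out batch. A REINFORCE-style surrogate replaces the paper's full
# meta-gradient; all model interfaces are illustrative assumptions.
import torch

def meta_bt_step(bt_model, fwd_model, bt_opt, fwd_opt, mono_tgt, dev_batch):
    # 1) Back-translate monolingual target text into pseudo-sources.
    pseudo_src, logp = bt_model.sample_with_logprob(mono_tgt)
    # 2) One forward-model step on the generated pseudo-parallel pair.
    fwd_opt.zero_grad()
    fwd_model.loss(pseudo_src, mono_tgt).backward()
    fwd_opt.step()
    # 3) Reward the back-translator when its data lowers the forward model's
    #    loss on a validation batch (first-order surrogate for meta-gradient).
    with torch.no_grad():
        reward = -fwd_model.loss(*dev_batch)
    bt_opt.zero_grad()
    (-reward * logp.sum()).backward()
    bt_opt.step()
```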
arXiv Detail & Related papers (2021-02-15T20:58:32Z)
- Enriching Non-Autoregressive Transformer with Syntactic and Semantic Structures for Neural Machine Translation [54.864148836486166]
We propose to incorporate the explicit syntactic and semantic structures of languages into a non-autoregressive Transformer.
Our model decodes significantly faster while maintaining translation quality compared with several state-of-the-art non-autoregressive models.
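One common way to expose such structure to a decoder, sketched here as an assumption rather than the paper's architecture, is to add auxiliary syntactic and semantic prediction losses on the decoder states:

```python
# Hedged sketch: auxiliary syntactic (POS) and semantic-role prediction losses
# added to the usual token loss of a non-autoregressive decoder. The heads,
# weights, and tensor shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def nat_multitask_loss(decoder_states, heads, targets, aux_weight=0.5):
    """decoder_states: (batch, len, dim); heads/targets keyed by task name."""
    # Token prediction loss; transpose to (batch, classes, len) for CE.
    loss = F.cross_entropy(heads["token"](decoder_states).transpose(1, 2),
                           targets["token"])
    for task in ("pos", "srl"):                      # structural side tasks
        loss = loss + aux_weight * F.cross_entropy(
            heads[task](decoder_states).transpose(1, 2), targets[task])
    return loss
```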
arXiv Detail & Related papers (2021-01-22T04:12:17Z)
- Incorporating Bilingual Dictionaries for Low Resource Semi-Supervised Neural Machine Translation [5.958653653305609]
We incorporate widely available bilingual dictionaries that yield word-by-word translations to generate synthetic sentences.
This automatically expands the vocabulary of the model while maintaining high quality content.
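The augmentation itself is essentially word-by-word lookup, which also shows how new vocabulary enters the training data; the dictionary and the keep-unknown-words fallback below are illustrative.

```python
# Minimal sketch of the dictionary-based augmentation: build a synthetic
# target sentence by word-by-word lookup in a bilingual dictionary, keeping
# out-of-dictionary words as-is. Dictionary and fallback are illustrative.
def word_by_word(tokens, dictionary):
    return [dictionary.get(t, t) for t in tokens]

dictionary = {"good": "gut", "morning": "Morgen"}
print(word_by_word("good morning friends".split(), dictionary))
# ['gut', 'Morgen', 'friends']  -> new vocabulary enters the training data
```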
arXiv Detail & Related papers (2020-04-05T02:14:14Z)