Alternated Training with Synthetic and Authentic Data for Neural Machine
Translation
- URL: http://arxiv.org/abs/2106.08582v1
- Date: Wed, 16 Jun 2021 07:13:16 GMT
- Title: Alternated Training with Synthetic and Authentic Data for Neural Machine
Translation
- Authors: Rui Jiao, Zonghan Yang, Maosong Sun and Yang Liu
- Abstract summary: We propose alternated training with synthetic and authentic data for neural machine translation (NMT).
Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data.
Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over several strong baselines.
- Score: 49.35605028467887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While synthetic bilingual corpora have demonstrated their effectiveness in
low-resource neural machine translation (NMT), adding more synthetic data often
deteriorates translation performance. In this work, we propose alternated
training with synthetic and authentic data for NMT. The basic idea is to
alternate synthetic and authentic corpora iteratively during training. Compared
with previous work, we introduce authentic data as guidance to prevent the
training of NMT models from being disturbed by noisy synthetic data.
Experiments on Chinese-English and German-English translation tasks show that
our approach improves the performance over several strong baselines. We
visualize the BLEU landscape to further investigate the role of authentic and
synthetic data during alternated training. From the visualization, we find that
authentic data helps to direct the NMT model parameters towards points with
higher BLEU scores and leads to consistent translation performance improvement.
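To make the alternation concrete, here is a minimal training-loop sketch of the idea in PyTorch-style Python; the model, optimizer, data loaders, and the choice to alternate at the granularity of full passes are illustrative assumptions, not the authors' implementation:

```python
def alternated_training(model, optimizer, loss_fn,
                        synthetic_loader, authentic_loader, num_rounds=10):
    """Alternate between synthetic and authentic corpora during training.

    Each round first trains on the (possibly noisy) synthetic corpus, then
    revisits the authentic corpus, which serves as guidance pulling the
    parameters back toward regions supported by clean data.
    """
    for _ in range(num_rounds):
        for loader in (synthetic_loader, authentic_loader):
            for src, tgt in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(src, tgt), tgt)  # assumed model(src, tgt) -> logits
                loss.backward()
                optimizer.step()
```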
Related papers
- Non-Fluent Synthetic Target-Language Data Improve Neural Machine
Translation [0.0]
We show that synthetic training samples with non-fluent target sentences can improve translation performance.
This improvement is independent of the size of the original training corpus.
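One illustrative way to obtain such pairs is a word-for-word gloss of monolingual source text, which yields a deliberately non-fluent target side; the toy `dictionary` lexicon and the copy-unknown-words rule below are hypothetical simplifications, not the paper's exact procedure:

```python
def make_nonfluent_pairs(monolingual_src, dictionary):
    """Build synthetic (source, target) pairs whose target sentences are
    deliberately non-fluent: each source token is replaced word-for-word
    with a dictionary entry and no reordering is applied."""
    pairs = []
    for sentence in monolingual_src:
        gloss = [dictionary.get(tok, tok) for tok in sentence.split()]  # copy unknown words
        pairs.append((sentence, " ".join(gloss)))
    return pairs

# Toy usage (hypothetical lexicon):
print(make_nonfluent_pairs(
    ["das ist ein test"],
    {"das": "the", "ist": "is", "ein": "a", "test": "test"}))
# [('das ist ein test', 'the is a test')]
```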
arXiv Detail & Related papers (2024-01-29T11:52:45Z)
- Importance-Aware Data Augmentation for Document-Level Neural Machine Translation [51.74178767827934]
Document-level neural machine translation (DocNMT) aims to generate translations that are both coherent and cohesive.
Due to its longer input length and limited availability of training data, DocNMT often faces the challenge of data sparsity.
We propose a novel Importance-Aware Data Augmentation (IADA) algorithm for DocNMT that augments the training data based on token importance information estimated by the norm of hidden states and training gradients.
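As a rough sketch, per-token importance could be scored by combining the two signals named above; how the hidden states and gradients are obtained, and the product-based combination rule, are assumptions rather than the paper's exact formulation:

```python
import torch

def token_importance(hidden_states: torch.Tensor,
                     embedding_grads: torch.Tensor) -> torch.Tensor:
    """Score each token by combining the norm of its encoder hidden state
    with the norm of the loss gradient w.r.t. its input embedding.

    Both inputs have shape (seq_len, dim); the result has shape (seq_len,).
    Low-scoring tokens would be candidates for augmentation (masking,
    replacement, etc.), while high-scoring tokens are preserved.
    """
    return hidden_states.norm(dim=-1) * embedding_grads.norm(dim=-1)
```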
arXiv Detail & Related papers (2024-01-27T09:27:47Z)
- Better Datastore, Better Translation: Generating Datastores from Pre-Trained Models for Nearest Neural Machine Translation [48.58899349349702]
Nearest Neighbor Machine Translation (kNN-MT) is a simple and effective method of augmenting neural machine translation (NMT) with a token-level nearest neighbor retrieval mechanism.
In this paper, we propose PRED, a framework that leverages Pre-trained models for Datastores in kNN-MT.
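For context, a minimal sketch of the token-level retrieval-and-interpolation step that kNN-MT (and hence PRED) relies on; the L2 distance, the softmax over negative distances, and the fixed interpolation weight are standard kNN-MT conventions here, not PRED's specific datastore construction:

```python
import torch
import torch.nn.functional as F

def knn_mt_distribution(query, keys, values, p_nmt, k=8, temperature=10.0, lam=0.5):
    """One kNN-MT decoding step.

    query:  (dim,)    current decoder hidden state
    keys:   (N, dim)  datastore keys (decoder hidden states from training data)
    values: (N,)      datastore values (target-token ids that followed), long
    p_nmt:  (vocab,)  probability distribution from the NMT model's softmax
    """
    dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)   # (N,) L2 distances
    knn_dists, knn_idx = dists.topk(k, largest=False)          # k nearest entries
    weights = F.softmax(-knn_dists / temperature, dim=-1)      # (k,) retrieval weights
    p_knn = torch.zeros_like(p_nmt)
    p_knn.scatter_add_(0, values[knn_idx], weights)            # aggregate per token id
    return lam * p_knn + (1.0 - lam) * p_nmt                   # interpolated distribution
```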
arXiv Detail & Related papers (2022-12-17T08:34:20Z)
- Improving Simultaneous Machine Translation with Monolingual Data [94.1085601198393]
Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model.
We propose to leverage monolingual data to improve SiMT by training the SiMT student on a combination of bilingual data and external monolingual data distilled by Seq-KD.
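A tiny sketch of how the training data would be assembled under this scheme; `teacher_translate` stands in for the full-sentence teacher model and is assumed to be any callable mapping a source sentence to a translation:

```python
def build_simt_training_data(bilingual_pairs, monolingual_src, teacher_translate):
    """Sequence-level KD for SiMT: a full-sentence NMT teacher translates
    external monolingual source sentences, and the resulting distilled pairs
    are mixed with the authentic bilingual data to train the SiMT student."""
    distilled = [(src, teacher_translate(src)) for src in monolingual_src]
    return bilingual_pairs + distilled
```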
arXiv Detail & Related papers (2022-12-02T14:13:53Z)
- End-to-End Training for Back-Translation with Categorical Reparameterization Trick [0.0]
Back-translation is an effective semi-supervised learning framework in neural machine translation (NMT).
A pre-trained NMT model translates monolingual sentences and produces synthetic bilingual sentence pairs for training the other NMT model.
The discrete nature of the translated sentences prevents gradient information from flowing between the two NMT models.
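The categorical reparameterization in the title is commonly realized with a (straight-through) Gumbel-softmax relaxation; a minimal sketch of how it keeps the path differentiable is below, where feeding the relaxed sample through the embedding matrix is an assumption about how the sample is consumed:

```python
import torch.nn.functional as F

def relaxed_translation_embeddings(logits, embedding_matrix, tau=1.0):
    """Replace the hard argmax over translated tokens with a straight-through
    Gumbel-softmax sample, so gradients can flow from the second NMT model
    back into the first one during end-to-end back-translation training.

    logits:           (seq_len, vocab) unnormalized scores per position
    embedding_matrix: (vocab, dim) target-side embedding table
    """
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)  # discrete forward, soft backward
    return one_hot @ embedding_matrix                       # (seq_len, dim) mixed embeddings
```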
arXiv Detail & Related papers (2022-02-17T06:31:03Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
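For illustration, two of the contrasted corruption styles can be sketched as simple functions over a token list; the masking ratio and the local-shuffle window are arbitrary choices, not the paper's settings:

```python
import random

def mask_tokens(tokens, mask_ratio=0.35, mask_token="<mask>"):
    """MLM-style corruption: hide a fraction of the input tokens; the decoder
    is trained to reconstruct the original sentence."""
    out = list(tokens)
    for i in random.sample(range(len(out)), k=int(len(out) * mask_ratio)):
        out[i] = mask_token
    return out

def reorder_tokens(tokens, window=3):
    """Reordering-style corruption: locally permute tokens so the corrupted
    input still resembles a full (if scrambled) sentence rather than one with
    holes in it."""
    out = list(tokens)
    for start in range(0, len(out), window):
        chunk = out[start:start + window]
        random.shuffle(chunk)
        out[start:start + window] = chunk
    return out
```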
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Synthetic Source Language Augmentation for Colloquial Neural Machine Translation [3.303435360096988]
We develop a novel colloquial Indonesian-English test set collected from YouTube transcripts and Twitter.
We apply synthetic style augmentation to the formal Indonesian source and show that it improves the baseline Id-En models.
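A toy sketch of what lexical style augmentation could look like; the `style_map` formal-to-colloquial lexicon is entirely hypothetical and only illustrates the substitution idea:

```python
def colloquialize(sentence, style_map):
    """Rewrite a formal source sentence into a more colloquial register by
    word-level substitution, yielding a synthetic colloquial source paired
    with the original English target."""
    return " ".join(style_map.get(tok, tok) for tok in sentence.split())
```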
arXiv Detail & Related papers (2020-12-30T14:52:15Z)
- On the Inference Calibration of Neural Machine Translation [54.48932804996506]
We study the correlation between calibration and translation performance, as well as the linguistic properties of miscalibration.
We propose a new graduated label smoothing method that can improve both inference calibration and translation performance.
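As a minimal sketch, graduated label smoothing can be read as label smoothing whose strength grows with the model's confidence in a token; the confidence thresholds and multipliers below are illustrative, not the paper's exact values:

```python
import torch

def graduated_label_smoothing(targets, confidence, vocab_size, eps=0.1):
    """Build smoothed target distributions where high-confidence tokens
    receive stronger smoothing than low-confidence ones.

    targets:    (batch,) gold token ids (long)
    confidence: (batch,) model confidence for each token, in [0, 1]
    """
    smooth = torch.full_like(confidence, eps)
    smooth[confidence > 0.7] = 3.0 * eps   # heavily smooth confident tokens
    smooth[confidence < 0.3] = 0.0         # leave uncertain tokens unsmoothed
    dist = (smooth / (vocab_size - 1)).unsqueeze(-1).expand(-1, vocab_size).clone()
    dist.scatter_(1, targets.unsqueeze(-1), (1.0 - smooth).unsqueeze(-1))
    return dist                            # (batch, vocab) target distributions
```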
arXiv Detail & Related papers (2020-05-03T02:03:56Z)
- AR: Auto-Repair the Synthetic Data for Neural Machine Translation [34.36472405208541]
We propose a novel Auto-Repair (AR) framework to improve the quality of synthetic data.
Our proposed AR model learns the transformation from low-quality (noisy) input sentences to high-quality sentences.
Our approach can effectively improve the quality of synthetic parallel data, and the NMT model trained with the repaired synthetic data achieves consistent improvements.
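A minimal sketch of where such a repair model would sit in the pipeline; `repair_model` stands in for the trained AR model as any callable mapping a noisy sentence to a cleaner one, and repairing the synthetic side of each pair (assumed here to be the source side) is an assumption about how the corpus was produced:

```python
def repair_synthetic_corpus(synthetic_pairs, repair_model):
    """Run a trained repair model over the synthetic side of each bilingual
    pair before the pairs are added to the NMT training data."""
    return [(repair_model(src), tgt) for src, tgt in synthetic_pairs]
```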
arXiv Detail & Related papers (2020-04-05T13:18:18Z)