Understanding and Improving Sequence-to-Sequence Pretraining for Neural Machine Translation
- URL: http://arxiv.org/abs/2203.08442v1
- Date: Wed, 16 Mar 2022 07:36:28 GMT
- Title: Understanding and Improving Sequence-to-Sequence Pretraining for Neural Machine Translation
- Authors: Wenxuan Wang, Wenxiang Jiao, Yongchang Hao, Xing Wang, Shuming Shi, Zhaopeng Tu, Michael Lyu
- Abstract summary: We study the impact of the jointly pretrained decoder, which is the main difference between Seq2Seq pretraining and previous encoder-based pretraining approaches for NMT.
We propose simple and effective strategies, named in-domain pretraining and input adaptation to remedy the domain and objective discrepancies.
- Score: 48.50842995206353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a substantial step in better understanding the SOTA
sequence-to-sequence (Seq2Seq) pretraining for neural machine
translation (NMT). We focus on studying the impact of the jointly pretrained
decoder, which is the main difference between Seq2Seq pretraining and previous
encoder-based pretraining approaches for NMT. By carefully designing
experiments on three language pairs, we find that Seq2Seq pretraining is a
double-edged sword: On one hand, it helps NMT models to produce more diverse
translations and reduce adequacy-related translation errors. On the other hand,
the discrepancies between Seq2Seq pretraining and NMT finetuning limit the
translation quality (i.e., domain discrepancy) and induce the over-estimation
issue (i.e., objective discrepancy). Based on these observations, we further
propose simple and effective strategies, named in-domain pretraining and input
adaptation to remedy the domain and objective discrepancies, respectively.
Experimental results on several language pairs show that our approach can
consistently improve both translation performance and model robustness upon
Seq2Seq pretraining.
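The abstract names the two remedies but does not spell out their recipes, so the following is only a rough, hypothetical sketch of the input-adaptation idea: mix noised copies of source sentences into the NMT finetuning data so that finetuning inputs look more like the corrupted inputs seen during Seq2Seq pretraining. The mask symbol, noise ratio, and mixing probability below are illustrative assumptions, not the authors' settings; in-domain pretraining, as the name suggests, would instead pretrain on monolingual text drawn from the translation domain rather than general-domain data.

```python
import random

MASK = "<mask>"  # placeholder symbol; a real setup would reuse the pretrained model's mask token


def noise_source(tokens, mask_ratio=0.3, rng=random):
    """Return a copy of a tokenized source sentence with a fraction of tokens
    masked, loosely mimicking the corrupted inputs of denoising pretraining."""
    return [MASK if rng.random() < mask_ratio else tok for tok in tokens]


def adapt_finetuning_data(parallel_pairs, noise_prob=0.5, rng=random):
    """Yield (source, target) finetuning pairs, replacing the source side with a
    noised copy for a fraction of the examples (hypothetical mixing scheme)."""
    for src, tgt in parallel_pairs:
        if rng.random() < noise_prob:
            yield noise_source(src, rng=rng), tgt
        else:
            yield list(src), tgt


# Toy tokenized parallel corpus (English -> German) for demonstration only.
corpus = [
    (["the", "patient", "was", "discharged"], ["der", "patient", "wurde", "entlassen"]),
    (["resistance", "to", "antibiotics", "is", "rising"], ["die", "antibiotikaresistenz", "steigt"]),
]
for src, tgt in adapt_finetuning_data(corpus):
    print(src, "->", tgt)
```

The mixing keeps some clean pairs so the model still sees ordinary translation inputs; how much noise to inject, and whether to anneal it during finetuning, is left open in this sketch.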
Related papers
- On the Pareto Front of Multilingual Neural Machine Translation [123.94355117635293]
We study how the performance of a given direction changes with its sampling ratio in Multilingual Neural Machine Translation (MNMT).
We propose the Double Power Law to predict the unique performance trade-off front in MNMT.
In our experiments, it achieves better performance than temperature searching and gradient manipulation methods with only 1/5 to 1/2 of the total training budget.
arXiv Detail & Related papers (2023-04-06T16:49:19Z)
- Denoising-based UNMT is more robust to word-order divergence than MASS-based UNMT [27.85834441076481]
We investigate whether UNMT approaches with self-supervised pre-training are robust to word-order divergence between language pairs.
We compare two models pre-trained with the same self-supervised pre-training objective.
We observe that the DAE-based UNMT approach consistently outperforms MASS in terms of translation accuracy.
arXiv Detail & Related papers (2023-03-02T12:11:58Z)
- Non-Parametric Domain Adaptation for End-to-End Speech Translation [72.37869362559212]
End-to-End Speech Translation (E2E-ST) has received increasing attention due to its potential for less error propagation, lower latency, and fewer parameters.
We propose a novel non-parametric method that leverages a domain-specific text translation corpus to achieve domain adaptation for the E2E-ST system.
arXiv Detail & Related papers (2022-05-23T11:41:02Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- Modelling Latent Translations for Cross-Lingual Transfer [47.61502999819699]
We propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model.
We evaluate our novel latent translation-based model on a series of multilingual NLU tasks.
We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average.
arXiv Detail & Related papers (2021-07-23T17:11:27Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked language modeling (MLM) to sequence-to-sequence architectures by masking parts of the input and reconstructing them in the decoder (a minimal sketch of this denoising setup follows the related-papers list).
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- On Losses for Modern Language Models [18.56205816291398]
We show that next sentence prediction (NSP) is detrimental to training due to its context splitting and shallow semantic signal.
Using multiple tasks in a multi-task pre-training framework provides better results than using any single auxiliary task.
arXiv Detail & Related papers (2020-10-04T21:44:15Z)
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
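Several of the related papers above, notably the MLM-to-Seq2Seq adaptation and mBART, build on the same basic denoising recipe: corrupt a monolingual sentence on the encoder side and train the decoder to reconstruct the original. The sketch below shows how such a pretraining example could be assembled; the mask token and corruption scheme are illustrative assumptions, not mBART's exact text-infilling procedure.

```python
import random


def make_denoising_example(tokens, mask_token="<mask>", mask_ratio=0.3, seed=None):
    """Build one Seq2Seq denoising example from a monolingual sentence:
    the encoder input is a corrupted copy, the decoder target is the original."""
    rng = random.Random(seed)
    corrupted = [mask_token if rng.random() < mask_ratio else tok for tok in tokens]
    return {"encoder_input": corrupted, "decoder_target": list(tokens)}


# One monolingual sentence becomes a (corrupted -> original) training pair.
sentence = ["the", "model", "learns", "to", "reconstruct", "the", "original", "sentence"]
example = make_denoising_example(sentence, seed=0)
print(example["encoder_input"])   # corrupted copy with some tokens replaced by <mask>
print(example["decoder_target"])  # the uncorrupted original sentence
```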