Shallow-to-Deep Training for Neural Machine Translation
- URL: http://arxiv.org/abs/2010.03737v1
- Date: Thu, 8 Oct 2020 02:36:07 GMT
- Title: Shallow-to-Deep Training for Neural Machine Translation
- Authors: Bei Li, Ziyang Wang, Hui Liu, Yufan Jiang, Quan Du, Tong Xiao, Huizhen
Wang and Jingbo Zhu
- Abstract summary: In this paper, we investigate the behavior of a well-tuned deep Transformer system.
We find that stacking layers is helpful in improving the representation ability of NMT models.
This inspires us to develop a shallow-to-deep training method that learns deep models by stacking shallow models.
- Score: 42.62107851930165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep encoders have been proven to be effective in improving neural machine
translation (NMT) systems, but training an extremely deep encoder is time
consuming. Moreover, why deep models help NMT is an open question. In this
paper, we investigate the behavior of a well-tuned deep Transformer system. We
find that stacking layers is helpful in improving the representation ability of
NMT models, and that adjacent layers perform similarly. This inspires us to develop a
shallow-to-deep training method that learns deep models by stacking shallow
models. In this way, we successfully train a Transformer system with a 54-layer
encoder. Experimental results on WMT'16 English-German and WMT'14
English-French translation tasks show that it is $1.4\times$ faster than
training from scratch and achieves BLEU scores of $30.33$ and $43.29$ on the
two tasks. The code is publicly available at
https://github.com/libeineu/SDT-Training/.
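At a high level, the shallow-to-deep recipe can be read as progressive stacking: train a shallow encoder for some steps, duplicate its trained layers to warm-start a deeper encoder, and continue training until the target depth is reached. The PyTorch sketch below illustrates only this stacking initialization; the names Encoder and grow_encoder and the copy-the-whole-stack growth rule are assumptions for illustration, not the released SDT-Training implementation (see the repository linked above).
```python
# Illustrative sketch only: progressive stacking to warm-start a deeper encoder.
import copy
import torch.nn as nn

class Encoder(nn.Module):
    """A plain Transformer encoder: a stack of identical layers."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, src_key_padding_mask=mask)
        return x

def grow_encoder(shallow: Encoder, copies: int = 2) -> Encoder:
    """Build a deeper encoder by stacking copies of a trained shallow one.

    With copies=2, a trained 6-layer encoder yields a 12-layer encoder whose
    upper half duplicates the trained lower half; training then resumes from
    this warm start instead of from random initialization.
    """
    stacked = []
    for _ in range(copies):
        stacked.extend(copy.deepcopy(layer) for layer in shallow.layers)
    return Encoder(stacked)

# Usage: train shallow, grow, continue training, and repeat to the target depth.
base = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
shallow = Encoder([copy.deepcopy(base) for _ in range(6)])
# ... train `shallow` for some steps ...
deeper = grow_encoder(shallow, copies=2)  # 12 layers, warm-started
```
Repeating the grow-and-train step reaches depths such as the 54-layer encoder reported in the abstract, while each intermediate model starts from trained weights rather than random initialization.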
Related papers
- The NiuTrans System for WNGT 2020 Efficiency Task [32.88733142090084]
This paper describes the submissions of the NiuTrans Team to the WNGT 2020 Efficiency Shared Task.
We focus on the efficient implementation of deep Transformer models using NiuTensor, a flexible toolkit for NLP tasks.
arXiv Detail & Related papers (2021-09-16T14:32:01Z)
- Efficient Inference for Multilingual Neural Machine Translation [60.10996883354372]
We consider several ways to make multilingual NMT faster at inference without degrading its quality.
Our experiments demonstrate that combining a shallow decoder with vocabulary filtering more than doubles inference speed with no loss in translation quality.
arXiv Detail & Related papers (2021-09-14T13:28:13Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Dynamic Multi-Branch Layers for On-Device Neural Machine Translation [53.637479651600586]
We propose to improve the performance of on-device neural machine translation (NMT) systems with dynamic multi-branch layers.
Specifically, we design a layer-wise dynamic multi-branch network with only one branch activated during training and inference.
At almost the same computational cost, our method achieves improvements of up to 1.7 BLEU points on the WMT14 English-German translation task and 1.8 BLEU points on the WMT20 Chinese-English translation task.
arXiv Detail & Related papers (2021-05-14T07:32:53Z)
- Learning Light-Weight Translation Models from Deep Transformer [25.386460662408773]
We propose a novel group-permutation based knowledge distillation approach to compressing the deep Transformer model into a shallow model.
Our compressed model is 8X shallower than the deep model, with almost no loss in BLEU.
To further enhance the teacher model, we present a Skipping Sub-Layer method that randomly omits sub-layers to introduce perturbation into training (see the sketch after this list).
arXiv Detail & Related papers (2020-12-27T05:33:21Z)
- Very Deep Transformers for Neural Machine Translation [100.51465892354234]
We show that it is feasible to build standard Transformer-based models with up to 60 encoder layers and 12 decoder layers.
These deep models outperform their baseline 6-layer counterparts by as much as 2.5 BLEU.
arXiv Detail & Related papers (2020-08-18T07:14:54Z)
- Norm-Based Curriculum Learning for Neural Machine Translation [45.37588885850862]
A neural machine translation (NMT) system is expensive to train, especially in high-resource settings.
In this paper, we aim to improve the efficiency of training an NMT by introducing a novel norm-based curriculum learning method.
The proposed method outperforms strong baselines in terms of BLEU score (+1.17/+1.56) and training speedup (2.22x/3.33x).
arXiv Detail & Related papers (2020-06-03T02:22:00Z)
- Multiscale Collaborative Deep Models for Neural Machine Translation [40.52423993051359]
We present a MultiScale Collaborative (MSC) framework to ease the training of NMT models that are substantially deeper than those used previously.
We explicitly boost the gradient back-propagation from top to bottom levels by introducing a block-scale collaboration mechanism into deep NMT models.
Our deep MSC achieves a BLEU score of 30.56 on the WMT14 English-German task, significantly outperforming state-of-the-art deep NMT models.
arXiv Detail & Related papers (2020-04-29T08:36:08Z)
- Neural Machine Translation: Challenges, Progress and Future [62.75523637241876]
Machine translation (MT) is a technique that leverages computers to translate human languages automatically.
Neural machine translation (NMT) models the direct mapping between source and target languages with deep neural networks.
This article reviews the NMT framework, discusses the challenges in NMT, and introduces some exciting recent progress.
arXiv Detail & Related papers (2020-04-13T07:53:57Z)
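The Skipping Sub-Layer method mentioned in the "Learning Light-Weight Translation Models from Deep Transformer" entry above amounts to stochastically bypassing residual sub-layers during training. The sketch below illustrates that idea; the class name SkippableSubLayer, the pre-norm residual layout, and the skip probability of 0.2 are assumptions for illustration, not that paper's implementation.
```python
import torch
import torch.nn as nn

class SkippableSubLayer(nn.Module):
    """Residual sub-layer (pre-norm) that is randomly skipped during training."""
    def __init__(self, sublayer: nn.Module, d_model: int, p_skip: float = 0.2):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.p_skip = p_skip

    def forward(self, x):
        # With probability p_skip (training only), drop the sub-layer entirely
        # and let the identity path carry the representation unchanged.
        if self.training and torch.rand(1).item() < self.p_skip:
            return x
        return x + self.sublayer(self.norm(x))

# Example: a feed-forward sub-layer skipped roughly 20% of the time.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = SkippableSubLayer(ffn, d_model=512, p_skip=0.2)
out = block(torch.randn(8, 10, 512))  # (batch, length, d_model)
```
A complete encoder layer would wrap both its self-attention and feed-forward sub-layers this way and pass attention masks through; this sketch only shows the stochastic skipping itself.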