Very Deep Transformers for Neural Machine Translation
- URL: http://arxiv.org/abs/2008.07772v2
- Date: Wed, 14 Oct 2020 22:56:32 GMT
- Title: Very Deep Transformers for Neural Machine Translation
- Authors: Xiaodong Liu, Kevin Duh, Liyuan Liu and Jianfeng Gao
- Abstract summary: We show that it is feasible to build standard Transformer-based models with up to 60 encoder layers and 12 decoder layers.
These deep models outperform their baseline 6-layer counterparts by as much as 2.5 BLEU.
- Score: 100.51465892354234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the application of very deep Transformer models for Neural Machine
Translation (NMT). Using a simple yet effective initialization technique that
stabilizes training, we show that it is feasible to build standard
Transformer-based models with up to 60 encoder layers and 12 decoder layers.
These deep models outperform their baseline 6-layer counterparts by as much as
2.5 BLEU, and achieve new state-of-the-art benchmark results on WMT14
English-French (43.8 BLEU and 46.4 BLEU with back-translation) and WMT14
English-German (30.1 BLEU). The code and trained models will be publicly
available at: https://github.com/namisan/exdeep-nmt.
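As a rough orientation, the depth configuration described in the abstract can be instantiated from stock PyTorch modules. The sketch below is an illustration only: it uses nn.Transformer with default initialization and omits the paper's stabilizing initialization, which the abstract indicates is what makes this depth trainable in practice. It is not the authors' released code.

```python
# Minimal sketch (not the authors' implementation): a standard Transformer with
# the 60-layer encoder / 12-layer decoder depth reported in the abstract, built
# from stock PyTorch modules. At this depth, the paper's stabilizing
# initialization is needed; vanilla initialization may diverge during training.
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=60,   # very deep encoder
    num_decoder_layers=12,   # deeper-than-usual decoder
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True,
)

# Shape check with dummy source/target token embeddings.
src = torch.randn(2, 30, 512)   # (batch, src_len, d_model)
tgt = torch.randn(2, 20, 512)   # (batch, tgt_len, d_model)
out = model(src, tgt)
print(out.shape)  # torch.Size([2, 20, 512])
```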
Related papers
- GTrans: Grouping and Fusing Transformer Layers for Neural Machine
Translation [107.2752114891855]
The Transformer architecture, built by stacking encoder and decoder layers, has driven significant progress in neural machine translation.
We propose the Group-Transformer model (GTrans) that flexibly divides multi-layer representations of both encoder and decoder into different groups and then fuses these group features to generate target words.
arXiv Detail & Related papers (2022-07-29T04:10:36Z)
- DeepNet: Scaling Transformers to 1,000 Layers [106.33669415337135]
We introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer.
In-depth theoretical analysis shows that model updates can be bounded in a stable way.
We successfully scale Transformers up to 1,000 layers without difficulty, which is one order of magnitude deeper than previous deep Transformers.
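As a rough sketch of the residual pattern described above, DeepNorm up-weights the residual branch by a depth-dependent constant before layer normalization. The alpha value below follows the encoder-only prescription reported for DeepNet and should be read as an assumption rather than the authors' exact implementation; the paper also rescales sub-layer weights at initialization, which is omitted here.

```python
# Rough sketch of the DeepNorm residual pattern: x -> LayerNorm(alpha * x + sublayer(x)),
# with alpha growing with depth. The alpha formula below (encoder-only case) is an
# assumption for illustration, not the authors' exact code.
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    def __init__(self, sublayer: nn.Module, d_model: int, num_layers: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.alpha = (2 * num_layers) ** 0.25  # up-weights the residual path

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + self.sublayer(x))

# Example: wrap a feed-forward sub-layer for a hypothetical 1,000-layer encoder.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = DeepNormResidual(ffn, d_model=512, num_layers=1000)
print(block(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```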
arXiv Detail & Related papers (2022-03-01T15:36:38Z)
- Recurrent multiple shared layers in Depth for Neural Machine Translation [11.660776324473645]
We propose to train a deeper model with a recurrent mechanism that loops the encoder and decoder blocks of the Transformer in the depth direction.
Compared to a deep Transformer (20-layer encoder, 6-layer decoder), our model achieves similar performance and inference speed with only 54.72% of the parameters.
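A minimal sketch of the looping idea described above, assuming a small shared stack of standard encoder layers is reapplied several times; the loop count and layer sizes are illustrative, not the paper's configuration.

```python
# Sketch of parameter sharing across depth: a small stack of encoder layers is
# applied repeatedly, so effective depth grows while the parameter count stays
# that of the shared stack. Sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class RecurrentSharedEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, shared_layers=4, loops=5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048,
                                           batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=shared_layers)
        self.loops = loops  # effective depth = shared_layers * loops

    def forward(self, x):
        for _ in range(self.loops):
            x = self.shared(x)  # same weights reused at every loop
        return x

enc = RecurrentSharedEncoder()
print(enc(torch.randn(2, 30, 512)).shape)  # torch.Size([2, 30, 512])
```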
arXiv Detail & Related papers (2021-08-23T21:21:45Z)
- Regularizing Transformers With Deep Probabilistic Layers [62.997667081978825]
In this work, we demonstrate how incorporating deep generative models within BERT yields more versatile models.
We show its effectiveness not only in Transformers but also in the most relevant encoder-decoder based LMs, seq2seq models with and without attention.
arXiv Detail & Related papers (2021-08-23T10:17:02Z)
- Language Models are Good Translators [63.528370845657896]
We show that a single language model (LM4MT) can achieve comparable performance with strong encoder-decoder NMT models.
Experiments on pivot-based and zero-shot translation tasks show that LM4MT can outperform the encoder-decoder NMT model by a large margin.
arXiv Detail & Related papers (2021-06-25T13:30:29Z)
- Learning Light-Weight Translation Models from Deep Transformer [25.386460662408773]
We propose a novel group-permutation based knowledge distillation approach to compressing the deep Transformer model into a shallow model.
Our compressed model is 8X shallower than the deep model, with almost no loss in BLEU.
To further enhance the teacher model, we present a Skipping Sub-Layer method to randomly omit sub-layers to introduce perturbation into training.
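A minimal sketch of the random sub-layer omission described above, assuming a LayerDrop-style skip applied at training time; the skip probability and placement are illustrative assumptions, not the paper's exact Skipping Sub-Layer schedule.

```python
# Sketch of randomly omitting a sub-layer during training so the residual path
# carries the input through unchanged. Probability and placement are assumptions.
import torch
import torch.nn as nn

class SkippableSubLayer(nn.Module):
    def __init__(self, sublayer: nn.Module, d_model: int, p_skip: float = 0.2):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.p_skip = p_skip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # During training, occasionally bypass the sub-layer entirely.
        if self.training and torch.rand(()) < self.p_skip:
            return x
        return self.norm(x + self.sublayer(x))

ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
layer = SkippableSubLayer(ffn, d_model=512)
layer.train()
print(layer(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])
```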
arXiv Detail & Related papers (2020-12-27T05:33:21Z)
- Rethinking Document-level Neural Machine Translation [73.42052953710605]
We try to answer the question: Is the capacity of current models strong enough for document-level translation?
We observe that the original Transformer, with appropriate training techniques, can achieve strong results on document-level translation, even for documents as long as 2,000 words.
arXiv Detail & Related papers (2020-10-18T11:18:29Z)
- Shallow-to-Deep Training for Neural Machine Translation [42.62107851930165]
In this paper, we investigate the behavior of a well-tuned deep Transformer system.
We find that stacking layers is helpful in improving the representation ability of NMT models.
This inspires us to develop a shallow-to-deep training method that learns deep models by stacking shallow models.
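A minimal sketch of the stacking idea described above, assuming the shallow model's trained layers are simply duplicated to initialize a deeper stack; the actual growth schedule may differ from this illustration.

```python
# Sketch of shallow-to-deep growth: a trained shallow encoder's layers are
# copied to initialize a deeper encoder, which is then trained further.
# Straight duplication of the whole stack is an assumption for illustration.
import copy
import torch.nn as nn

def grow_encoder(shallow_layers: nn.ModuleList) -> nn.ModuleList:
    """Return a twice-as-deep layer list initialized from the shallow one."""
    grown = [copy.deepcopy(l) for l in shallow_layers]   # original stack
    grown += [copy.deepcopy(l) for l in shallow_layers]  # stacked copy on top
    return nn.ModuleList(grown)

# Example: grow a 6-layer encoder into a 12-layer encoder.
base = nn.ModuleList(
    nn.TransformerEncoderLayer(512, 8, batch_first=True) for _ in range(6)
)
deep = grow_encoder(base)
print(len(deep))  # 12
```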
arXiv Detail & Related papers (2020-10-08T02:36:07Z)
- Attention Is All You Need [36.87735219227719]
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms.
Experiments on two machine translation tasks show these models to be superior in quality.
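For reference, the core operation of the Transformer summarized above is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; a minimal sketch follows (not the full multi-head module).

```python
# Scaled dot-product attention as introduced with the Transformer.
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., q_len, k_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(2, 5, 64)
k = torch.randn(2, 7, 64)
v = torch.randn(2, 7, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 64])
```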
arXiv Detail & Related papers (2017-06-12T17:57:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.