Recurrent multiple shared layers in Depth for Neural Machine Translation
- URL: http://arxiv.org/abs/2108.10417v2
- Date: Thu, 26 Aug 2021 13:32:50 GMT
- Title: Recurrent multiple shared layers in Depth for Neural Machine Translation
- Authors: GuoLiang Li and Yiyang Li
- Abstract summary: We propose to train a deeper model with a recurrent mechanism, which loops the encoder and decoder blocks of the Transformer in the depth direction.
Compared to a deep Transformer (20-layer encoder, 6-layer decoder), our model achieves similar performance and inference speed with only 54.72% of its parameters.
- Score: 11.660776324473645
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Learning deeper models is usually a simple and effective way to improve
model performance, but deeper models have more parameters and are more
difficult to train. Simply stacking more layers seems like an easy route to a
deeper model, but previous work has claimed that stacking alone does not
benefit the model. We propose to train a deeper model with a recurrent
mechanism, which loops the encoder and decoder blocks of the Transformer in
the depth direction. To keep the number of parameters from growing, we share
parameters across the different recursive moments. We conduct experiments on
the WMT16 English-to-German and WMT14 English-to-French translation tasks; our
model outperforms the shallow Transformer-Base/Big baselines by 0.35 and 1.45
BLEU points while using only 27.23% of the Transformer-Big parameters. Compared
to a deep Transformer (20-layer encoder, 6-layer decoder), our model achieves
similar performance and inference speed with 54.72% of its parameters.
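The looped, parameter-shared encoder described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' code: the class name, hyperparameters, and the use of torch.nn building blocks are assumptions for illustration; the point is only that one small block of layers is applied repeatedly in the depth direction, so the effective depth grows while the parameter count stays that of a single block.

```python
# Minimal sketch (not the authors' code) of looping a parameter-shared block of
# Transformer encoder layers in the depth direction. Class name and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class RecurrentSharedEncoder(nn.Module):
    """Applies the same `num_layers`-layer encoder block `num_loops` times,
    so the effective depth is num_layers * num_loops while the parameter
    count stays that of a single block."""

    def __init__(self, d_model=512, nhead=8, num_layers=6, num_loops=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=2048,
                                           batch_first=True)
        # One set of block parameters, reused at every recursive moment.
        self.block = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.num_loops = num_loops

    def forward(self, x, src_key_padding_mask=None):
        for _ in range(self.num_loops):  # loop the block in the depth direction
            x = self.block(x, src_key_padding_mask=src_key_padding_mask)
        return x

# Usage: an effectively 18-layer-deep encoder with 6 layers' worth of parameters.
encoder = RecurrentSharedEncoder()
out = encoder(torch.randn(2, 10, 512))   # (batch, seq_len, d_model)
```

The decoder side would be looped analogously; the BLEU scores and parameter ratios quoted above refer to the authors' actual configuration, not to this sketch.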
Related papers
- Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
arXiv Detail & Related papers (2024-03-14T17:59:14Z)
- Understanding Parameter Sharing in Transformers [53.75988363281843]
Previous work on Transformers has focused on sharing parameters in different layers, which can improve the performance of models with limited parameters by increasing model depth.
We show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity.
Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.
arXiv Detail & Related papers (2023-06-15T10:48:59Z)
- Multi-Path Transformer is Better: A Case Study on Neural Machine Translation [35.67070351304121]
We study how model width affects the Transformer model through a parameter-efficient multi-path structure.
Experiments on 12 WMT machine translation tasks show that, with the same number of parameters, the shallower multi-path model can achieve similar or even better performance than the deeper model.
arXiv Detail & Related papers (2023-05-10T07:39:57Z)
- Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production [7.056223012587321]
We introduce a highly efficient inference framework with several optimization approaches to accelerate the computation of sparse models.
We are able to deploy 136x larger models with 27% less cost and significantly better quality compared to the existing solutions.
arXiv Detail & Related papers (2022-11-18T03:43:52Z)
- DeepNet: Scaling Transformers to 1,000 Layers [106.33669415337135]
We introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer (an illustrative sketch of this residual form appears after this list).
In-depth theoretical analysis shows that model updates can be bounded in a stable way.
We successfully scale Transformers up to 1,000 layers without difficulty, which is one order of magnitude deeper than previous deep Transformers.
arXiv Detail & Related papers (2022-03-01T15:36:38Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference with conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Go Wider Instead of Deeper [11.4541055228727]
We propose a framework to deploy trainable parameters efficiently, by going wider instead of deeper.
Our best model outperforms Vision Transformer (ViT) by 1.46% with 0.72x the trainable parameters.
Our framework can still surpass ViT and ViT-MoE by 0.83% and 2.08%, respectively.
arXiv Detail & Related papers (2021-07-25T14:44:24Z)
- Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves model quality while keeping computational cost constant, and our further exploration of extremely large-scale models shows that it is more effective for training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
- Very Deep Transformers for Neural Machine Translation [100.51465892354234]
We show that it is feasible to build standard Transformer-based models with up to 60 encoder layers and 12 decoder layers.
These deep models outperform their baseline 6-layer counterparts by as much as 2.5 BLEU.
arXiv Detail & Related papers (2020-08-18T07:14:54Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
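As referenced in the DeepNet entry above, the DeepNorm idea modifies the post-layer-norm residual connection by up-weighting the residual stream. The sketch below is an illustration, not the paper's released code: the class name is hypothetical, and the constant alpha is left as a parameter because the paper derives specific depth-dependent values for it (together with an initialization scaling beta) that are not reproduced here.

```python
# Hedged sketch of a DeepNorm-style residual connection (illustrative only; the
# exact alpha/beta constants come from the DeepNet paper and are not derived here).
import torch.nn as nn

class DeepNormResidual(nn.Module):
    """Wraps a sublayer F with the residual form x <- LayerNorm(alpha * x + F(x))."""

    def __init__(self, sublayer: nn.Module, d_model: int, alpha: float):
        super().__init__()
        self.sublayer = sublayer          # e.g. a self-attention or feed-forward block
        self.norm = nn.LayerNorm(d_model)
        self.alpha = alpha                # depth-dependent constant from the paper

    def forward(self, x):
        # Up-weighting the residual stream is what keeps model updates bounded
        # as the network grows deeper.
        return self.norm(self.alpha * x + self.sublayer(x))
```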