Rewiring the Transformer with Depth-Wise LSTMs
- URL: http://arxiv.org/abs/2007.06257v2
- Date: Thu, 4 Apr 2024 07:17:11 GMT
- Title: Rewiring the Transformer with Depth-Wise LSTMs
- Authors: Hongfei Xu, Yang Song, Qiuhui Liu, Josef van Genabith, Deyi Xiong
- Abstract summary: We present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers.
Experiments with the 6-layer Transformer show significant BLEU improvements on both the WMT 14 English-German and English-French tasks and on the OPUS-100 many-to-many multilingual NMT task.
- Score: 55.50278212605607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stacking non-linear layers allows deep neural networks to model complicated functions, and including residual connections in Transformer layers is beneficial for convergence and performance. However, residual connections may make the model "forget" distant layers and fail to fuse information from previous layers effectively. Selectively managing the representation aggregation of Transformer layers may lead to better performance. In this paper, we present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers. We show that layer normalization and feed-forward computation within a Transformer layer can be absorbed into depth-wise LSTMs connecting pure Transformer attention layers. Our experiments with the 6-layer Transformer show significant BLEU improvements on both the WMT 14 English-German and English-French tasks and on the OPUS-100 many-to-many multilingual NMT task, and our deep Transformer experiments demonstrate the effectiveness of depth-wise LSTMs for the convergence and performance of deep Transformers.
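As a rough illustration of the rewiring described in the abstract, the sketch below replaces the residual connections of a stack of attention sub-layers with an LSTM cell that steps across depth: each sub-layer's output is the LSTM input at that depth, and the LSTM hidden state is the representation handed to the next sub-layer. This is a minimal sketch under assumptions, not the authors' released implementation; the class name DepthWiseLSTMEncoder, the single shared nn.LSTMCell, and the default sizes (d_model=512, n_heads=8, n_layers=6) are illustrative choices.

```python
import torch
import torch.nn as nn


class DepthWiseLSTMEncoder(nn.Module):
    """Toy encoder: attention sub-layers wired together by an LSTM over depth (sketch)."""

    def __init__(self, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        # A single LSTM cell stepped across layer depth; it replaces the usual
        # residual connection and, in the spirit of the paper, takes over the role
        # of layer normalization / feed-forward computation between layers.
        self.depth_lstm = nn.LSTMCell(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        b, s, d = x.shape
        h = x.reshape(b * s, d)       # LSTM hidden state, one "token" per row
        c = torch.zeros_like(h)       # LSTM cell state
        for attn in self.attn_layers:
            y = self.norm(h.reshape(b, s, d))
            y, _ = attn(y, y, y, need_weights=False)
            # Depth-wise step: the attention output is the LSTM input at this
            # depth; the updated hidden state is passed to the next layer.
            h, c = self.depth_lstm(y.reshape(b * s, d), (h, c))
        return h.reshape(b, s, d)


if __name__ == "__main__":
    enc = DepthWiseLSTMEncoder()
    out = enc(torch.randn(2, 10, 512))
    print(out.shape)  # torch.Size([2, 10, 512])
```

In this framing, the LSTM gates decide how much of each layer's output to keep, rather than the unconditional summation performed by a residual connection, which matches the abstract's goal of selectively managing representation aggregation across layers.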
Related papers
- Efficient Visual Transformer by Learnable Token Merging [8.905020033545643]
We propose a novel transformer block, Transformer with Learnable Token Merging (LTM), or LTM-Transformer.
LTM-Transformer is compatible with many popular and compact transformer networks.
It renders compact and efficient visual transformers with comparable or much better prediction accuracy than the original visual transformers.
arXiv Detail & Related papers (2024-07-21T17:09:19Z)
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
- A Neural ODE Interpretation of Transformer Layers [8.839601328192957]
Transformer layers, which use an alternating pattern of multi-head attention and multi-layer perceptron (MLP) layers, provide an effective tool for a variety of machine learning problems.
We build upon this connection and propose a modification of the internal architecture of a transformer layer.
Our experiments show that this simple modification improves the performance of transformer networks in multiple tasks.
arXiv Detail & Related papers (2022-12-12T16:18:58Z)
- GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation [107.2752114891855]
The Transformer architecture, which stacks a sequence of encoder and decoder layers, has driven significant progress in neural machine translation.
We propose the Group-Transformer model (GTrans) that flexibly divides multi-layer representations of both encoder and decoder into different groups and then fuses these group features to generate target words.
arXiv Detail & Related papers (2022-07-29T04:10:36Z)
- Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation to build a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z)
- Scalable Transformers for Neural Machine Translation [86.4530299266897]
The Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and its parallelizable training for sequence generation.
We propose novel scalable Transformers, which naturally contain sub-Transformers of different scales with shared parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
- Multi-Pass Transformer for Machine Translation [51.867982400693194]
We consider a multi-pass transformer (MPT) architecture in which earlier layers are allowed to process information in light of the output of later layers.
MPT can surpass the performance of the Large Transformer on the challenging En-De and En-Fr machine translation datasets.
In the hard connection case, the optimal connection pattern found for En-De also leads to improved performance for En-Fr.
arXiv Detail & Related papers (2020-09-23T21:22:15Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)