ResiDual: Transformer with Dual Residual Connections
- URL: http://arxiv.org/abs/2304.14802v1
- Date: Fri, 28 Apr 2023 12:19:47 GMT
- Title: ResiDual: Transformer with Dual Residual Connections
- Authors: Shufang Xie, Huishuai Zhang, Junliang Guo, Xu Tan, Jiang Bian, Hany
Hassan Awadalla, Arul Menezes, Tao Qin, Rui Yan
- Abstract summary: Two widely used variants are the Post-Layer-Normalization (Post-LN) and Pre-Layer-Normalization (Pre-LN) Transformers.
Post-LN causes a gradient-vanishing issue that hinders training deep Transformers, and Pre-LN causes a representation-collapse issue that limits model capacity.
We propose ResiDual, a novel Transformer architecture with Pre-Post-LN (PPLN), which fuses the residual connections of Post-LN and Pre-LN together.
- Score: 106.38073506751003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer networks have become the preferred architecture for many tasks
due to their state-of-the-art performance. However, the optimal way to
implement residual connections in the Transformer, which are essential for
effective training, is still debated. Two widely used variants are the
Post-Layer-Normalization (Post-LN) and Pre-Layer-Normalization (Pre-LN)
Transformers, which apply layer normalization after each residual block's
output or before each residual block's input, respectively. While both variants
enjoy their advantages, they also suffer from severe limitations: Post-LN
causes a gradient-vanishing issue that hinders training deep Transformers, and
Pre-LN causes a representation-collapse issue that limits model capacity. In this
paper, we propose ResiDual, a novel Transformer architecture with Pre-Post-LN
(PPLN), which fuses the connections of Post-LN and Pre-LN together and inherits
their advantages while avoiding their limitations. We conduct both theoretical
analyses and empirical experiments to verify the effectiveness of ResiDual.
Theoretically, we prove that the gradient of ResiDual has a lower bound, which
avoids the vanishing issue, thanks to the residual connection inherited from
Pre-LN. Moreover, ResiDual maintains diverse model representations, which avoids
the collapse issue, thanks to the residual connection inherited from Post-LN.
Empirically, ResiDual outperforms both
Post-LN and Pre-LN on several machine translation benchmarks across different
network depths and data sizes. Thanks to its good theoretical and empirical
performance, the ResiDual Transformer can serve as a foundation architecture for
different AI models (e.g., large language models). Our code is available at
https://github.com/microsoft/ResiDual.
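To make the residual schemes concrete, here is a minimal PyTorch sketch of the dual-stream update as we read it from the abstract: `x_post` follows the Post-LN rule (LayerNorm after each residual addition), `x_pre` accumulates raw sub-layer outputs as in Pre-LN, and the two streams are fused at the output. The attention-only sub-layers, the class names, and the exact fusion step are simplifying assumptions, not the official implementation; see the repository linked above for the reference code.

```python
import torch
import torch.nn as nn


class ResiDualBlock(nn.Module):
    """One attention-only sub-layer with dual residual streams (Pre-Post-LN).

    Sketch of the abstract's description: `x_post` follows the Post-LN rule
    (LayerNorm *after* the residual add), while `x_pre` accumulates the raw
    sub-layer outputs as a Pre-LN-style residual stream.
    """

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x_post: torch.Tensor, x_pre: torch.Tensor):
        out, _ = self.attn(x_post, x_post, x_post)  # sub-layer reads the Post-LN stream
        x_post = self.norm(x_post + out)            # Post-LN: normalize after the add
        x_pre = x_pre + out                         # Pre-LN-style raw accumulation
        return x_post, x_pre


class ResiDualEncoder(nn.Module):
    """Stack of ResiDual blocks; the two streams are fused at the output."""

    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.blocks = nn.ModuleList([ResiDualBlock(d_model) for _ in range(n_layers)])
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_post, x_pre = x, x
        for block in self.blocks:
            x_post, x_pre = block(x_post, x_pre)
        # Assumed fusion: Post-LN output plus the normalized Pre-LN accumulator.
        return x_post + self.final_norm(x_pre)


if __name__ == "__main__":
    enc = ResiDualEncoder(d_model=64, n_layers=6)
    print(enc(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```

In this sketch, the Post-LN stream keeps the per-layer normalization that the abstract links to diverse representations, while the Pre-LN-style accumulator preserves an un-normalized path, which the abstract links to a lower-bounded gradient.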
Related papers
- Peri-LN: Revisiting Layer Normalization in the Transformer Architecture [57.08322913112157]
Pre-LN and Post-LN have long dominated standard practices despite their limitations in large-scale training.
Several open-source large-scale models have recently begun silently adopting a third strategy without much explanation.
We show that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and more stable convergence (a sketch of the presumed placement follows this entry).
arXiv Detail & Related papers (2025-02-04T21:29:47Z)
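The summary does not spell out where the normalization is placed. One reading of a "peripheral" placement, consistent with recent open models that normalize both the input and the output of each sub-layer inside the residual branch, is sketched below in PyTorch; the class name, the attention-only sub-layer, and the update rule are hypothetical illustrations, not code from the Peri-LN paper.

```python
import torch
import torch.nn as nn


class PeriLNBlock(nn.Module):
    """Hypothetical peri-LN sub-layer: normalize both module input and output.

    Assumed update (our reading, not the paper's code):
        x_{l+1} = x_l + LN_out(F(LN_in(x_l)))
    """

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_in = nn.LayerNorm(d_model)   # Pre-LN-style input normalization
        self.norm_out = nn.LayerNorm(d_model)  # additional output normalization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm_in(x)
        out, _ = self.attn(h, h, h)
        return x + self.norm_out(out)  # the residual add itself stays un-normalized
```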
- Were RNNs All We Needed? [55.822693848969855]
In this work, we revisit sequence modelling from a historical perspective, focusing on Recurrent Neural Networks (RNNs).
We demonstrate that by simplifying these models, we can derive minimal versions (minLSTMs and minGRUs) that use fewer parameters than their traditional counterparts, are fully parallelizable during training, and achieve surprisingly competitive performance on a range of tasks, rivalling recent models including Transformers.
arXiv Detail & Related papers (2024-10-02T03:06:49Z)
- Practical Computational Power of Linear Transformers and Their Recurrent and Self-Referential Extensions [15.793406740545024]
We study auto-regressive Transformers with linearised attention, a.k.a. linear Transformers (LTs) or Fast Weight Programmers (FWPs).
LTs are special in the sense that they are equivalent to RNN-like sequence processors with a fixed-size state, while they can also be expressed as the now-popular self-attention networks.
arXiv Detail & Related papers (2023-10-24T17:17:01Z)
- Re^2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization [65.33914980022303]
Temporal action localization (TAL) requires long-form reasoning to predict actions of various durations and complex content.
Most methods can only train on pre-extracted features without optimizing them for the localization problem.
We propose a novel end-to-end method Re2TAL, which rewires pretrained video backbones for reversible TAL.
arXiv Detail & Related papers (2022-11-25T12:17:30Z)
- Characterization of anomalous diffusion through convolutional transformers [0.8984888893275713]
We propose a new transformer-based neural network architecture for the characterization of anomalous diffusion.
Our new architecture, the Convolutional Transformer (ConvTransformer), uses a bi-layered convolutional neural network to extract features from our diffusive trajectories.
We show that the ConvTransformer is able to outperform the previous state of the art at determining the underlying diffusive regime in short trajectories.
arXiv Detail & Related papers (2022-10-10T18:53:13Z)
- Unified Normalization for Accelerating and Stabilizing Transformers [35.07454490355906]
Layer Normalization (LN) normalizes activations within each token to boost robustness.
LN requires on-the-fly statistics calculation at inference time, as well as division and square-root operations.
We propose Unified Normalization (UN), which can speed up inference by being fused with other linear operations (a fusion sketch follows this entry).
arXiv Detail & Related papers (2022-08-02T08:41:31Z)
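To illustrate why fixed (offline) statistics make such fusion possible, the sketch below folds an affine normalization with frozen mean and variance into the following linear layer. UN's actual statistic estimation and training scheme differ; the names `mean`, `var`, `gamma`, `beta`, and `fused` are illustrative only.

```python
import torch
import torch.nn as nn

# Hypothetical illustration: once normalization statistics are fixed (no
# on-the-fly mean/variance at inference), normalization is an affine map
# and can be folded into the next linear layer.

d = 8
mean = torch.randn(d)              # fixed (offline) statistics, assumed known
var = torch.rand(d) + 0.5
gamma, beta = torch.randn(d), torch.randn(d)
eps = 1e-5

linear = nn.Linear(d, d)

def norm_then_linear(x: torch.Tensor) -> torch.Tensor:
    """Reference path: explicit normalization followed by the linear layer."""
    x_hat = gamma * (x - mean) / torch.sqrt(var + eps) + beta
    return linear(x_hat)

# Fold the affine normalization into the linear layer's weight and bias.
scale = gamma / torch.sqrt(var + eps)                  # per-feature scale
fused = nn.Linear(d, d)
with torch.no_grad():
    fused.weight.copy_(linear.weight * scale)          # W' = W @ diag(scale)
    fused.bias.copy_(linear.bias + linear.weight @ (beta - scale * mean))

x = torch.randn(4, d)
print(torch.allclose(norm_then_linear(x), fused(x), atol=1e-5))  # True
```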
- B2T Connection: Serving Stability and Performance in Deep Transformers [40.44674210101826]
Recent Transformers tend to use Pre-LN because training deep Post-LN Transformers is often unstable, resulting in useless models.
However, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers.
We propose a method that can provide both high stability and effective training by a simple modification of Post-LN.
arXiv Detail & Related papers (2022-06-01T08:43:20Z)
- Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z)
- ENCONTER: Entity Constrained Progressive Sequence Generation via Insertion-based Transformer [11.310502327308575]
Autoregressive language models do not perform well under hard lexical constraints.
Progressive insertion-based transformers can overcome this limitation.
The paper proposes the Entity-constrained insertion transformer (ENCONTER).
Our experiments show that ENCONTER outperforms other baseline models in several performance metrics.
arXiv Detail & Related papers (2021-03-17T10:24:10Z)
- Neural Networks are Convex Regularizers: Exact Polynomial-time Convex Optimization Formulations for Two-layer Networks [70.15611146583068]
We develop exact convex optimization representations of training two-layer neural networks with rectified linear units (ReLUs).
Our theory utilizes semi-infinite duality and minimum norm regularization.
arXiv Detail & Related papers (2020-02-24T21:32:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.