ResiDual: Transformer with Dual Residual Connections
- URL: http://arxiv.org/abs/2304.14802v1
- Date: Fri, 28 Apr 2023 12:19:47 GMT
- Title: ResiDual: Transformer with Dual Residual Connections
- Authors: Shufang Xie, Huishuai Zhang, Junliang Guo, Xu Tan, Jiang Bian, Hany
Hassan Awadalla, Arul Menezes, Tao Qin, Rui Yan
- Abstract summary: Two widely used variants are the Post-Layer-Normalization (Post-LN) and Pre-Layer-Normalization (Pre-LN) Transformers.
Post-LN causes a gradient vanishing issue that hinders the training of deep Transformers, and Pre-LN causes a representation collapse issue that limits model capacity.
We propose ResiDual, a novel Transformer architecture with Pre-Post-LN (PPLN), which fuses the connections of Post-LN and Pre-LN together.
- Score: 106.38073506751003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer networks have become the preferred architecture for many tasks
due to their state-of-the-art performance. However, the optimal way to
implement residual connections in the Transformer, which are essential for
effective training, is still debated. Two widely used variants are the
Post-Layer-Normalization (Post-LN) and Pre-Layer-Normalization (Pre-LN)
Transformers, which apply layer normalization after each residual block's
output or before each residual block's input, respectively. While both variants
enjoy their advantages, they also suffer from severe limitations: Post-LN
causes a gradient vanishing issue that hinders the training of deep Transformers, and
Pre-LN causes a representation collapse issue that limits model capacity. In this
paper, we propose ResiDual, a novel Transformer architecture with Pre-Post-LN
(PPLN), which fuses the connections of Post-LN and Pre-LN and inherits
their advantages while avoiding their limitations. We conduct both theoretical
analyses and empirical experiments to verify the effectiveness of ResiDual.
Theoretically, we prove that ResiDual has a lower bound on the gradient to
avoid the vanishing issue due to the residual connection from Pre-LN. Moreover,
ResiDual also has diverse model representations to avoid the collapse issue due
to the residual connection from Post-LN. Empirically, ResiDual outperforms both
Post-LN and Pre-LN on several machine translation benchmarks across different
network depths and data sizes. Thanks to its good theoretical and empirical
performance, the ResiDual Transformer can serve as a foundation architecture for
different AI models (e.g., large language models). Our code is available at
https://github.com/microsoft/ResiDual.
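Below is a minimal, hypothetical sketch of how a Pre-Post-LN (PPLN) block could maintain the two residual streams described in the abstract: a Post-LN-style stream that is normalized after every residual addition, and a Pre-LN-style stream that accumulates sub-layer outputs without normalization. The class names and the final fusion step are assumptions, the feed-forward sub-layer, masking, and dropout are omitted for brevity, and the authors' actual implementation lives at https://github.com/microsoft/ResiDual.
```python
# Hypothetical sketch of a ResiDual-style (Pre-Post-LN) encoder, not the
# authors' released code. Names and the final fusion step are assumptions.
import torch
import torch.nn as nn

class PPLNBlock(nn.Module):
    """One attention sub-block with dual residual streams (FFN omitted for brevity)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x_post: torch.Tensor, x_pre: torch.Tensor):
        # Sub-layer output is computed from the Post-LN stream.
        out, _ = self.attn(x_post, x_post, x_post, need_weights=False)
        x_post = self.ln(x_post + out)  # Post-LN: normalize after the residual add
        x_pre = x_pre + out             # Pre-LN-style stream: accumulate, no LN
        return x_post, x_pre

class ResiDualEncoder(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            [PPLNBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.final_ln = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_post, x_pre = x, x
        for layer in self.layers:
            x_post, x_pre = layer(x_post, x_pre)
        # Assumed fusion: the un-normalized Pre-LN stream preserves gradient flow,
        # while the Post-LN stream keeps per-layer representations diverse.
        return x_post + self.final_ln(x_pre)

# Usage sketch: encoder = ResiDualEncoder(); y = encoder(torch.randn(2, 16, 512))
```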
Related papers
- Were RNNs All We Needed? [53.393497486332]
We revisit traditional recurrent neural networks (RNNs) from over a decade ago.
We show that by removing the hidden-state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer need backpropagation through time (BPTT) and can be trained efficiently in parallel.
arXiv Detail & Related papers (2024-10-02T03:06:49Z) - Practical Computational Power of Linear Transformers and Their Recurrent
and Self-Referential Extensions [15.793406740545024]
We study auto-regressive Transformers with linearised attention, a.k.a. linear Transformers (LTs) or Fast Weight Programmers (FWPs).
LTs are special in that they are equivalent to RNN-like sequence processors with a fixed-size state, while they can also be expressed as the now-popular self-attention networks (a minimal recurrent formulation is sketched after this list).
arXiv Detail & Related papers (2023-10-24T17:17:01Z) - Tangent Transformers for Composition, Privacy and Removal [58.280295030852194]
Tangent Attention Fine-Tuning (TAFT) is a method for fine-tuning linearized transformers.
arXiv Detail & Related papers (2023-07-16T18:31:25Z) - Re^2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal
Action Localization [65.33914980022303]
Temporal action localization (TAL) requires long-form reasoning to predict actions of various durations and complex content.
Most methods can only train on pre-extracted features without optimizing them for the localization problem.
We propose a novel end-to-end method Re2TAL, which rewires pretrained video backbones for reversible TAL.
arXiv Detail & Related papers (2022-11-25T12:17:30Z) - Characterization of anomalous diffusion through convolutional
transformers [0.8984888893275713]
We propose a new transformer-based neural network architecture for the characterization of anomalous diffusion.
Our new architecture, the Convolutional Transformer (ConvTransformer), uses a bi-layered convolutional neural network to extract features from our diffusive trajectories.
We show that the ConvTransformer is able to outperform the previous state of the art at determining the underlying diffusive regime in short trajectories.
arXiv Detail & Related papers (2022-10-10T18:53:13Z) - Unified Normalization for Accelerating and Stabilizing Transformers [35.07454490355906]
Layer Normalization (LN) normalizes activations within each token to boost robustness.
LN requires on-the-fly statistics calculation at inference time, as well as division and square-root operations.
We propose Unified Normalization (UN), which can speed up the inference by being fused with other linear operations.
arXiv Detail & Related papers (2022-08-02T08:41:31Z) - B2T Connection: Serving Stability and Performance in Deep Transformers [40.44674210101826]
Recent Transformers tend to use Pre-LN because training deep Post-LN Transformers is often unstable, resulting in useless models.
Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers.
We propose a method that can provide both high stability and effective training by a simple modification of Post-LN.
arXiv Detail & Related papers (2022-06-01T08:43:20Z) - Learning Bounded Context-Free-Grammar via LSTM and the
Transformer:Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z) - ENCONTER: Entity Constrained Progressive Sequence Generation via
Insertion-based Transformer [11.310502327308575]
Autoregressive language models do not perform well under hard lexical constraints.
Progressive insertion-based transformers can overcome this limitation.
The paper proposes the Entity-constrained insertion transformer (ENCONTER).
Our experiments show that ENCONTER outperforms other baseline models in several performance metrics.
arXiv Detail & Related papers (2021-03-17T10:24:10Z) - Neural Networks are Convex Regularizers: Exact Polynomial-time Convex
Optimization Formulations for Two-layer Networks [70.15611146583068]
We develop exact representations of training two-layer neural networks with rectified linear units (ReLUs).
Our theory utilizes semi-infinite duality and minimum norm regularization.
arXiv Detail & Related papers (2020-02-24T21:32:41Z)
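As noted in the linear-Transformer entry above, the claimed equivalence to an RNN-like sequence processor with a fixed-size state can be made concrete with a short recurrence. The sketch below is a generic illustration using the elu(x)+1 feature map common in the linear-attention literature; the function names are assumptions, not code from that paper.
```python
# Illustrative recurrence for causal linear attention (fixed-size state),
# not taken from the cited paper's codebase.
import torch
import torch.nn.functional as F

def phi(x: torch.Tensor) -> torch.Tensor:
    # Positive feature map commonly used for linear attention: elu(x) + 1.
    return F.elu(x) + 1.0

def linear_attention_recurrent(q, k, v):
    """Compute causal linear attention step by step, like an RNN.

    q, k: (seq_len, d_k); v: (seq_len, d_v). Returns (seq_len, d_v).
    The state is a d_k x d_v matrix S plus a d_k normalizer z, whose sizes
    are independent of the sequence length.
    """
    seq_len, d_v = v.shape
    d_k = k.shape[1]
    S = torch.zeros(d_k, d_v)   # "fast weight" matrix: sum of phi(k_t) v_t^T
    z = torch.zeros(d_k)        # normalizer: sum of phi(k_t)
    outputs = []
    for t in range(seq_len):
        S = S + torch.outer(phi(k[t]), v[t])
        z = z + phi(k[t])
        outputs.append((phi(q[t]) @ S) / (phi(q[t]) @ z + 1e-6))
    return torch.stack(outputs)

# Usage sketch:
# q = k = v = torch.randn(10, 8); out = linear_attention_recurrent(q, k, v)
```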
This list is automatically generated from the titles and abstracts of the papers in this site.