ResiDual: Transformer with Dual Residual Connections
- URL: http://arxiv.org/abs/2304.14802v1
- Date: Fri, 28 Apr 2023 12:19:47 GMT
- Title: ResiDual: Transformer with Dual Residual Connections
- Authors: Shufang Xie, Huishuai Zhang, Junliang Guo, Xu Tan, Jiang Bian, Hany
Hassan Awadalla, Arul Menezes, Tao Qin, Rui Yan
- Abstract summary: Two widely used variants are the Post-Layer-Normalization (Post-LN) and Pre-Layer-Normalization (Pre-LN) Transformers.
Post-LN causes a gradient-vanishing issue that hinders training deep Transformers, and Pre-LN causes a representation-collapse issue that limits model capacity.
We propose ResiDual, a novel Transformer architecture with Pre-Post-LN (PPLN), which fuses the residual connections of Post-LN and Pre-LN together.
- Score: 106.38073506751003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer networks have become the preferred architecture for many tasks
due to their state-of-the-art performance. However, the optimal way to
implement residual connections in the Transformer, which are essential for
effective training, is still debated. Two widely used variants are the
Post-Layer-Normalization (Post-LN) and Pre-Layer-Normalization (Pre-LN)
Transformers, which apply layer normalization after each residual block's
output or before each residual block's input, respectively. While both variants
enjoy their advantages, they also suffer from severe limitations: Post-LN
causes a gradient-vanishing issue that hinders training deep Transformers, and
Pre-LN causes a representation-collapse issue that limits model capacity. In this
paper, we propose ResiDual, a novel Transformer architecture with Pre-Post-LN
(PPLN), which fuses the connections of Post-LN and Pre-LN together and inherits
their advantages while avoiding their limitations. We conduct both theoretical
analyses and empirical experiments to verify the effectiveness of ResiDual.
Theoretically, we prove that the gradient of ResiDual has a lower bound, which
avoids the vanishing issue, thanks to the residual connection inherited from
Pre-LN. Moreover, ResiDual maintains diverse model representations, which avoids
the collapse issue, thanks to the residual connection inherited from Post-LN.
Empirically, ResiDual outperforms both
Post-LN and Pre-LN on several machine translation benchmarks across different
network depths and data sizes. Thanks to its good theoretical and empirical
performance, the ResiDual Transformer can serve as a foundation architecture for
different AI models (e.g., large language models). Our code is available at
https://github.com/microsoft/ResiDual.
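To make the residual schemes concrete, here is a minimal PyTorch sketch of the dual-stream update as we read it from the abstract: `x_post` follows the Post-LN rule (LayerNorm after each residual addition), `x_pre` accumulates raw sub-layer outputs as in Pre-LN, and the two streams are fused at the output. The attention-only sub-layers, the class names, and the exact fusion step are simplifying assumptions, not the official implementation; see the repository linked above for the reference code.

```python
import torch
import torch.nn as nn


class ResiDualBlock(nn.Module):
    """One attention-only sub-layer with dual residual streams (Pre-Post-LN).

    Sketch of the abstract's description: `x_post` follows the Post-LN rule
    (LayerNorm *after* the residual add), while `x_pre` accumulates the raw
    sub-layer outputs as a Pre-LN-style residual stream.
    """

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x_post: torch.Tensor, x_pre: torch.Tensor):
        out, _ = self.attn(x_post, x_post, x_post)  # sub-layer reads the Post-LN stream
        x_post = self.norm(x_post + out)            # Post-LN: normalize after the add
        x_pre = x_pre + out                         # Pre-LN-style raw accumulation
        return x_post, x_pre


class ResiDualEncoder(nn.Module):
    """Stack of ResiDual blocks; the two streams are fused at the output."""

    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.blocks = nn.ModuleList([ResiDualBlock(d_model) for _ in range(n_layers)])
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_post, x_pre = x, x
        for block in self.blocks:
            x_post, x_pre = block(x_post, x_pre)
        # Assumed fusion: Post-LN output plus the normalized Pre-LN accumulator.
        return x_post + self.final_norm(x_pre)


if __name__ == "__main__":
    enc = ResiDualEncoder(d_model=64, n_layers=6)
    print(enc(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```

In this sketch, the Post-LN stream keeps the per-layer normalization that the abstract links to diverse representations, while the Pre-LN-style accumulator preserves an un-normalized path, which the abstract links to a lower-bounded gradient.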
Related papers
- Peri-LN: Revisiting Layer Normalization in the Transformer Architecture [57.08322913112157]
Pre-LN and Post-LN have long dominated standard practices despite their limitations in large-scale training.
Several open-source large-scale models have recently begun silently adopting a third strategy without much explanation.
We show that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and more stable convergence (a sketch of the presumed placement follows this entry).
arXiv Detail & Related papers (2025-02-04T21:29:47Z)
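The summary does not spell out where the normalization is placed. One reading of a "peripheral" placement, consistent with recent open models that normalize both the input and the output of each sub-layer inside the residual branch, is sketched below in PyTorch; the class name, the attention-only sub-layer, and the update rule are hypothetical illustrations, not code from the Peri-LN paper.

```python
import torch
import torch.nn as nn


class PeriLNBlock(nn.Module):
    """Hypothetical peri-LN sub-layer: normalize both module input and output.

    Assumed update (our reading, not the paper's code):
        x_{l+1} = x_l + LN_out(F(LN_in(x_l)))
    """

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_in = nn.LayerNorm(d_model)   # Pre-LN-style input normalization
        self.norm_out = nn.LayerNorm(d_model)  # additional output normalization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm_in(x)
        out, _ = self.attn(h, h, h)
        return x + self.norm_out(out)  # the residual add itself stays un-normalized
```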
- Were RNNs All We Needed? [55.822693848969855]
In this work, we revisit sequence modelling from a historical perspective, focusing on Recurrent Neural Networks (RNNs).
We demonstrate that by simplifying these models, we can derive minimal versions (minLSTMs and minGRUs) that use fewer parameters than their traditional counterparts, are fully parallelizable during training, and achieve surprisingly competitive performance on a range of tasks, rivalling recent models including Transformers.
arXiv Detail & Related papers (2024-10-02T03:06:49Z)
- Practical Computational Power of Linear Transformers and Their Recurrent and Self-Referential Extensions [15.793406740545024]
We study auto-regressive Transformers with linearised attention, a.k.a. linear Transformers (LTs) or Fast Weight Programmers (FWPs).
LTs are special in the sense that they are equivalent to RNN-like sequence processors with a fixed-size state, while they can also be expressed as the now-popular self-attention networks.
arXiv Detail & Related papers (2023-10-24T17:17:01Z)
- Re^2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization [65.33914980022303]
Temporal action localization (TAL) requires long-form reasoning to predict actions of various durations and complex content.
Most methods can only train on pre-extracted features without optimizing them for the localization problem.
We propose a novel end-to-end method Re2TAL, which rewires pretrained video backbones for reversible TAL.
arXiv Detail & Related papers (2022-11-25T12:17:30Z)
- Characterization of anomalous diffusion through convolutional transformers [0.8984888893275713]
We propose a new transformer-based neural network architecture for the characterization of anomalous diffusion.
Our new architecture, the Convolutional Transformer (ConvTransformer), uses a bi-layered convolutional neural network to extract features from our diffusive trajectories.
We show that the ConvTransformer is able to outperform the previous state of the art at determining the underlying diffusive regime in short trajectories.
arXiv Detail & Related papers (2022-10-10T18:53:13Z)
- Unified Normalization for Accelerating and Stabilizing Transformers [35.07454490355906]
Layer Normalization (LN) normalizes activations within each token to boost robustness.
LN requires on-the-fly statistics calculation at inference time, as well as division and square-root operations.
We propose Unified Normalization (UN), which can speed up inference by being fused with other linear operations (a fusion sketch follows this entry).
arXiv Detail & Related papers (2022-08-02T08:41:31Z)
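To illustrate why fixed (offline) statistics make such fusion possible, the sketch below folds an affine normalization with frozen mean and variance into the following linear layer. UN's actual statistic estimation and training scheme differ; the names `mean`, `var`, `gamma`, `beta`, and `fused` are illustrative only.

```python
import torch
import torch.nn as nn

# Hypothetical illustration: once normalization statistics are fixed (no
# on-the-fly mean/variance at inference), normalization is an affine map
# and can be folded into the next linear layer.

d = 8
mean = torch.randn(d)              # fixed (offline) statistics, assumed known
var = torch.rand(d) + 0.5
gamma, beta = torch.randn(d), torch.randn(d)
eps = 1e-5

linear = nn.Linear(d, d)

def norm_then_linear(x: torch.Tensor) -> torch.Tensor:
    """Reference path: explicit normalization followed by the linear layer."""
    x_hat = gamma * (x - mean) / torch.sqrt(var + eps) + beta
    return linear(x_hat)

# Fold the affine normalization into the linear layer's weight and bias.
scale = gamma / torch.sqrt(var + eps)                  # per-feature scale
fused = nn.Linear(d, d)
with torch.no_grad():
    fused.weight.copy_(linear.weight * scale)          # W' = W @ diag(scale)
    fused.bias.copy_(linear.bias + linear.weight @ (beta - scale * mean))

x = torch.randn(4, d)
print(torch.allclose(norm_then_linear(x), fused(x), atol=1e-5))  # True
```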
- B2T Connection: Serving Stability and Performance in Deep Transformers [40.44674210101826]
Recent Transformers tend to use Pre-LN because training deep Post-LN Transformers is often unstable, resulting in useless models.
However, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers.
We propose a method that can provide both high stability and effective training by a simple modification of Post-LN.
arXiv Detail & Related papers (2022-06-01T08:43:20Z)
- Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z)
- ENCONTER: Entity Constrained Progressive Sequence Generation via Insertion-based Transformer [11.310502327308575]
Autoregressive language models do not perform well under hard lexical constraints.
Progressive insertion-based transformers can overcome this limitation.
The paper proposes the Entity-constrained insertion transformer (ENCONTER).
Our experiments show that ENCONTER outperforms other baseline models in several performance metrics.
arXiv Detail & Related papers (2021-03-17T10:24:10Z)
- Neural Networks are Convex Regularizers: Exact Polynomial-time Convex Optimization Formulations for Two-layer Networks [70.15611146583068]
We develop exact convex optimization representations of training two-layer neural networks with rectified linear units (ReLUs).
Our theory utilizes semi-infinite duality and minimum norm regularization.
arXiv Detail & Related papers (2020-02-24T21:32:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.