On Layer Normalization in the Transformer Architecture
- URL: http://arxiv.org/abs/2002.04745v2
- Date: Mon, 29 Jun 2020 07:55:12 GMT
- Title: On Layer Normalization in the Transformer Architecture
- Authors: Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen
Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu
- Abstract summary: We first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters.
We show in experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines.
- Score: 112.40350994368741
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer is widely used in natural language processing tasks. To train
a Transformer, however, one usually needs a carefully designed learning rate
warm-up stage, which is shown to be crucial to the final performance but slows
down the optimization and requires more hyper-parameter tuning. In this
paper, we first study theoretically why the learning rate warm-up stage is
essential and show that the location of layer normalization matters.
Specifically, we prove with mean field theory that at initialization, for the
original-designed Post-LN Transformer, which places the layer normalization
between the residual blocks, the expected gradients of the parameters near the
output layer are large. Therefore, using a large learning rate on those
gradients makes the training unstable. The warm-up stage is practically helpful
for avoiding this problem. On the other hand, our theory also shows that if the
layer normalization is put inside the residual blocks (recently proposed as
Pre-LN Transformer), the gradients are well-behaved at initialization. This
motivates us to remove the warm-up stage for the training of Pre-LN
Transformers. We show in our experiments that Pre-LN Transformers without the
warm-up stage can reach comparable results with baselines while requiring
significantly less training time and hyper-parameter tuning on a wide range of
applications.
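To make the placement concrete, here is a minimal PyTorch sketch (not the authors' implementation; module names and sizes are illustrative) of the two variants the abstract contrasts: Post-LN applies layer normalization after each residual addition, while Pre-LN applies it inside the residual branch.

```python
# Minimal sketch contrasting the two layer normalization placements discussed
# in the paper. Module names and sizes are illustrative defaults.
import torch.nn as nn

def _ffn(d_model, d_ff):
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

class PostLNLayer(nn.Module):
    """Original design: LayerNorm sits between the residual blocks."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = _ffn(d_model, d_ff)
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])  # LN after the residual add
        return self.ln2(x + self.ffn(x))

class PreLNLayer(nn.Module):
    """Pre-LN variant: LayerNorm moves inside the residual branch."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = _ffn(d_model, d_ff)
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # identity path stays un-normalized
        return x + self.ffn(self.ln2(x))
```

In the paper's analysis, the Pre-LN arrangement keeps gradients well-behaved at initialization, which is what permits dropping the warm-up stage; the sketch only illustrates where the normalization sits, not the gradient analysis itself.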
Related papers
- Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers [2.1415873597974286]
Per-example gradient norms are a vital ingredient for estimating gradient noise scale (GNS) with minimal variance.
We propose a method with minimal FLOPs in 3D or greater tensor regimes by simultaneously computing the norms while computing the parameter gradients.
We find that the total GNS of contemporary transformer models is predicted well by the GNS of only the normalization layers; a hedged numerical sketch of the GNS estimator appears after this list.
arXiv Detail & Related papers (2024-11-01T19:50:00Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms, such as low-rank computation, achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability [34.43255978863601]
Several works suggest that transformers learn a mesa-optimizer during autoregressive training.
We show that a stronger assumption related to the moments of the data is the sufficient and necessary condition under which the learned mesa-optimizer can perform.
arXiv Detail & Related papers (2024-05-27T05:41:06Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Transformers learn to implement preconditioned gradient descent for
in-context learning [41.74394657009037]
Several recent works demonstrate that transformers can implement algorithms like gradient descent.
We ask: Can transformers learn to implement such algorithms by training over random problem instances?
For a transformer with $L$ attention layers, we prove that certain critical points of the training objective implement $L$ iterations of preconditioned gradient descent; a worked numerical sketch of such an update appears after this list.
arXiv Detail & Related papers (2023-06-01T02:35:57Z) - Deep Transformers without Shortcuts: Modifying Self-attention for
Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous in the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z) - The Expressive Power of Tuning Only the Normalization Layers [5.779559262502591]
Feature normalization transforms such as Batch and Layer-Normalization have become indispensable ingredients of state-of-the-art deep neural networks.
Recent studies on fine-tuning large pretrained models indicate that just tuning the parameters of these affine transforms can achieve high accuracy for downstream tasks.
We show that for random ReLU networks, fine-tuning only their normalization layers can reconstruct any target network that is $O(\sqrt{\text{width}})$ times smaller; a minimal freezing sketch appears after this list.
arXiv Detail & Related papers (2023-02-15T20:44:31Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
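The first related paper above concerns estimating the gradient noise scale (GNS) from per-example gradient norms. As a hedged illustration only, the sketch below uses the standard unbiased estimators of the gradient second moment and noise trace; the paper's specific contribution (restricting the computation to normalization-layer gradients and obtaining the norms alongside the backward pass) is not reproduced, and `per_example_grads` is simply whatever flattened gradients you supply.

```python
# Hedged sketch: estimating the gradient noise scale (GNS) from per-example
# gradients, using the standard unbiased estimators of ||G||^2 and tr(Sigma).
import numpy as np

def gradient_noise_scale(per_example_grads: np.ndarray) -> float:
    """per_example_grads: shape (B, P), one flattened gradient per example."""
    B = per_example_grads.shape[0]
    mean_grad = per_example_grads.mean(axis=0)                   # batch gradient
    mean_sq_norm = (per_example_grads ** 2).sum(axis=1).mean()   # E_i ||g_i||^2
    batch_sq_norm = (mean_grad ** 2).sum()                       # ||g_bar||^2
    # Unbiased estimates of the noise trace and the true-gradient norm squared.
    trace_sigma = (mean_sq_norm - batch_sq_norm) * B / (B - 1)
    true_sq_norm = batch_sq_norm - trace_sigma / B
    return float(trace_sigma / true_sq_norm)

# Toy usage with synthetic per-example gradients.
rng = np.random.default_rng(0)
g = rng.normal(loc=0.1, scale=1.0, size=(256, 1000))
print(gradient_noise_scale(g))
```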
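Two entries above ("Transformers learn to implement preconditioned gradient descent for in-context learning" and "Transformers learn in-context by gradient descent") describe transformers emulating gradient descent in their forward pass. The sketch below unpacks what $L$ iterations of preconditioned gradient descent look like on an in-context linear-regression prompt; the preconditioner used here is a plain illustrative choice, not the construction proved in those papers.

```python
# Hedged sketch: L steps of preconditioned gradient descent on a least-squares
# problem built from an in-context prompt (x_1, y_1, ..., x_n, y_n). The update
# is w <- w - lr * P @ grad; the preconditioner P below (regularized inverse
# empirical covariance) is illustrative only.
import numpy as np

def preconditioned_gd(X: np.ndarray, y: np.ndarray, L: int, lr: float = 1.0) -> np.ndarray:
    """Run L preconditioned GD steps on the loss 0.5 * ||X w - y||^2 / n."""
    n, d = X.shape
    P = np.linalg.inv(X.T @ X / n + 1e-3 * np.eye(d))  # illustrative preconditioner
    w = np.zeros(d)
    for _ in range(L):
        grad = X.T @ (X @ w - y) / n
        w = w - lr * P @ grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
w_star = rng.normal(size=4)
y = X @ w_star
print(np.linalg.norm(preconditioned_gd(X, y, L=3) - w_star))  # error shrinks quickly with L
```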
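For "The Expressive Power of Tuning Only the Normalization Layers", here is a minimal sketch of the fine-tuning regime that line of work studies: freeze every parameter except the affine parameters of the normalization layers. The model and optimizer are illustrative stand-ins, not those used in the paper.

```python
# Hedged sketch: train only the affine (weight/bias) parameters of LayerNorm
# modules, freezing everything else. Model and optimizer are illustrative.
import torch
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
    num_layers=2,
)

for module in model.modules():
    keep = isinstance(module, nn.LayerNorm)
    for p in module.parameters(recurse=False):
        p.requires_grad = keep

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
print(sum(p.numel() for p in trainable), "trainable parameters (LayerNorm affines only)")
```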
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.