On Layer Normalization in the Transformer Architecture
- URL: http://arxiv.org/abs/2002.04745v2
- Date: Mon, 29 Jun 2020 07:55:12 GMT
- Title: On Layer Normalization in the Transformer Architecture
- Authors: Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen
Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu
- Abstract summary: We first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters.
We show in experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines.
- Score: 112.40350994368741
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer is widely used in natural language processing tasks. To train
a Transformer, however, one usually needs a carefully designed learning rate
warm-up stage, which is shown to be crucial to the final performance but slows
down the optimization and requires more hyper-parameter tuning. In this
paper, we first study theoretically why the learning rate warm-up stage is
essential and show that the location of layer normalization matters.
Specifically, we prove with mean field theory that at initialization, for the
original-designed Post-LN Transformer, which places the layer normalization
between the residual blocks, the expected gradients of the parameters near the
output layer are large. Therefore, using a large learning rate on those
gradients makes the training unstable. The warm-up stage is practically helpful
for avoiding this problem. On the other hand, our theory also shows that if the
layer normalization is put inside the residual blocks (recently proposed as
Pre-LN Transformer), the gradients are well-behaved at initialization. This
motivates us to remove the warm-up stage for the training of Pre-LN
Transformers. We show in our experiments that Pre-LN Transformers without the
warm-up stage can reach comparable results with baselines while requiring
significantly less training time and hyper-parameter tuning on a wide range of
applications.
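To make the placement concrete, here is a minimal PyTorch sketch (not the authors' implementation; module names and sizes are illustrative) of the two variants the abstract contrasts: Post-LN applies layer normalization after each residual addition, while Pre-LN applies it inside the residual branch.

```python
# Minimal sketch contrasting the two layer normalization placements discussed
# in the paper. Module names and sizes are illustrative defaults.
import torch.nn as nn

def _ffn(d_model, d_ff):
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

class PostLNLayer(nn.Module):
    """Original design: LayerNorm sits between the residual blocks."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = _ffn(d_model, d_ff)
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])  # LN after the residual add
        return self.ln2(x + self.ffn(x))

class PreLNLayer(nn.Module):
    """Pre-LN variant: LayerNorm moves inside the residual branch."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = _ffn(d_model, d_ff)
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # identity path stays un-normalized
        return x + self.ffn(self.ln2(x))
```

In the paper's analysis, the Pre-LN arrangement keeps gradients well-behaved at initialization, which is what permits dropping the warm-up stage; the sketch only illustrates where the normalization sits, not the gradient analysis itself.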
Related papers
- Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers [2.1415873597974286]
Per-example gradient norms are a vital ingredient for estimating gradient noise scale (GNS) with minimal variance.
We propose a method with minimal FLOPs in 3D or greater tensor regimes by simultaneously computing the norms while computing the parameter gradients.
We find that the total GNS of contemporary transformer models is predicted well by the GNS of only the normalization layers; a hedged numerical sketch of the GNS estimator appears after this list.
arXiv Detail & Related papers (2024-11-01T19:50:00Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms, such as low-rank computation, achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability [34.43255978863601]
Several works suggest that transformers learn a mesa-optimizer during autoregressive training.
We show that a stronger assumption related to the moments of the data is the sufficient and necessary condition under which the learned mesa-optimizer can perform.
arXiv Detail & Related papers (2024-05-27T05:41:06Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Transformers learn to implement preconditioned gradient descent for
in-context learning [41.74394657009037]
Several recent works demonstrate that transformers can implement algorithms like gradient descent.
We ask: Can transformers learn to implement such algorithms by training over random problem instances?
For a transformer with $L$ attention layers, we prove that certain critical points of the training objective implement $L$ iterations of preconditioned gradient descent; a worked numerical sketch of such an update appears after this list.
arXiv Detail & Related papers (2023-06-01T02:35:57Z) - Deep Transformers without Shortcuts: Modifying Self-attention for
Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous in the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z) - The Expressive Power of Tuning Only the Normalization Layers [5.779559262502591]
Feature normalization transforms such as Batch and Layer-Normalization have become indispensable ingredients of state-of-the-art deep neural networks.
Recent studies on fine-tuning large pretrained models indicate that just tuning the parameters of these affine transforms can achieve high accuracy for downstream tasks.
We show that for random ReLU networks, fine-tuning only their normalization layers can reconstruct any target network that is $O(\sqrt{\text{width}})$ times smaller; a minimal freezing sketch appears after this list.
arXiv Detail & Related papers (2023-02-15T20:44:31Z) - Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z) - Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
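The first related paper above concerns estimating the gradient noise scale (GNS) from per-example gradient norms. As a hedged illustration only, the sketch below uses the standard unbiased estimators of the gradient second moment and noise trace; the paper's specific contribution (restricting the computation to normalization-layer gradients and obtaining the norms alongside the backward pass) is not reproduced, and `per_example_grads` is simply whatever flattened gradients you supply.

```python
# Hedged sketch: estimating the gradient noise scale (GNS) from per-example
# gradients, using the standard unbiased estimators of ||G||^2 and tr(Sigma).
import numpy as np

def gradient_noise_scale(per_example_grads: np.ndarray) -> float:
    """per_example_grads: shape (B, P), one flattened gradient per example."""
    B = per_example_grads.shape[0]
    mean_grad = per_example_grads.mean(axis=0)                   # batch gradient
    mean_sq_norm = (per_example_grads ** 2).sum(axis=1).mean()   # E_i ||g_i||^2
    batch_sq_norm = (mean_grad ** 2).sum()                       # ||g_bar||^2
    # Unbiased estimates of the noise trace and the true-gradient norm squared.
    trace_sigma = (mean_sq_norm - batch_sq_norm) * B / (B - 1)
    true_sq_norm = batch_sq_norm - trace_sigma / B
    return float(trace_sigma / true_sq_norm)

# Toy usage with synthetic per-example gradients.
rng = np.random.default_rng(0)
g = rng.normal(loc=0.1, scale=1.0, size=(256, 1000))
print(gradient_noise_scale(g))
```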
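Two entries above ("Transformers learn to implement preconditioned gradient descent for in-context learning" and "Transformers learn in-context by gradient descent") describe transformers emulating gradient descent in their forward pass. The sketch below unpacks what $L$ iterations of preconditioned gradient descent look like on an in-context linear-regression prompt; the preconditioner used here is a plain illustrative choice, not the construction proved in those papers.

```python
# Hedged sketch: L steps of preconditioned gradient descent on a least-squares
# problem built from an in-context prompt (x_1, y_1, ..., x_n, y_n). The update
# is w <- w - lr * P @ grad; the preconditioner P below (regularized inverse
# empirical covariance) is illustrative only.
import numpy as np

def preconditioned_gd(X: np.ndarray, y: np.ndarray, L: int, lr: float = 1.0) -> np.ndarray:
    """Run L preconditioned GD steps on the loss 0.5 * ||X w - y||^2 / n."""
    n, d = X.shape
    P = np.linalg.inv(X.T @ X / n + 1e-3 * np.eye(d))  # illustrative preconditioner
    w = np.zeros(d)
    for _ in range(L):
        grad = X.T @ (X @ w - y) / n
        w = w - lr * P @ grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
w_star = rng.normal(size=4)
y = X @ w_star
print(np.linalg.norm(preconditioned_gd(X, y, L=3) - w_star))  # error shrinks quickly with L
```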
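For "The Expressive Power of Tuning Only the Normalization Layers", here is a minimal sketch of the fine-tuning regime that line of work studies: freeze every parameter except the affine parameters of the normalization layers. The model and optimizer are illustrative stand-ins, not those used in the paper.

```python
# Hedged sketch: train only the affine (weight/bias) parameters of LayerNorm
# modules, freezing everything else. Model and optimizer are illustrative.
import torch
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
    num_layers=2,
)

for module in model.modules():
    keep = isinstance(module, nn.LayerNorm)
    for p in module.parameters(recurse=False):
        p.requires_grad = keep

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
print(sum(p.numel() for p in trainable), "trainable parameters (LayerNorm affines only)")
```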
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.