B2T Connection: Serving Stability and Performance in Deep Transformers
- URL: http://arxiv.org/abs/2206.00330v2
- Date: Fri, 26 May 2023 09:16:22 GMT
- Title: B2T Connection: Serving Stability and Performance in Deep Transformers
- Authors: Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki
- Abstract summary: Recent Transformers tend to be Pre-LN because, in Post-LN with deep Transformers, the training is often unstable, resulting in useless models.
Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers.
We propose a method that can provide both high stability and effective training by a simple modification of Post-LN.
- Score: 40.44674210101826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: From the perspective of the layer normalization (LN) positions, the
architectures of Transformers can be categorized into two types: Post-LN and
Pre-LN. Recent Transformers tend to be Pre-LN because, in Post-LN with deep
Transformers (e.g., those with ten or more layers), the training is often
unstable, resulting in useless models. However, Post-LN has consistently
achieved better performance than Pre-LN in relatively shallow Transformers
(e.g., those with six or fewer layers). This study first investigates the
reason for these discrepant observations empirically and theoretically, and
makes the following discoveries: 1) the LN in Post-LN is the main source of the
vanishing gradient problem that leads to unstable training, whereas Pre-LN
prevents it, and 2) Post-LN tends to preserve larger gradient norms in higher
layers during back-propagation, which may lead to more effective training.
Exploiting the new findings, we propose a method that can provide both high
stability and effective training by a simple modification of Post-LN. We
conduct experiments on a wide range of text generation tasks. The experimental
results demonstrate that our method outperforms Pre-LN, and enables stable
training regardless of the shallow or deep layer settings. Our code is publicly
available at https://github.com/takase/b2t_connection.
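The LN placements discussed in the abstract can be sketched in a few lines. Below is a minimal numpy illustration of the Post-LN and Pre-LN sublayer orderings, plus one plausible reading of the B2T (bottom-to-top) connection: the layer input is routed around the intermediate LN and added just before the layer's final LN. The helper names, toy sublayers, and the exact placement of the extra residual are illustrative assumptions, not the authors' code (see their repository for the real implementation).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean / unit variance (no learned affine)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_layer(x, attn, ffn):
    """Post-LN (original Transformer): LN is applied after each residual sum,
    so every gradient path passes through LN -- the source of vanishing gradients."""
    h = layer_norm(x + attn(x))
    return layer_norm(h + ffn(h))

def pre_ln_layer(x, attn, ffn):
    """Pre-LN: LN is applied to the sublayer input; the residual path stays
    an identity, which keeps training stable in deep stacks."""
    h = x + attn(layer_norm(x))
    return h + ffn(layer_norm(h))

def b2t_layer(x, attn, ffn):
    """Post-LN plus a bottom-to-top connection (sketch): the layer input x is
    added again just before the layer's final LN, giving gradients a shortcut
    that bypasses the intermediate LN."""
    h = layer_norm(x + attn(x))
    return layer_norm(h + ffn(h) + x)  # the extra +x is the B2T connection
```

A toy usage: `b2t_layer(x, lambda z: 0.5 * z, np.tanh)` keeps the Post-LN structure while the added skip term changes which paths the gradient can take.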
Related papers
- You can remove GPT2's LayerNorm by fine-tuning [0.0]
The LayerNorm (LN) layer in GPT-style transformer models has long been a hindrance to mechanistic interpretability.
LN is a crucial component required to stabilize the training of large language models.
We show that it is possible to remove the LN layers from a pre-trained GPT2-small model by fine-tuning on a fraction (500M tokens) of the training data.
arXiv Detail & Related papers (2024-09-06T16:17:06Z)
- ResiDual: Transformer with Dual Residual Connections [106.38073506751003]
Two widely used variants are the Post-Layer-Normalization (Post-LN) and Pre-Layer-Normalization (Pre-LN)
Post-LN causes gradient vanishing issue that hinders training deep Transformers, and Pre-LN causes representation collapse issue that limits model capacity.
We propose ResiDual, a novel Transformer architecture with Pre-Post-LN (PPLN), which fuses the connections of Post-LN and Pre-LN together.
arXiv Detail & Related papers (2023-04-28T12:19:47Z)
- Unified Normalization for Accelerating and Stabilizing Transformers [35.07454490355906]
Layer Normalization (LN) normalizes activations within each token to boost robustness.
LN requires on-the-fly statistics calculation in inference as well as division and square root operations.
We propose Unified Normalization (UN), which can speed up the inference by being fused with other linear operations.
arXiv Detail & Related papers (2022-08-02T08:41:31Z)
- DeepNet: Scaling Transformers to 1,000 Layers [106.33669415337135]
We introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer.
In-depth theoretical analysis shows that model updates can be bounded in a stable way.
We successfully scale Transformers up to 1,000 layers without difficulty, which is one order of magnitude deeper than previous deep Transformers.
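DeepNorm's core change is small enough to sketch: it up-scales the residual branch before the Post-LN normalization. A minimal numpy sketch follows; the constant shown for an N-layer encoder, alpha = (2N)^(1/4), and the omission of DeepNorm's matching down-scaled weight initialization are assumptions based on the paper's description, not its released code.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean / unit variance (no learned affine)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def deepnorm_layer(x, sublayer, alpha):
    """DeepNorm residual (sketch): scale the skip path by alpha before the
    Post-LN normalization, bounding the size of per-layer model updates.
    DeepNorm also pairs this with down-scaled initialization, omitted here."""
    return layer_norm(alpha * x + sublayer(x))
```

For a 12-layer encoder this sketch would use `alpha = (2 * 12) ** 0.25`; the point is only the `alpha * x` scaling inside an otherwise ordinary Post-LN layer.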
arXiv Detail & Related papers (2022-03-01T15:36:38Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize the early stage's training and unleash its full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
- PowerNorm: Rethinking Batch Normalization in Transformers [96.14956636022957]
The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN).
LN is preferred due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks.
We propose Power Normalization (PN), a novel normalization scheme that resolves this issue.
arXiv Detail & Related papers (2020-03-17T17:50:26Z)
- On Layer Normalization in the Transformer Architecture [112.40350994368741]
We first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters.
We show in experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines.
arXiv Detail & Related papers (2020-02-12T00:33:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.