B2T Connection: Serving Stability and Performance in Deep Transformers
- URL: http://arxiv.org/abs/2206.00330v2
- Date: Fri, 26 May 2023 09:16:22 GMT
- Title: B2T Connection: Serving Stability and Performance in Deep Transformers
- Authors: Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki
- Abstract summary: Recent Transformers tend to be Pre-LN because, in Post-LN with deep Transformers, the training is often unstable, resulting in useless models.
Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers.
We propose a method that can provide both high stability and effective training by a simple modification of Post-LN.
- Score: 40.44674210101826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: From the perspective of the layer normalization (LN) positions, the
architectures of Transformers can be categorized into two types: Post-LN and
Pre-LN. Recent Transformers tend to be Pre-LN because, in Post-LN with deep
Transformers (e.g., those with ten or more layers), the training is often
unstable, resulting in useless models. However, Post-LN has consistently
achieved better performance than Pre-LN in relatively shallow Transformers
(e.g., those with six or fewer layers). This study first investigates the
reason for these discrepant observations empirically and theoretically, and
makes the following discoveries: 1) the LN in Post-LN is the main source of the
vanishing gradient problem that leads to unstable training, whereas Pre-LN
prevents it, and 2) Post-LN tends to preserve larger gradient norms in higher
layers during back-propagation, which may lead to more effective training.
Exploiting the new findings, we propose a method that can provide both high
stability and effective training by a simple modification of Post-LN. We
conduct experiments on a wide range of text generation tasks. The experimental
results demonstrate that our method outperforms Pre-LN, and enables stable
training regardless of the shallow or deep layer settings. Our code is publicly
available at https://github.com/takase/b2t_connection.
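The LN placements discussed in the abstract can be sketched in a few lines. Below is a minimal numpy illustration of the Post-LN and Pre-LN sublayer orderings, plus one plausible reading of the B2T (bottom-to-top) connection: the layer input is routed around the intermediate LN and added just before the layer's final LN. The helper names, toy sublayers, and the exact placement of the extra residual are illustrative assumptions, not the authors' code (see their repository for the real implementation).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean / unit variance (no learned affine)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_layer(x, attn, ffn):
    """Post-LN (original Transformer): LN is applied after each residual sum,
    so every gradient path passes through LN -- the source of vanishing gradients."""
    h = layer_norm(x + attn(x))
    return layer_norm(h + ffn(h))

def pre_ln_layer(x, attn, ffn):
    """Pre-LN: LN is applied to the sublayer input; the residual path stays
    an identity, which keeps training stable in deep stacks."""
    h = x + attn(layer_norm(x))
    return h + ffn(layer_norm(h))

def b2t_layer(x, attn, ffn):
    """Post-LN plus a bottom-to-top connection (sketch): the layer input x is
    added again just before the layer's final LN, giving gradients a shortcut
    that bypasses the intermediate LN."""
    h = layer_norm(x + attn(x))
    return layer_norm(h + ffn(h) + x)  # the extra +x is the B2T connection
```

A toy usage: `b2t_layer(x, lambda z: 0.5 * z, np.tanh)` keeps the Post-LN structure while the added skip term changes which paths the gradient can take.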
Related papers
- You can remove GPT2's LayerNorm by fine-tuning [0.0]
The LayerNorm (LN) layer in GPT-style transformer models has long been a hindrance to mechanistic interpretability.
LN is a crucial component required to stabilize the training of large language models.
We show that it is possible to remove the LN layers from a pre-trained GPT2-small model by fine-tuning on a fraction (500M tokens) of the training data.
arXiv Detail & Related papers (2024-09-06T16:17:06Z)
- ResiDual: Transformer with Dual Residual Connections [106.38073506751003]
Two widely used variants are the Post-Layer-Normalization (Post-LN) and Pre-Layer-Normalization (Pre-LN)
Post-LN causes gradient vanishing issue that hinders training deep Transformers, and Pre-LN causes representation collapse issue that limits model capacity.
We propose ResiDual, a novel Transformer architecture with Pre-Post-LN (PPLN), which fuses the connections of Post-LN and Pre-LN together.
arXiv Detail & Related papers (2023-04-28T12:19:47Z)
- Unified Normalization for Accelerating and Stabilizing Transformers [35.07454490355906]
Layer Normalization (LN) normalizes activations within each token to boost robustness.
LN requires on-the-fly statistics calculation in inference as well as division and square root operations.
We propose Unified Normalization (UN), which can speed up the inference by being fused with other linear operations.
arXiv Detail & Related papers (2022-08-02T08:41:31Z)
- DeepNet: Scaling Transformers to 1,000 Layers [106.33669415337135]
We introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer.
In-depth theoretical analysis shows that model updates can be bounded in a stable way.
We successfully scale Transformers up to 1,000 layers without difficulty, which is one order of magnitude deeper than previous deep Transformers.
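DeepNorm's core change is small enough to sketch: it up-scales the residual branch before the Post-LN normalization. A minimal numpy sketch follows; the constant shown for an N-layer encoder, alpha = (2N)^(1/4), and the omission of DeepNorm's matching down-scaled weight initialization are assumptions based on the paper's description, not its released code.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean / unit variance (no learned affine)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def deepnorm_layer(x, sublayer, alpha):
    """DeepNorm residual (sketch): scale the skip path by alpha before the
    Post-LN normalization, bounding the size of per-layer model updates.
    DeepNorm also pairs this with down-scaled initialization, omitted here."""
    return layer_norm(alpha * x + sublayer(x))
```

For a 12-layer encoder this sketch would use `alpha = (2 * 12) ** 0.25`; the point is only the `alpha * x` scaling inside an otherwise ordinary Post-LN layer.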
arXiv Detail & Related papers (2022-03-01T15:36:38Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize the early stage's training and unleash its full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
- PowerNorm: Rethinking Batch Normalization in Transformers [96.14956636022957]
The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN).
LN is preferred due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks.
We propose Power Normalization (PN), a novel normalization scheme that resolves this issue.
arXiv Detail & Related papers (2020-03-17T17:50:26Z)
- On Layer Normalization in the Transformer Architecture [112.40350994368741]
We first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters.
We show in experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines.
arXiv Detail & Related papers (2020-02-12T00:33:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.