Stability of Transformers under Layer Normalization
- URL: http://arxiv.org/abs/2510.09904v1
- Date: Fri, 10 Oct 2025 22:27:20 GMT
- Title: Stability of Transformers under Layer Normalization
- Authors: Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Krishna Kumar, Markos A. Katsoulakis
- Abstract summary: We study the stability of deep Transformers under different layer normalization placements. We derive explicit bounds on the growth of hidden states in trained Transformers. Our framework provides a principled way to sanity-check the stability of Transformers under new architectural modifications.
- Score: 7.235320241343618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite their widespread use, training deep Transformers can be unstable. Layer normalization, a standard component, improves training stability, but its placement has often been ad-hoc. In this paper, we conduct a principled study on the forward (hidden states) and backward (gradient) stability of Transformers under different layer normalization placements. Our theory provides key insights into the training dynamics: whether training drives Transformers toward regular solutions or pathological behaviors. For forward stability, we derive explicit bounds on the growth of hidden states in trained Transformers. For backward stability, we analyze how layer normalization affects the backpropagation of gradients, thereby explaining the training dynamics of each layer normalization placement. Our analysis also guides the scaling of residual steps in Transformer blocks, where appropriate choices can further improve stability and performance. Our numerical results corroborate our theoretical findings. Beyond these results, our framework provides a principled way to sanity-check the stability of Transformers under new architectural modifications, offering guidance for future designs.
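The placements the abstract refers to can be illustrated with a toy forward pass. The sketch below is not the paper's analysis or code: it assumes an untrained stand-in sublayer (a random linear map followed by tanh), and the names `pre_ln_block`, `post_ln_block`, `run`, and the `step` residual scale are hypothetical, chosen only for illustration. It records the hidden-state norm after each block, so the two common placements (Pre-LN and Post-LN) and the effect of a smaller residual step can be compared.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x, weights):
    """Toy stand-in for attention/MLP (an assumption): linear map followed by tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def pre_ln_block(x, weights, step=1.0):
    # Pre-LN: x + step * F(LN(x)); the residual stream itself is never normalized.
    f = sublayer(layer_norm(x), weights)
    return [xi + step * fi for xi, fi in zip(x, f)]


def post_ln_block(x, weights, step=1.0):
    # Post-LN: LN(x + step * F(x)); the output is renormalized after every block.
    f = sublayer(x, weights)
    return layer_norm([xi + step * fi for xi, fi in zip(x, f)])


def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))


def run(block, depth=24, dim=16, step=1.0, seed=0):
    """Apply `depth` blocks with fresh random weights; return the norm after each."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norms = [norm(x)]
    for _ in range(depth):
        weights = [
            [rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]
            for _ in range(dim)
        ]
        x = block(x, weights, step)
        norms.append(norm(x))
    return norms


pre = run(pre_ln_block)                   # Pre-LN, full residual step
post = run(post_ln_block)                 # Post-LN, full residual step
pre_small = run(pre_ln_block, step=0.1)   # Pre-LN, scaled-down residual step
print("Pre-LN final norm:            ", pre[-1])
print("Post-LN final norm:           ", post[-1])
print("Pre-LN final norm (step=0.1): ", pre_small[-1])
```

In this toy setting the Post-LN norm stays pinned near sqrt(dim) because the output is renormalized at every block, while the Pre-LN norm can change by at most `step * sqrt(dim)` per block, since each tanh output is bounded by 1. That bound is a property of this sketch only and should not be read as the explicit bounds derived in the paper.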
Related papers
- Vision Transformer Finetuning Benefits from Non-Smooth Components [13.900418575589134]
We analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their plasticity. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance.
arXiv Detail & Related papers (2026-02-06T17:12:22Z)
- A Constrained Optimization Perspective of Unrolled Transformers [77.12297732942095]
We introduce a constrained optimization framework for training transformers that behave like optimization descent algorithms. We observe that constrained transformers achieve stronger robustness to perturbations and maintain higher out-of-distribution generalization.
arXiv Detail & Related papers (2026-01-24T02:12:39Z)
- OT-Transformer: A Continuous-time Transformer Architecture with Optimal Transport Regularization [1.7180235064112577]
We consider a dynamical system whose governing equation is parametrized by transformer blocks. We leverage optimal transport theory to regularize the training problem, which enhances stability in training and improves generalization of the resulting model.
arXiv Detail & Related papers (2025-01-30T22:52:40Z)
- Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z)
- BranchNorm: Robustly Scaling Extremely Deep Transformers [55.92852268168816]
BranchNorm dynamically rescales the non-residual branch of Transformer in accordance with the training period.
Experimental results on multiple translation tasks demonstrate that BranchNorm achieves a better trade-off between training stability and convergence performance.
arXiv Detail & Related papers (2023-05-04T12:46:12Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize the early stage's training and unleash its full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
- On Layer Normalization in the Transformer Architecture [112.40350994368741]
We first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters.
We show in experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines.
arXiv Detail & Related papers (2020-02-12T00:33:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.