Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation
- URL: http://arxiv.org/abs/2505.24333v2
- Date: Fri, 26 Sep 2025 16:22:08 GMT
- Title: Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation
- Authors: Alessio Giorlandino, Sebastian Goldt
- Abstract summary: Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLP.
- Score: 8.973965016201822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the-constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLP. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward path and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantee smooth training.
Related papers
- Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium [0.6820746164515952]
We introduce the closed-loop prediction principle, which requires that models iteratively refine latent representations until reaching a self-consistent equilibrium. We instantiate this principle as Equilibrium Transformers, which augment standard transformer layers with an Equilibrium Refinement Module. Preliminary experiments on the binary parity task demonstrate +3.28% average improvement on challenging sequences, with gains reaching +8.07% where standard transformers approach random performance.
arXiv Detail & Related papers (2025-11-26T20:02:59Z) - Initialization and training of matrix product state probabilistic models [10.391338066539237]
We investigate a common failure mode in training randomly initialized matrix product states using gradient descent. The trained MPS models do not accurately predict the strong interactions between boundary sites. We propose two complementary strategies to overcome the training failure.
arXiv Detail & Related papers (2025-05-09T20:39:25Z) - How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias [48.9399496805422]
We focus on two representative tasks in the category of regular language recognition, known as "even pairs" and "parity check". Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks.
arXiv Detail & Related papers (2025-05-02T00:07:35Z) - Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis [97.54180451650122]
We study the dynamics of training a shallow transformer on a task of recognizing co-occurrence of two designated words.
We analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear layer.
We prove a novel property of the gradient flow, termed automatic balancing of gradients, which enables the loss values of different samples to decrease almost at the same rate and further facilitates the proof of near minimum training loss.
arXiv Detail & Related papers (2024-10-12T17:50:58Z) - Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show a fast flow on the regression loss despite the gradient non-ity algorithms for our convergence landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z) - Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers [3.686808512438363]
Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. We conduct a rigorous analysis to uncover a spectral gap between the two largest singular values of the attention matrix. We propose a novel, simple, practical solution to rank collapse in width by removing the outlier(s).
arXiv Detail & Related papers (2024-10-10T10:34:18Z) - The Benefit of Being Bayesian in Online Conformal Prediction [7.713245413733777]
We study the online construction of confidence sets given a black-box machine learning model. By converting the target confidence levels into quantile levels, the problem can be reduced to predicting the quantiles of a sequentially revealed data sequence.
arXiv Detail & Related papers (2024-10-03T15:04:47Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms such as low-rank computation have impressive performance for learning Transformer-based adaption.
We analyze how magnitude-based pruning affects generalization while improving adaption.
We conclude that proper magnitude-based pruning has a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Geometric Dynamics of Signal Propagation Predict Trainability of Transformers [22.25628914395565]
We investigate forward signal propagation and gradient back propagation in deep, randomly initialized transformers.
Our approach treats the evolution of $n$ tokens as they propagate through the transformer layers.
We show through experiments that, remarkably, the final test loss at the end of training is well predicted just by these two exponents.
arXiv Detail & Related papers (2024-03-05T01:30:34Z) - Grokking as the Transition from Lazy to Rich Training Dynamics [35.186196991224286]
Grokking occurs when the train loss of a neural network decreases much earlier than its test loss.
Key determinants of grokking are the rate of feature learning and the alignment of the initial features with the target function.
arXiv Detail & Related papers (2023-10-09T19:33:21Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Outlier-robust neural network training: variation regularization meets trimmed loss to prevent functional breakdown [2.5628953713168685]
We tackle the challenge of outlier-robust predictive modeling using highly expressive neural networks. Our approach integrates two key components: (1) a transformed trimmed loss (TTL), and (2) higher-order variation regularization (HOVR), which imposes smoothness constraints on the prediction function.
arXiv Detail & Related papers (2023-08-04T12:57:13Z) - Trained Transformers Learn Linear Models In-Context [39.56636898650966]
Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL).
We show that when transformers are trained over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares.
arXiv Detail & Related papers (2023-06-16T15:50:03Z) - Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
arXiv Detail & Related papers (2023-06-06T09:12:49Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - Characterization of anomalous diffusion through convolutional transformers [0.8984888893275713]
We propose a new transformer based neural network architecture for the characterization of anomalous diffusion.
Our new architecture, the Convolutional Transformer (ConvTransformer), uses a bi-layered convolutional neural network to extract features from our diffusive trajectories.
We show that the ConvTransformer is able to outperform the previous state of the art at determining the underlying diffusive regime in short trajectories.
arXiv Detail & Related papers (2022-10-10T18:53:13Z) - A heteroencoder architecture for prediction of failure locations in porous metals using variational inference [1.2722697496405462]
We employ an encoder-decoder convolutional neural network to predict the failure locations of porous metal tension specimens.
The objective of predicting failure locations presents an extreme case of class imbalance since most of the material in the specimens does not fail.
We demonstrate that the resulting predicted variances are effective in ranking the locations that are most likely to fail in any given specimen.
arXiv Detail & Related papers (2022-01-31T20:26:53Z) - Distribution Mismatch Correction for Improved Robustness in Deep Neural Networks [86.42889611784855]
Normalization methods increase the vulnerability with respect to noise and input corruptions.
We propose an unsupervised non-parametric distribution correction method that adapts the activation distribution of each layer.
In our experiments, we empirically show that the proposed method effectively reduces the impact of intense image corruptions.
arXiv Detail & Related papers (2021-10-05T11:36:25Z) - Biologically Plausible Training Mechanisms for Self-Supervised Learning in Deep Networks [14.685237010856953]
We develop biologically plausible training mechanisms for self-supervised learning (SSL) in deep networks.
We show that learning can be performed with one of two more plausible alternatives to backpropagation.
arXiv Detail & Related papers (2021-09-30T12:56:57Z) - Understanding Self-supervised Learning with Dual Deep Networks [74.92916579635336]
We propose a novel framework to understand contrastive self-supervised learning (SSL) methods that employ dual pairs of deep ReLU networks.
We prove that in each SGD update of SimCLR with various loss functions, the weights at each layer are updated by a covariance operator.
To further study what role the covariance operator plays and which features are learned in such a process, we model data generation and augmentation processes through a hierarchical latent tree model (HLTM).
arXiv Detail & Related papers (2020-10-01T17:51:49Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.