Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation
- URL: http://arxiv.org/abs/2505.24333v1
- Date: Fri, 30 May 2025 08:18:23 GMT
- Title: Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation
- Authors: Alessio Giorlandino, Sebastian Goldt
- Abstract summary: Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse and entropy collapse. Here, we provide an analytical theory of signal propagation through vanilla transformer blocks with self-attention layers.
- Score: 7.2136602534376015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While the right initialisation has been extensively studied in feed-forward networks, an exact description of signal propagation through a full transformer block has so far been lacking. Here, we provide an analytical theory of signal propagation through vanilla transformer blocks with self-attention layers, layer normalisation, skip connections and ReLU MLP. To treat the self-attention layer, we draw on a formal parallel with the Random Energy Model from statistical physics. We identify and characterise two regimes governed by the variance of the query and key initialisations: a low-variance regime, where we recover the known rank collapse behaviour; and a previously unexplored high-variance regime, where signal is preserved but entropy collapse occurs. In the low-variance regime, we calculate the critical strength for the residual connection to ensure signal propagation. Our theory yields trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. Experiments with BERT-style models trained on TinyStories validate our predictions. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantees smooth training.
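The two regimes described above can be probed numerically. The sketch below is a deliberately reduced setting, not the paper's full block: attention layers only (no layer normalisation, skip connections or MLP), with an assumed 1/sqrt(d) parameterisation of the query/key weight scale sigma_qk. It tracks two diagnostics after a stack of randomly initialised layers: the mean pairwise cosine similarity of token representations (close to 1 indicates rank collapse) and the mean row entropy of the final attention matrix (close to 0 indicates entropy collapse).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def propagate(sigma_qk, depth=10, n_tokens=32, d=64, seed=0):
    """Push random tokens through `depth` random self-attention layers (no residual, no LayerNorm)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_tokens, d))
    for _ in range(depth):
        Wq = rng.normal(scale=sigma_qk / np.sqrt(d), size=(d, d))
        Wk = rng.normal(scale=sigma_qk / np.sqrt(d), size=(d, d))
        Wv = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
        A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))   # row-stochastic attention
        X = A @ X @ Wv
    # Rank-collapse diagnostic: mean pairwise cosine similarity of token representations.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = (Xn @ Xn.T)[np.triu_indices(n_tokens, k=1)].mean()
    # Entropy-collapse diagnostic: mean row entropy of the last attention matrix.
    ent = -(A * np.log(A + 1e-12)).sum(axis=1).mean()
    return cos, ent

for sigma_qk in (0.1, 1.0, 10.0):
    cos, ent = propagate(sigma_qk)
    print(f"sigma_qk={sigma_qk:5.1f}   mean cosine similarity={cos:.3f}   mean attention entropy={ent:.3f}")
```

In this toy setting, the low-variance run pushes the cosine similarity towards one while the attention stays near uniform (entropy near ln n), and the high-variance run drives the attention entropy towards zero; the paper's theory makes this picture quantitative for the full block, including the critical residual strength needed to preserve signal in the low-variance regime.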
Related papers
- Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers [3.686808512438363]
Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. We conduct a rigorous analysis to uncover a spectral gap between the two largest singular values of the attention matrix. We propose a novel, simple, practical solution to rank collapse in width by removing the outlier singular value(s).
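As a rough illustration of the quantity this analysis centres on, the sketch below builds a single softmax attention matrix at a generic random initialisation (the weight scale and sizes are arbitrary choices, not the paper's setup) and reports the gap between its two largest singular values.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 64, 32                                     # sequence length and width, arbitrary
X = rng.normal(size=(n, d))
Wq = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
Wk = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))   # row-stochastic attention matrix

s = np.linalg.svd(A, compute_uv=False)            # singular values, largest first
print("two largest singular values:", s[0].round(3), s[1].round(3))
print("spectral gap:", (s[0] - s[1]).round(3))
```

A leading singular value that sits far above the rest is the outlier referred to in the summary; the proposed fix removes it from the attention matrix's spectrum.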
arXiv Detail & Related papers (2024-10-10T10:34:18Z) - The Benefit of Being Bayesian in Online Conformal Prediction [7.713245413733777]
We study the online construction of confidence sets given a black-box machine learning model. By converting the target confidence levels into quantile levels, the problem can be reduced to predicting the quantiles of a sequentially revealed data sequence.
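The reduction described here, from target confidence levels to quantile prediction on a revealed score sequence, can be illustrated with a generic online quantile tracker driven by the pinball-loss gradient. This is only the reduction, not the paper's Bayesian method; the step size and score distribution below are arbitrary.

```python
import numpy as np

def online_quantile(scores, tau=0.9, eta=0.05):
    """Track the tau-quantile of a streaming score sequence via pinball-loss gradient steps."""
    q, history = 0.0, []
    for s in scores:
        q += eta * (tau - float(s < q))   # move up if coverage is below tau, down otherwise
        history.append(q)
    return np.array(history)

rng = np.random.default_rng(0)
stream = rng.normal(size=5000)            # stand-in for nonconformity scores
tracked = online_quantile(stream, tau=0.9)
print("tracked 0.9-quantile  :", round(tracked[-1], 3))
print("empirical 0.9-quantile:", round(np.quantile(stream, 0.9), 3))
```

In conformal prediction, the tracked quantile of the nonconformity scores then defines the confidence set issued at each step.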
arXiv Detail & Related papers (2024-10-03T15:04:47Z) - Grokking as the Transition from Lazy to Rich Training Dynamics [35.186196991224286]
Grokking occurs when the train loss of a neural network decreases much earlier than its test loss.
Key determinants of grokking are the rate of feature learning and the alignment of the initial features with the target function.
arXiv Detail & Related papers (2023-10-09T19:33:21Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Outlier-robust neural network training: variation regularization meets trimmed loss to prevent functional breakdown [2.5628953713168685]
We tackle the challenge of outlier-robust predictive modeling using highly expressive neural networks. Our approach integrates two key components: (1) a transformed trimmed loss (TTL), and (2) higher-order variation regularization (HOVR), which imposes smoothness constraints on the prediction function.
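The trimming half of this recipe is easy to sketch: discard the largest per-sample losses so that gross outliers cannot dominate the objective. The snippet below shows only that trimming step; the paper's transformation and the HOVR smoothness penalty are omitted.

```python
import numpy as np

def trimmed_mse(y_true, y_pred, keep=0.9):
    """Average only the smallest `keep` fraction of squared errors (a trimmed loss)."""
    errs = np.sort((y_true - y_pred) ** 2)
    h = max(1, int(keep * len(errs)))
    return errs[:h].mean()

rng = np.random.default_rng(0)
y = rng.normal(size=200)
pred = y + rng.normal(scale=0.1, size=200)
pred[:10] += 20.0                          # simulate a few gross outliers
print("plain MSE  :", round(np.mean((y - pred) ** 2), 4))
print("trimmed MSE:", round(trimmed_mse(y, pred, keep=0.9), 4))
```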
arXiv Detail & Related papers (2023-08-04T12:57:13Z) - Trained Transformers Learn Linear Models In-Context [39.56636898650966]
Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL).
We show that when transformers are trained over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares.
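The ordinary-least-squares baseline the trained models are compared against is straightforward to construct: fit OLS on the in-context examples of a random linear-regression instance and predict the query. No transformer appears below; the dimensions and the noiseless targets are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_context = 8, 32
w = rng.normal(size=d)                 # random ground-truth task
X_ctx = rng.normal(size=(n_context, d))
y_ctx = X_ctx @ w                      # noiseless in-context examples for simplicity
x_query = rng.normal(size=d)

w_ols, *_ = np.linalg.lstsq(X_ctx, y_ctx, rcond=None)   # OLS fit on the context
print("OLS prediction:", round(float(x_query @ w_ols), 4))
print("ground truth  :", round(float(x_query @ w), 4))
```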
arXiv Detail & Related papers (2023-06-16T15:50:03Z) - Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
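A minimal form of stochastic gradient Langevin dynamics with without-replacement minibatching, applied here to a simple least-squares problem, could look as follows; the step size, temperature and loss are illustrative choices rather than the paper's setting.

```python
import numpy as np

def sgld_without_replacement(X, y, n_epochs=50, batch=32, eta=1e-2, temp=1e-4, seed=0):
    """SGLD on a least-squares loss; each epoch is one shuffled, without-replacement pass."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        perm = rng.permutation(len(X))                       # without-replacement minibatching
        for start in range(0, len(X), batch):
            idx = perm[start:start + batch]
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
            noise = rng.normal(scale=np.sqrt(2 * eta * temp), size=theta.shape)
            theta += -eta * grad + noise                     # Langevin step: drift plus injected noise
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)
print("recovered:", sgld_without_replacement(X, y).round(2))
print("true     :", w_true.round(2))
```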
arXiv Detail & Related papers (2023-06-06T09:12:49Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
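$\sigma$Reparam rescales each weight matrix by its spectral norm times a learnable scalar, which keeps the attention logits, and hence the attention entropy, under control. A minimal static numpy sketch (in practice the scalar gamma is learned and the power iteration is amortised across training steps):

```python
import numpy as np

def spectral_norm(W, n_iter=50, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

def sigma_reparam(W, gamma=1.0):
    """Reparameterised weight: W_hat = (gamma / sigma(W)) * W, so its spectral norm equals gamma."""
    return gamma * W / spectral_norm(W)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_hat = sigma_reparam(W, gamma=1.0)
print("spectral norm before:", round(spectral_norm(W), 2), " after:", round(spectral_norm(W_hat), 2))
```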
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - Characterization of anomalous diffusion through convolutional transformers [0.8984888893275713]
We propose a new transformer based neural network architecture for the characterization of anomalous diffusion.
Our new architecture, the Convolutional Transformer (ConvTransformer), uses a bi-layered convolutional neural network to extract features from our diffusive trajectories.
We show that the ConvTransformer is able to outperform the previous state of the art at determining the underlying diffusive regime in short trajectories.
arXiv Detail & Related papers (2022-10-10T18:53:13Z) - A heteroencoder architecture for prediction of failure locations in porous metals using variational inference [1.2722697496405462]
We employ an encoder-decoder convolutional neural network to predict the failure locations of porous metal tension specimens.
The objective of predicting failure locations presents an extreme case of class imbalance since most of the material in the specimens does not fail.
We demonstrate that the resulting predicted variances are effective in ranking the locations that are most likely to fail in any given specimen.
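The variance-based ranking mentioned here can be illustrated with a generic heteroscedastic output: a network predicts a mean and a log-variance per location, is trained with a Gaussian negative log-likelihood, and locations are then ranked by predicted variance. The fabricated predictions below only demonstrate the loss and the ranking step, not the paper's heteroencoder or its variational-inference training.

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Per-location Gaussian negative log-likelihood with a predicted mean and log-variance."""
    return 0.5 * (log_var + (y - mu) ** 2 / np.exp(log_var))

rng = np.random.default_rng(0)
n_locations = 100
y = rng.normal(size=n_locations)          # hypothetical per-location target
mu = rng.normal(size=n_locations)         # stand-in for the network's predicted means
log_var = rng.normal(size=n_locations)    # stand-in for the network's predicted log-variances

print("mean NLL over locations:", round(gaussian_nll(y, mu, log_var).mean(), 3))
# Rank locations from most to least uncertain, i.e. by predicted variance.
print("five most uncertain locations:", np.argsort(-log_var)[:5])
```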
arXiv Detail & Related papers (2022-01-31T20:26:53Z) - Distribution Mismatch Correction for Improved Robustness in Deep Neural Networks [86.42889611784855]
Normalization methods increase vulnerability to noise and input corruptions.
We propose an unsupervised non-parametric distribution correction method that adapts the activation distribution of each layer.
In our experiments, we empirically show that the proposed method effectively reduces the impact of intense image corruptions.
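As a crude stand-in for adapting a layer's activation distribution, one can re-standardise the activations observed under corrupted inputs to reference statistics collected on clean data. The paper's correction is unsupervised and non-parametric, so the moment matching below only conveys the general mechanism.

```python
import numpy as np

def match_reference_stats(act_corrupted, act_reference, eps=1e-8):
    """Shift and rescale per-unit activations so their mean and std match clean-data statistics."""
    mu_c, sd_c = act_corrupted.mean(axis=0), act_corrupted.std(axis=0) + eps
    mu_r, sd_r = act_reference.mean(axis=0), act_reference.std(axis=0) + eps
    return (act_corrupted - mu_c) / sd_c * sd_r + mu_r

rng = np.random.default_rng(0)
act_clean = rng.normal(loc=0.0, scale=1.0, size=(512, 128))   # layer activations on clean inputs
act_corr = rng.normal(loc=0.7, scale=2.5, size=(512, 128))    # shifted and rescaled under corruption
act_fixed = match_reference_stats(act_corr, act_clean)
print("corrected mean / std:", round(act_fixed.mean(), 3), "/", round(act_fixed.std(), 3))
```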
arXiv Detail & Related papers (2021-10-05T11:36:25Z) - Biologically Plausible Training Mechanisms for Self-Supervised Learning in Deep Networks [14.685237010856953]
We develop biologically plausible training mechanisms for self-supervised learning (SSL) in deep networks.
We show that learning can be performed with one of two more plausible alternatives to backpropagation.
arXiv Detail & Related papers (2021-09-30T12:56:57Z) - Understanding Self-supervised Learning with Dual Deep Networks [74.92916579635336]
We propose a novel framework to understand contrastive self-supervised learning (SSL) methods that employ dual pairs of deep ReLU networks.
We prove that in each SGD update of SimCLR with various loss functions, the weights at each layer are updated by a covariance operator.
To further study what role the covariance operator plays and which features are learned in such a process, we model data generation and augmentation processes through a hierarchical latent tree model (HLTM).
arXiv Detail & Related papers (2020-10-01T17:51:49Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
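Tracking the Hessian norm does not require forming the Hessian: its largest eigenvalue can be estimated by power iteration on Hessian-vector products, here approximated with finite differences of the gradient. The quadratic toy loss only keeps the sketch self-contained.

```python
import numpy as np

def hessian_norm(grad_fn, theta, n_iter=50, eps=1e-4, seed=0):
    """Estimate the Hessian's largest eigenvalue by power iteration on
    finite-difference Hessian-vector products Hv ~ (g(theta+eps*v) - g(theta-eps*v)) / (2*eps)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=theta.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(n_iter):
        hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
        lam = float(v @ hv)                     # Rayleigh quotient estimate
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return abs(lam)

# Toy loss L(theta) = 0.5 * theta^T A theta, whose gradient is A @ theta.
rng = np.random.default_rng(1)
M = rng.normal(size=(20, 20))
A = M @ M.T / 20
theta0 = rng.normal(size=20)
print("estimated Hessian norm  :", round(hessian_norm(lambda t: A @ t, theta0), 3))
print("exact largest eigenvalue:", round(np.linalg.eigvalsh(A)[-1], 3))
```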
arXiv Detail & Related papers (2020-04-20T18:12:56Z)