Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers
- URL: http://arxiv.org/abs/2410.07799v2
- Date: Mon, 03 Feb 2025 17:45:29 GMT
- Title: Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers
- Authors: Alireza Naderi, Thiziri Nait Saada, Jared Tanner
- Abstract summary: Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow.
We conduct a rigorous analysis that uncovers a spectral gap between the two largest singular values of the attention matrix.
We propose a novel yet simple practical solution to rank collapse in width by removing the outlier eigenvalue(s).
- Score: 3.686808512438363
- License:
- Abstract: Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. Even at initialisation, it remains poorly understood why the propagation of signals and gradients through these random networks can be pathological, resulting in issues known as (i) vanishing/exploding gradients and (ii) rank collapse $\textit{in depth}$, i.e. when all tokens converge to a single representation along layers. While rank collapse in depth naturally arises from repeated matrix multiplications–a common pattern across various architectures–we identify an additional and previously unknown challenge unique to softmax attention layers: (iii) rank collapse $\textit{in width}$, which occurs as the context length increases. Using Random Matrix Theory, we conduct a rigorous analysis that uncovers a spectral gap between the two largest singular values of the attention matrix as the cause of (iii), which in turn exacerbates (i) and (ii). Building on this insight, we propose a novel yet simple practical solution to mitigate rank collapse in width by removing the outlier eigenvalue(s). Our theoretical framework offers a fresh perspective on recent practical studies, such as (Ye et al., 2024; Ali et al., 2023), whose ad hoc solutions can now be interpreted as implicit efforts to address the spectral gap issue. This work provides valuable theoretical support for ongoing large-scale empirical research, bringing theory and practice one step closer in the understanding of transformers.
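To make the spectral gap concrete, the sketch below builds a single softmax attention matrix from random (untrained) tokens, queries and keys, prints its two largest singular values, and then removes the dominant rank-one (uniform-attention) component. It is a minimal numpy illustration under assumed toy dimensions; subtracting the uniform matrix (1/T) 1 1^T is one simple way to remove the outlier and is not necessarily the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 256, 64                     # context length and head dimension (toy values)

# Random tokens and query/key weights, mimicking initialisation.
X = rng.standard_normal((T, d))
Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((d, d)) / np.sqrt(d)

# Row-stochastic softmax attention matrix A (T x T).
logits = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
A = np.exp(logits - logits.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

s = np.linalg.svd(A, compute_uv=False)
print(f"sigma_1 = {s[0]:.3f}, sigma_2 = {s[1]:.3f}")        # large gap expected

# One simple way to remove the outlier: subtract the dominant
# rank-one (uniform-attention) component (1/T) * 1 1^T.
A_no_outlier = A - np.ones((T, T)) / T
s2 = np.linalg.svd(A_no_outlier, compute_uv=False)
print(f"after removal: sigma_1 = {s2[0]:.3f}, sigma_2 = {s2[1]:.3f}")  # comparable sizes
```

With these toy values the first singular value sits near 1 while the second is much smaller (roughly of order 1/sqrt(T)); after the subtraction the leading singular values are of comparable size.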
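Rank collapse in depth, attributed above to repeated matrix multiplications, can likewise be observed numerically: stacking attention-only layers at initialisation drives all token representations towards a single direction. The toy sketch below omits residual connections, LayerNorm and MLPs, and uses the relative distance to the best rank-one approximation as an illustrative diagnostic rather than a metric taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, depth = 64, 32, 20           # tokens, width, number of attention layers (toy values)

def attention_layer(X, rng):
    """One softmax self-attention layer with random Q/K/V weights."""
    dim = X.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    logits = (X @ Wq) @ (X @ Wk).T / np.sqrt(dim)
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ X @ Wv

def distance_to_rank_one(X):
    """Relative residual of the best rank-one approximation of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt((s[1:] ** 2).sum() / (s ** 2).sum())

X = rng.standard_normal((T, d))
for layer in range(1, depth + 1):
    X = attention_layer(X, rng)
    if layer % 5 == 0:
        print(f"layer {layer:2d}: distance to rank one = {distance_to_rank_one(X):.2e}")
```

The printed distance typically decays rapidly with depth, i.e. the token matrix approaches rank one.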
Related papers
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - On the Role of Attention Masks and LayerNorm in Transformers [55.81177251872377]
Self-attention is the key mechanism of transformers.
Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse.
arXiv Detail & Related papers (2024-05-29T05:41:28Z) - A Unified Algebraic Perspective on Lipschitz Neural Networks [88.14073994459586]
This paper introduces a novel perspective unifying various types of 1-Lipschitz neural networks.
We show that many existing techniques can be derived and generalized via finding analytical solutions of a common semidefinite programming (SDP) condition.
Our approach, called SDP-based Lipschitz Layers (SLL), allows us to design non-trivial yet efficient generalizations of convex potential layers.
arXiv Detail & Related papers (2023-03-06T14:31:09Z) - Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with quadratic loss function, fully connected feedforward architecture, ReLU activations, Gaussian data instances, adversarial labels.
Our results strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the NTK regime.
arXiv Detail & Related papers (2022-12-05T14:47:52Z) - Scaling ResNets in the Large-depth Regime [11.374578778690623]
Deep ResNets are recognized for achieving state-of-the-art results in machine learning tasks.
Deep ResNets rely on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients.
No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$ (a toy sketch of this scaling appears after this list).
arXiv Detail & Related papers (2022-06-14T15:49:10Z) - Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse [11.486545294602697]
We shed new light on the causes and effects of rank collapse in Transformers.
We show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish.
arXiv Detail & Related papers (2022-06-07T09:07:24Z) - Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse.
arXiv Detail & Related papers (2022-02-17T12:20:52Z) - Unified Field Theory for Deep and Recurrent Neural Networks [56.735884560668985]
We present a unified and systematic derivation of the mean-field theory for both recurrent and deep networks.
We find that convergence towards the mean-field theory is typically slower for recurrent networks than for deep networks.
Our method exposes that Gaussian processes are but the lowest order of a systematic expansion in $1/n$.
arXiv Detail & Related papers (2021-12-10T15:06:11Z) - A Geometric Analysis of Neural Collapse with Unconstrained Features [40.66585948844492]
We provide the first global optimization landscape analysis of $\textit{Neural Collapse}$.
This phenomenon arises in the last-layer classifiers and features of neural networks during the terminal phase of training.
arXiv Detail & Related papers (2021-05-06T00:00:50Z) - Batch Normalization Provably Avoids Rank Collapse for Randomly Initialised Deep Networks [15.499554384036673]
Batch normalization is an effective strategy to avoid rank collapse for both linear and ReLU networks.
We derive a meaningful lower rank bound in deep linear networks.
Empirically, we also demonstrate that this rank robustness generalizes to ReLU nets (a toy numerical check appears after this list).
arXiv Detail & Related papers (2020-03-03T17:21:07Z)
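Regarding the "Scaling ResNets in the Large-depth Regime" entry above, the effect of scaling each residual branch by a factor $\alpha_L$ can be probed with a small numpy sketch. The block form x <- x + alpha * ReLU(W x) and the choice $\alpha_L = 1/\sqrt{L}$ are assumptions made for illustration, not that paper's exact architecture or prescription.

```python
import numpy as np

width, depth = 128, 200            # toy width and depth

def forward(x, alpha, rng):
    """Residual stack x <- x + alpha * ReLU(W x) with random weights."""
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        x = x + alpha * np.maximum(W @ x, 0.0)
    return x

x0 = np.random.default_rng(2).standard_normal(width)
for alpha, label in [(1.0, "alpha_L = 1"), (1.0 / np.sqrt(depth), "alpha_L = 1/sqrt(L)")]:
    out = forward(x0.copy(), alpha, np.random.default_rng(3))   # same weights for both runs
    print(f"{label:20s}: ||out|| / ||in|| = {np.linalg.norm(out) / np.linalg.norm(x0):.2e}")
```

With $\alpha_L = 1$ the output norm typically grows explosively with depth, whereas the $1/\sqrt{L}$ scaling keeps it of the same order as the input.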
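Similarly, for the "Batch Normalization Provably Avoids Rank Collapse" entry, the sketch below compares the stable rank (squared Frobenius norm over squared spectral norm) of representations in a deep random linear network with and without a plain, non-learnable batch-normalisation step; the width, depth and the stable-rank diagnostic are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, width, depth = 512, 64, 50      # batch size, width, depth (toy values)

def stable_rank(H):
    """Squared Frobenius norm over squared spectral norm."""
    s = np.linalg.svd(H, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2

def batch_norm(H, eps=1e-5):
    """Per-feature standardisation over the batch (no learnable scale/shift)."""
    return (H - H.mean(axis=0)) / np.sqrt(H.var(axis=0) + eps)

X = rng.standard_normal((n, width))
H_plain, H_bn = X.copy(), X.copy()
for _ in range(depth):
    W = rng.standard_normal((width, width)) / np.sqrt(width)
    H_plain = H_plain @ W             # plain deep linear network
    H_bn = batch_norm(H_bn @ W)       # same layer followed by batch normalisation

print(f"stable rank without BN: {stable_rank(H_plain):6.2f}")
print(f"stable rank with BN:    {stable_rank(H_bn):6.2f}")
```

Without normalisation the stable rank typically collapses towards one as depth grows, while the batch-normalised network retains a substantially larger stable rank, consistent with the lower rank bound that paper derives for deep linear networks.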