Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Transformers
- URL: http://arxiv.org/abs/2410.07799v1
- Date: Thu, 10 Oct 2024 10:34:18 GMT
- Title: Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Transformers
- Authors: Alireza Naderi, Thiziri Nait Saada, Jared Tanner,
- Abstract summary: This paper examines signal propagation in textitattention-only transformers from a random matrix perspective.
We show that a textitspectral gap between the two largest singular values of the attention matrix causes rank collapse in width.
We propose a novel, yet simple, practical solution to resolve rank collapse in width by removing the spectral gap.
- Score: 3.686808512438363
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. However, \softmaxx-based attention puts transformers' trainability at risk. Even \textit{at initialisation}, the propagation of signals and gradients through the random network can be pathological, resulting in known issues such as (i) vanishing/exploding gradients and (ii) \textit{rank collapse}, i.e. when all tokens converge to a single representation \textit{with depth}. This paper examines signal propagation in \textit{attention-only} transformers from a random matrix perspective, illuminating the origin of such issues, as well as unveiling a new phenomenon -- (iii) rank collapse \textit{in width}. Modelling \softmaxx-based attention at initialisation with Random Markov matrices, our theoretical analysis reveals that a \textit{spectral gap} between the two largest singular values of the attention matrix causes (iii), which, in turn, exacerbates (i) and (ii). Building on this insight, we propose a novel, yet simple, practical solution to resolve rank collapse in width by removing the spectral gap. Moreover, we validate our findings and discuss the training benefits of the proposed fix through experiments that also motivate a revision of some of the default parameter scaling. Our attention model accurately describes the standard key-query attention in a single-layer transformer, making this work a significant first step towards a better understanding of the initialisation dynamics in the multi-layer case.
Related papers
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - Unveiling and Controlling Anomalous Attention Distribution in Transformers [8.456319173083315]
Waiver phenomenon allows elements to absorb excess attention without affecting their contribution to information.
In specific models, due to differences in positional encoding and attention patterns, we have found that the selection of waiver elements by the model can be categorized into two methods.
arXiv Detail & Related papers (2024-06-26T11:53:35Z) - Local to Global: Learning Dynamics and Effect of Initialization for Transformers [20.02103237675619]
We focus on first-order Markov chains and single-layer transformers.
We prove that transformer parameters trained on next-token prediction loss can either converge to global or local minima.
arXiv Detail & Related papers (2024-06-05T08:57:41Z) - On the Role of Attention Masks and LayerNorm in Transformers [55.81177251872377]
Self-attention is the key mechanism of transformers.
Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse.
arXiv Detail & Related papers (2024-05-29T05:41:28Z) - Texture, Shape and Order Matter: A New Transformer Design for Sequential DeepFake Detection [57.100891917805086]
Sequential DeepFake detection is an emerging task that predicts the manipulation sequence in order.
This paper describes a new Transformer design, called TSOM, by exploring three perspectives: Texture, Shape, and Order of Manipulations.
arXiv Detail & Related papers (2024-04-22T04:47:52Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - A Unified Algebraic Perspective on Lipschitz Neural Networks [88.14073994459586]
This paper introduces a novel perspective unifying various types of 1-Lipschitz neural networks.
We show that many existing techniques can be derived and generalized via finding analytical solutions of a common semidefinite programming (SDP) condition.
Our approach, called SDP-based Lipschitz Layers (SLL), allows us to design non-trivial yet efficient generalization of convex potential layers.
arXiv Detail & Related papers (2023-03-06T14:31:09Z) - Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with quadratic loss function, fully connected feedforward architecture, RelU activations, Gaussian data instances, adversarial labels.
They strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the NTK regime''
arXiv Detail & Related papers (2022-12-05T14:47:52Z) - Scaling ResNets in the Large-depth Regime [11.374578778690623]
Deep ResNets are recognized for achieving state-of-the-art results in machine learning tasks.
Deep ResNets rely on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients.
No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $alpha_L$.
arXiv Detail & Related papers (2022-06-14T15:49:10Z) - Signal Propagation in Transformers: Theoretical Perspectives and the
Role of Rank Collapse [11.486545294602697]
We shed new light on the causes and effects of rank collapse in Transformers.
We show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish.
arXiv Detail & Related papers (2022-06-07T09:07:24Z) - Unraveling Attention via Convex Duality: Analysis and Interpretations of
Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens, based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z) - Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently over-smoothing phenomenon of Transformer-based models is observed in both vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse.
arXiv Detail & Related papers (2022-02-17T12:20:52Z) - Unified Field Theory for Deep and Recurrent Neural Networks [56.735884560668985]
We present a unified and systematic derivation of the mean-field theory for both recurrent and deep networks.
We find that convergence towards the mean-field theory is typically slower for recurrent networks than for deep networks.
Our method exposes that Gaussian processes are but the lowest order of a systematic expansion in $1/n$.
arXiv Detail & Related papers (2021-12-10T15:06:11Z) - A Fourier-based Framework for Domain Generalization [82.54650565298418]
Domain generalization aims at tackling this problem by learning transferable knowledge from multiple source domains in order to generalize to unseen target domains.
This paper introduces a novel Fourier-based perspective for domain generalization.
Experiments on three benchmarks have demonstrated that the proposed method is able to achieve state-of-the-arts performance for domain generalization.
arXiv Detail & Related papers (2021-05-24T06:50:30Z) - A Geometric Analysis of Neural Collapse with Unconstrained Features [40.66585948844492]
We provide the first global optimization landscape analysis of $Neural;Collapse$.
This phenomenon arises in the last-layer classifiers and features of neural networks during the terminal phase of training.
arXiv Detail & Related papers (2021-05-06T00:00:50Z) - Masked Language Modeling for Proteins via Linearly Scalable Long-Context
Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR)
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z) - Batch Normalization Provably Avoids Rank Collapse for Randomly
Initialised Deep Networks [15.499554384036673]
Batch normalization is an effective strategy to avoid rank collapse for both linear and ReLU networks.
We derive a meaningful lower rank bound in deep linear networks.
Empirically, we also demonstrate that this rank robustness generalizes to ReLU nets.
arXiv Detail & Related papers (2020-03-03T17:21:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.