Signal Propagation in Transformers: Theoretical Perspectives and the
Role of Rank Collapse
- URL: http://arxiv.org/abs/2206.03126v1
- Date: Tue, 7 Jun 2022 09:07:24 GMT
- Title: Signal Propagation in Transformers: Theoretical Perspectives and the
Role of Rank Collapse
- Authors: Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto,
Sidak Pal Singh, Aurelien Lucchi
- Abstract summary: We shed new light on the causes and effects of rank collapse in Transformers.
We show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish.
- Score: 11.486545294602697
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have achieved remarkable success in several domains, ranging
from natural language processing to computer vision. Nevertheless, it has been
recently shown that stacking self-attention layers - the distinctive
architectural component of Transformers - can result in rank collapse of the
tokens' representations at initialization. The question of whether and how rank
tokens' representations at initialization. The question of if and how rank
collapse affects training is still largely unanswered, and its investigation is
necessary for a more comprehensive understanding of this architecture. In this
work, we shed new light on the causes and the effects of this phenomenon.
First, we show that rank collapse of the tokens' representations hinders
training by causing the gradients of the queries and keys to vanish at
initialization. Furthermore, we provide a thorough description of the origin of
rank collapse and discuss how to prevent it via an appropriate depth-dependent
scaling of the residual branches. Finally, our analysis unveils that specific
architectural hyperparameters affect the gradients of queries and values
differently, leading to disproportionate gradient norms. This suggests an
explanation for the widespread use of adaptive methods for Transformers'
optimization.
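
To make these claims concrete, here is a toy NumPy sketch (an illustration, not the paper's exact setup: the width, depth, 1/sqrt(2L) branch scale, and collapse metric are all assumptions). It stacks residual self-attention layers at initialization and tracks how far the token representations sit from their common mean; rank collapse drives this quantity toward zero, while a depth-dependent scaling of the residual branches keeps it stable.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, d):
    # Single-head self-attention with freshly sampled weights at initialization.
    Wq, Wk, Wv = rng.normal(0, d ** -0.5, (3, d, d))
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))
    return A @ X @ Wv

def distance_from_mean(X):
    # Relative distance of the tokens from their mean: -> 0 under rank collapse.
    return np.linalg.norm(X - X.mean(axis=0)) / np.linalg.norm(X)

n, d, L = 32, 64, 24
for scale in (1.0, 1.0 / np.sqrt(2 * L)):     # unscaled vs depth-scaled branches
    X = rng.normal(size=(n, d))
    for _ in range(L):
        X = X + scale * attention(X, d)       # residual branch, scaled
    print(f"branch scale {scale:.3f}: distance from mean = {distance_from_mean(X):.4f}")
```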
Related papers
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show fast convergence of gradient flow on the regression loss despite the non-convexity of the loss landscape.
This is the first theoretical analysis for multi-layer Transformer in this setting.
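
As a rough sketch of the object under analysis (an assumption-level illustration, not the paper's construction): each pass through the looped block is interpreted as one gradient step on the in-context least-squares loss, so T loops emulate T steps of gradient descent.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5
X = rng.normal(size=(n, d))                 # in-context inputs
y = X @ rng.normal(size=d)                  # in-context targets

w, eta = np.zeros(d), 0.3
for t in range(10):                         # one loop iteration == one GD step
    w -= eta * X.T @ (X @ w - y) / n
    print(f"loop {t}: in-context loss = {np.mean((X @ w - y) ** 2) / 2:.6f}")
```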
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
- Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Transformers [3.686808512438363]
This paper examines signal propagation in attention-only transformers from a random matrix perspective.
We show that a spectral gap between the two largest singular values of the attention matrix causes rank collapse in width.
We propose a novel, yet simple, practical solution to resolve rank collapse in width by removing the spectral gap.
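
A quick NumPy check of the spectral-gap claim (illustrative; the centering below is one simple way to remove the uniform component, not necessarily the paper's proposed fix): the top singular value of a random softmax attention matrix sits well above the rest because every row sums to one, and subtracting the uniform rank-1 part closes the gap.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 256
A = softmax(rng.normal(size=(n, n)))       # random softmax attention matrix
s = np.linalg.svd(A, compute_uv=False)
print("sigma_1, sigma_2:", s[0], s[1])     # large spectral gap: sigma_1 >> sigma_2

A_c = A - np.ones((n, n)) / n              # remove the uniform rank-1 component
s_c = np.linalg.svd(A_c, compute_uv=False)
print("after centering:", s_c[0], s_c[1])  # gap closed
```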
arXiv Detail & Related papers (2024-10-10T10:34:18Z)
- How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression [19.64743851296488]
In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning.
We experimentally discover that the utilization of multi-heads exhibits different patterns across layers.
We demonstrate that such a preprocess-then-optimize algorithm can significantly outperform naive gradient descent and ridge regression algorithms.
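
For orientation, a small NumPy sketch of the sparse-regression setup and the two reference baselines (the transformer's learned preprocess-then-optimize procedure itself is not reproduced here; sizes and hyperparameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 30, 50, 3                        # under-determined, k-sparse ground truth
X = rng.normal(size=(n, d))
w_star = np.zeros(d)
w_star[rng.choice(d, k, replace=False)] = rng.normal(size=k)
y = X @ w_star

# Baseline 1: naive gradient descent on least squares.
w_gd = np.zeros(d)
for _ in range(500):
    w_gd -= 0.01 * X.T @ (X @ w_gd - y) / n

# Baseline 2: ridge regression.
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

for name, w in (("GD   ", w_gd), ("ridge", w_ridge)):
    print(name, "parameter error:", np.linalg.norm(w - w_star))
```

Neither baseline exploits the sparsity of the signal, which is the gap the learned preprocess-then-optimize algorithm is reported to close.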
arXiv Detail & Related papers (2024-08-08T15:33:02Z)
- How Transformers Learn Causal Structure with Gradient Descent [44.31729147722701]
Self-attention allows transformers to encode causal structure.
We introduce an in-context learning task that requires learning latent causal structure.
We show that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.
arXiv Detail & Related papers (2024-02-22T17:47:03Z)
- Centered Self-Attention Layers [89.21791761168032]
The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied.
We show that this application inevitably leads to oversmoothing, i.e., to similar representations at the deeper layers.
We present a correction term to the aggregating operator of these mechanisms.
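
A toy NumPy sketch of the oversmoothing contrast (the centering used here is one plausible form of such a correction term, assumed for illustration rather than taken from the paper): stacked plain attention drives all tokens toward their mean, while the corrected operator annihilates the shared component and keeps representations distinct.

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer(X, centered):
    n, d = X.shape
    Wq, Wk = rng.normal(0, d ** -0.5, (2, d, d))
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))
    if centered:
        A = A - 1.0 / n                    # correction: remove the averaging part
    return A @ X

for centered in (False, True):
    X = rng.normal(size=(16, 32))
    for _ in range(12):
        X = layer(X, centered)
        X *= np.sqrt(X.size) / np.linalg.norm(X)    # keep a fixed overall scale
    spread = np.linalg.norm(X - X.mean(axis=0)) / np.linalg.norm(X)
    print("centered" if centered else "plain   ", "token spread:", round(float(spread), 4))
```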
arXiv Detail & Related papers (2023-06-02T15:19:08Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., learn models by gradient descent in their forward pass.
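
The identity behind this mesa-optimizer view, as a minimal NumPy sketch (the known construction for in-context linear regression; the learning rate and token layout are illustrative assumptions): one gradient step from w = 0 on the in-context least-squares loss yields exactly the prediction of an unnormalized linear-attention layer whose keys are the context inputs and whose values are the targets.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 32, 8
X = rng.normal(size=(n, d))                # in-context inputs x_1..x_n
y = X @ rng.normal(size=d)                 # in-context targets y_1..y_n
x_q = rng.normal(size=d)                   # query input

eta = 0.1
# One GD step from w = 0 on L(w) = sum_i (x_i^T w - y_i)^2 / 2:
w1 = eta * X.T @ y

# The same prediction written as unnormalized linear attention
# (query x_q, keys x_i, values y_i):
pred_attn = eta * np.sum((X @ x_q) * y)

print(np.allclose(x_q @ w1, pred_attn))    # True: one forward pass == one GD step
```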
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- Unveiling Transformers with LEGO: a synthetic reasoning task [23.535488809197787]
We study how the transformer architecture learns to follow a chain of reasoning.
In some data regimes the trained transformer finds "shortcut" solutions to follow the chain of reasoning.
We find that one can prevent such shortcuts with appropriate architecture modifications or careful data preparation.
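
For flavor, a tiny generator for this kind of reasoning chain (the exact token format of the LEGO task is an assumption here): each variable is +/- the previous one, and resolving the chain requires propagating values step by step.

```python
import random

random.seed(0)

def lego_chain(length):
    # A chain of +/-1 assignments: v0 = +1; v1 = -v0; v2 = +v1; ...
    signs = [random.choice([+1, -1]) for _ in range(length)]
    clauses = [f"v0 = {'+' if signs[0] > 0 else '-'}1"]
    values = {"v0": signs[0]}
    for i in range(1, length):
        clauses.append(f"v{i} = {'+' if signs[i] > 0 else '-'}v{i - 1}")
        values[f"v{i}"] = signs[i] * values[f"v{i - 1}"]
    return "; ".join(clauses), values

chain, values = lego_chain(5)
print(chain)     # e.g. v0 = +1; v1 = -v0; v2 = +v1; ...
print(values)    # ground-truth value of every variable
```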
arXiv Detail & Related papers (2022-06-09T06:30:17Z)
- Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse.
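
A minimal PyTorch sketch of one plausible adaptive fusion scheme (the softmax gating is an assumption for illustration, not necessarily the paper's strategy): a learned weight per layer mixes all intermediate representations into the output.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Adaptively combine the representations produced by every layer."""
    def __init__(self, num_layers):
        super().__init__()
        self.gates = nn.Parameter(torch.zeros(num_layers))   # one logit per layer

    def forward(self, layer_outputs):          # list of (batch, seq, dim) tensors
        w = torch.softmax(self.gates, dim=0)   # learned, adaptive mixing weights
        return sum(wi * h for wi, h in zip(w, layer_outputs))

fusion = HierarchicalFusion(num_layers=4)
hs = [torch.randn(2, 10, 16) for _ in range(4)]
print(fusion(hs).shape)                        # torch.Size([2, 10, 16])
```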
arXiv Detail & Related papers (2022-02-17T12:20:52Z)
- XAI for Transformers: Better Explanations through Conservative Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
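
A minimal PyTorch sketch of the implementation trick such conservative rules typically reduce to (an illustration only; the paper's full method also handles LayerNorm and other components): detach the attention weights so that gradient x input flows only through the value path and the total relevance is conserved.

```python
import torch

def attention_detached(q, k, v):
    # Treat the attention weights as constants during backprop: detach()
    # removes them from the graph, so gradient x input relevance flows
    # only through the value path.
    a = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return a.detach() @ v

x = torch.randn(10, 16, requires_grad=True)
out = attention_detached(x, x, x).sum()
out.backward()
relevance = x * x.grad                       # gradient x input
print(relevance.sum().item(), out.item())    # equal: relevance is conserved
```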
arXiv Detail & Related papers (2022-02-15T10:47:11Z)
- Extreme Memorization via Scale of Initialization [72.78162454173803]
We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD.
We find that the extent and manner in which generalization ability is affected depends on the activation and loss function used.
In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function.
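
The knob in question, as a short PyTorch sketch (the architecture and the value of alpha are arbitrary illustrations): rescale every weight at initialization and study how SGD's implicit regularization changes as alpha grows.

```python
import torch.nn as nn

def scale_initialization(model, alpha):
    # Multiply every parameter at initialization by alpha.
    for p in model.parameters():
        p.data.mul_(alpha)

net = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 10))
scale_initialization(net, alpha=8.0)   # large alpha: the regime tied to memorization
```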
arXiv Detail & Related papers (2020-08-31T04:53:11Z)