Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
- URL: http://arxiv.org/abs/2103.03404v2
- Date: Tue, 1 Aug 2023 14:27:08 GMT
- Title: Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
- Authors: Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas
- Abstract summary: This work proposes a new way to understand self-attention networks.
We show that their output can be decomposed into a sum of smaller terms.
We prove that self-attention possesses a strong inductive bias towards "token uniformity".
- Score: 48.16156149749371
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based architectures have become ubiquitous in machine learning, yet
our understanding of the reasons for their effectiveness remains limited. This
work proposes a new way to understand self-attention networks: we show that
their output can be decomposed into a sum of smaller terms, each involving the
operation of a sequence of attention heads across layers. Using this
decomposition, we prove that self-attention possesses a strong inductive bias
towards "token uniformity". Specifically, without skip connections or
multi-layer perceptrons (MLPs), the output converges doubly exponentially to a
rank-1 matrix. On the other hand, skip connections and MLPs stop the output
from degenerating. Our experiments verify the identified convergence phenomena
on different variants of standard transformer architectures.
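As a quick numerical check of the rank-collapse claim, the minimal NumPy sketch below (illustrative only, not the authors' code; the sequence length, width, depth, and random single-head parameterization are arbitrary assumptions) stacks pure self-attention layers with no skip connections and no MLPs, and tracks the relative Frobenius distance of the token matrix to its best rank-1 approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, depth = 32, 64, 12   # illustrative sizes, chosen arbitrarily

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def relative_rank1_residual(X):
    # ||X - best rank-1 approximation||_F / ||X||_F, computed from singular values.
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt((s[1:] ** 2).sum()) / np.sqrt((s ** 2).sum())

X = rng.standard_normal((n_tokens, d))
print(f"layer  0: relative rank-1 residual {relative_rank1_residual(X):.3e}")

for layer in range(1, depth + 1):
    # One pure single-head self-attention layer: no skip connection, no MLP.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d), axis=-1)  # row-stochastic attention
    X = A @ (X @ Wv)
    print(f"layer {layer:2d}: relative rank-1 residual {relative_rank1_residual(X):.3e}")
```

Under these assumptions the relative residual typically collapses towards zero within a few layers; adding a skip connection (`X = X + A @ (X @ Wv)`) generally keeps it bounded away from zero, in line with the abstract's claim that skip connections prevent degeneration.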
Related papers
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform in-context learning (ICL) on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
- On the Benefits of Rank in Attention Layers [38.651863218241154]
We show that there are dramatic trade-offs between the rank and number of heads of the attention mechanism.
We present experiments with off-the-shelf transformers that validate our findings.
arXiv Detail & Related papers (2024-07-23T03:40:24Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaling attention mechanism that maintains the full representation of the attention matrix without resorting to sparsification.
Results indicate that our attention mechanism has robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- Self-attention Networks Localize When QK-eigenspectrum Concentrates [9.379890125442335]
The self-attention mechanism prevails in modern machine learning.
Two arguments have connected attention localization to model performance.
We show that a small eigenspectrum variance causes attention to localize.
arXiv Detail & Related papers (2024-02-03T09:35:53Z)
- Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z)
- Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head.
By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions.
Results show that mixing attention patterns along the different heads and layers outperforms our baseline by up to 0.7 BLEU.
arXiv Detail & Related papers (2022-05-14T17:37:47Z)
- Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse.
arXiv Detail & Related papers (2022-02-17T12:20:52Z)
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers [78.26411729589526]
We propose the first method to explain predictions made by any Transformer-based architecture.
Our method is superior to all existing methods, which are adapted from single-modality explainability.
arXiv Detail & Related papers (2021-03-29T15:03:11Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than models based on competing architectures across a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
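Since the last entry describes its doubly-normalized attention as simple to implement, here is a generic NumPy sketch of double normalization (a softmax over queries for each key, followed by a renormalization over keys for each query); the paper's exact formulation and guarantees may differ, so treat this as an assumption-labelled illustration rather than the proposed method.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def doubly_normalized_attention(Q, K, V):
    """Generic doubly-normalized attention (illustration, not the paper's exact scheme).

    Step 1: softmax each key's scores over the queries (columns), so every key
            receives a unit of attention mass distributed across the queries.
    Step 2: renormalize over the keys (rows) so each query's weights sum to one.
    """
    logits = Q @ K.T / np.sqrt(Q.shape[-1])                 # (n_queries, n_keys)
    col_norm = softmax(logits, axis=0)                      # normalize over queries, per key
    attn = col_norm / col_norm.sum(axis=1, keepdims=True)   # normalize over keys, per query
    return attn @ V

rng = np.random.default_rng(0)
n, d = 8, 16                                                # illustrative sizes
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = doubly_normalized_attention(Q, K, V)
print(out.shape)                                            # (8, 16)
```

Standard attention applies only the per-query normalization; the extra per-key step gives every key a positive share of attention mass across queries, which is, loosely, the intuition behind avoiding the "explaining away" effect described above.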
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.