Attention is Not All You Need: Pure Attention Loses Rank Doubly
Exponentially with Depth
- URL: http://arxiv.org/abs/2103.03404v2
- Date: Tue, 1 Aug 2023 14:27:08 GMT
- Title: Attention is Not All You Need: Pure Attention Loses Rank Doubly
Exponentially with Depth
- Authors: Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas
- Abstract summary: This work proposes a new way to understand self-attention networks.
We show that their output can be decomposed into a sum of smaller terms.
We prove that self-attention possesses a strong inductive bias towards "token uniformity".
- Score: 48.16156149749371
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based architectures have become ubiquitous in machine learning, yet
our understanding of the reasons for their effectiveness remains limited. This
work proposes a new way to understand self-attention networks: we show that
their output can be decomposed into a sum of smaller terms, each involving the
operation of a sequence of attention heads across layers. Using this
decomposition, we prove that self-attention possesses a strong inductive bias
towards "token uniformity". Specifically, without skip connections or
multi-layer perceptrons (MLPs), the output converges doubly exponentially to a
rank-1 matrix. On the other hand, skip connections and MLPs stop the output
from degeneration. Our experiments verify the identified convergence phenomena
on different variants of standard transformer architectures.
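The rank-collapse phenomenon described in the abstract can be illustrated numerically. Below is a minimal sketch, assuming a toy pure-attention stack (random Gaussian query/key weights, no value projection, no skip connections or MLPs — all choices made here for illustration, not taken from the paper), that tracks how far each layer's output is from a rank-1 matrix.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rank1_residual(X):
    # Distance of X from its best rank-1 approximation, measured as the
    # Frobenius norm of the trailing singular values.
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sqrt((s[1:] ** 2).sum()))

rng = np.random.default_rng(0)
n, d, depth = 16, 32, 8          # tokens, model width, number of layers
X = rng.standard_normal((n, d))
W_Q = rng.standard_normal((depth, d, d)) / np.sqrt(d)
W_K = rng.standard_normal((depth, d, d)) / np.sqrt(d)

residuals = []
for layer in range(depth):
    logits = X @ W_Q[layer] @ W_K[layer].T @ X.T / np.sqrt(d)
    A = softmax(logits)           # row-stochastic attention matrix
    X = A @ X                     # pure attention: no skip, no MLP
    residuals.append(rank1_residual(X))

print(residuals)  # typically shrinks rapidly toward 0 with depth
```

Because each layer replaces every token with a convex combination of the others, the residual contracts quickly; adding a skip connection (`X = X + A @ X`) in the update line is enough to halt the collapse, consistent with the paper's claim.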
Related papers
- The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks [32.60957674853853]
We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance.
arXiv Detail & Related papers (2026-03-05T18:59:04Z) - A Provable Expressiveness Hierarchy in Hybrid Linear-Full Attention [13.144793724034761]
Transformers serve as the foundation of most modern large language models. A fundamental gap remains: their expressive power relative to full attention lacks a rigorous theoretical characterization. Our work provides the first provable separation between hybrid attention and standard full attention, offering a theoretical perspective for understanding the fundamental capabilities and limitations of different attention mechanisms.
arXiv Detail & Related papers (2026-02-02T07:47:21Z) - A Statistical Theory of Gated Attention through the Lens of Hierarchical Mixture of Experts [80.98474052840929]
Gated attention has been empirically demonstrated to increase the expressiveness of low-rank mapping in standard attention. We show that each entry in a gated attention matrix or a multi-head self-attention matrix can be written as a hierarchical mixture of experts.
arXiv Detail & Related papers (2026-02-01T22:22:13Z) - TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors [53.891337639229285]
We introduce TensorLens, a novel formulation that captures the entire transformer as a single, input-dependent linear operator expressed through a high-order attention-interaction connection. Our experiments demonstrate that the attention tensor can serve as a powerful foundation for developing tools aimed at interpretability and model understanding.
arXiv Detail & Related papers (2026-01-25T19:21:25Z) - Universal Approximation with Softmax Attention [10.857177487536656]
We prove that both (i) two-layer self-attention and (ii) one-layer self-attention are universal approximators for continuous sequence-to-sequence functions on compact domains.
arXiv Detail & Related papers (2025-04-22T14:51:33Z) - Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers [3.686808512438363]
Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. We conduct a rigorous analysis to uncover a spectral gap between the two largest singular values of the attention matrix. We propose a simple, practical solution to rank collapse in width by removing the outlier(s).
arXiv Detail & Related papers (2024-10-10T10:34:18Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - On the Benefits of Rank in Attention Layers [38.651863218241154]
We show that there are dramatic trade-offs between the rank and number of heads of the attention mechanism.
We present experiments with off-the-shelf transformers that validate our findings.
arXiv Detail & Related papers (2024-07-23T03:40:24Z) - FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z) - Self-attention Networks Localize When QK-eigenspectrum Concentrates [9.379890125442335]
The self-attention mechanism prevails in modern machine learning.
Two arguments have connected attention localization to model performance.
We show that a small eigenspectrum variance leads attention to be localized.
arXiv Detail & Related papers (2024-02-03T09:35:53Z) - Unraveling Attention via Convex Duality: Analysis and Interpretations of
Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens, based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z) - Multiformer: A Head-Configurable Transformer-Based Model for Direct
Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head.
By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions.
Results show that mixing attention patterns along the different heads and layers outperforms our baseline by up to 0.7 BLEU.
arXiv Detail & Related papers (2022-05-14T17:37:47Z) - Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse.
arXiv Detail & Related papers (2022-02-17T12:20:52Z) - Generic Attention-model Explainability for Interpreting Bi-Modal and
Encoder-Decoder Transformers [78.26411729589526]
We propose the first method to explain predictions made by any Transformer-based architecture.
Our method is superior to all existing methods, which are adapted from single-modality explainability.
arXiv Detail & Related papers (2021-03-29T15:03:11Z) - Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
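The double normalization mentioned in the last entry can be sketched generically as follows. This is a hypothetical illustration (exponentiated scores alternately normalized over keys and then over queries, one Sinkhorn-style sweep); the paper's exact scheme may differ.

```python
import numpy as np

def doubly_normalized_attention(scores, sweeps=1):
    # Generic double-normalization sketch (hypothetical, not necessarily
    # the paper's method): exponentiate the logits, then alternately
    # normalize columns (so no key monopolizes attention mass) and rows
    # (so each query still holds a probability distribution).
    P = np.exp(scores - scores.max())
    for _ in range(sweeps):
        P = P / P.sum(axis=0, keepdims=True)  # balance mass across keys
        P = P / P.sum(axis=1, keepdims=True)  # rows sum to 1 again
    return P

rng = np.random.default_rng(1)
S = rng.standard_normal((4, 4))   # toy query-key score matrix
A = doubly_normalized_attention(S)
print(A.sum(axis=1))              # each row is a valid distribution
```

Because the final step renormalizes rows, the output is still usable as an attention matrix, while the intermediate column normalization discourages any single key from absorbing all of the attention mass.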
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.