Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
- URL: http://arxiv.org/abs/2103.03404v2
- Date: Tue, 1 Aug 2023 14:27:08 GMT
- Title: Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
- Authors: Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas
- Abstract summary: This work proposes a new way to understand self-attention networks.
We show that their output can be decomposed into a sum of smaller terms.
We prove that self-attention possesses a strong inductive bias towards "token uniformity".
- Score: 48.16156149749371
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based architectures have become ubiquitous in machine learning, yet
our understanding of the reasons for their effectiveness remains limited. This
work proposes a new way to understand self-attention networks: we show that
their output can be decomposed into a sum of smaller terms, each involving the
operation of a sequence of attention heads across layers. Using this
decomposition, we prove that self-attention possesses a strong inductive bias
towards "token uniformity". Specifically, without skip connections or
multi-layer perceptrons (MLPs), the output converges doubly exponentially to a
rank-1 matrix. On the other hand, skip connections and MLPs stop the output
from degenerating. Our experiments verify the identified convergence phenomena
on different variants of standard transformer architectures.
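As a quick numerical check of the rank-collapse claim, the minimal NumPy sketch below (illustrative only, not the authors' code; the sequence length, width, depth, and random single-head parameterization are arbitrary assumptions) stacks pure self-attention layers with no skip connections and no MLPs, and tracks the relative Frobenius distance of the token matrix to its best rank-1 approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, depth = 32, 64, 12   # illustrative sizes, chosen arbitrarily

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def relative_rank1_residual(X):
    # ||X - best rank-1 approximation||_F / ||X||_F, computed from singular values.
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt((s[1:] ** 2).sum()) / np.sqrt((s ** 2).sum())

X = rng.standard_normal((n_tokens, d))
print(f"layer  0: relative rank-1 residual {relative_rank1_residual(X):.3e}")

for layer in range(1, depth + 1):
    # One pure single-head self-attention layer: no skip connection, no MLP.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d), axis=-1)  # row-stochastic attention
    X = A @ (X @ Wv)
    print(f"layer {layer:2d}: relative rank-1 residual {relative_rank1_residual(X):.3e}")
```

Under these assumptions the relative residual typically collapses towards zero within a few layers; adding a skip connection (`X = X + A @ (X @ Wv)`) generally keeps it bounded away from zero, in line with the abstract's claim that skip connections prevent degeneration.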
Related papers
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform in-context learning (ICL) on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
- On the Benefits of Rank in Attention Layers [38.651863218241154]
We show that there are dramatic trade-offs between the rank and number of heads of the attention mechanism.
We present experiments with off-the-shelf transformers that validate our findings.
arXiv Detail & Related papers (2024-07-23T03:40:24Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaling attention mechanism that maintains the full representation of the attention matrix without resorting to sparsification.
Results indicate that our attention mechanism has robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- Self-attention Networks Localize When QK-eigenspectrum Concentrates [9.379890125442335]
The self-attention mechanism prevails in modern machine learning.
Two arguments have connected attention localization to model performance.
We show that a small eigenspectrum variance causes attention to localize.
arXiv Detail & Related papers (2024-02-03T09:35:53Z)
- Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z)
- Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head.
By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions.
Results show that mixing attention patterns along the different heads and layers outperforms our baseline by up to 0.7 BLEU.
arXiv Detail & Related papers (2022-05-14T17:37:47Z)
- Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse.
arXiv Detail & Related papers (2022-02-17T12:20:52Z)
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers [78.26411729589526]
We propose the first method to explain predictions made by any Transformer-based architecture.
Our method is superior to all existing methods, which are adapted from single-modality explainability.
arXiv Detail & Related papers (2021-03-29T15:03:11Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than models based on competing architectures across a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
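Since the last entry describes its doubly-normalized attention as simple to implement, here is a generic NumPy sketch of double normalization (a softmax over queries for each key, followed by a renormalization over keys for each query); the paper's exact formulation and guarantees may differ, so treat this as an assumption-labelled illustration rather than the proposed method.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def doubly_normalized_attention(Q, K, V):
    """Generic doubly-normalized attention (illustration, not the paper's exact scheme).

    Step 1: softmax each key's scores over the queries (columns), so every key
            receives a unit of attention mass distributed across the queries.
    Step 2: renormalize over the keys (rows) so each query's weights sum to one.
    """
    logits = Q @ K.T / np.sqrt(Q.shape[-1])                 # (n_queries, n_keys)
    col_norm = softmax(logits, axis=0)                      # normalize over queries, per key
    attn = col_norm / col_norm.sum(axis=1, keepdims=True)   # normalize over keys, per query
    return attn @ V

rng = np.random.default_rng(0)
n, d = 8, 16                                                # illustrative sizes
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = doubly_normalized_attention(Q, K, V)
print(out.shape)                                            # (8, 16)
```

Standard attention applies only the per-query normalization; the extra per-key step gives every key a positive share of attention mass across queries, which is, loosely, the intuition behind avoiding the "explaining away" effect described above.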
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.