Self-attention Networks Localize When QK-eigenspectrum Concentrates
- URL: http://arxiv.org/abs/2402.02098v1
- Date: Sat, 3 Feb 2024 09:35:53 GMT
- Title: Self-attention Networks Localize When QK-eigenspectrum Concentrates
- Authors: Han Bao, Ryuichiro Hataya, Ryo Karakida
- Abstract summary: The self-attention mechanism prevails in modern machine learning.
Two arguments have connected attention localization to model performance.
We show that a small eigenspectrum variance leads attention to be localized.
- Score: 9.379890125442335
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The self-attention mechanism prevails in modern machine learning. It has an
interesting functionality of adaptively selecting tokens from an input sequence
by modulating the degree of attention localization, which many researchers
speculate is the basis of the powerful model performance but complicates the
underlying mechanism of the learning dynamics. In recent years, two main
arguments have connected attention localization to model performance. One is
rank collapse, where the token embeddings produced by a self-attention block
become nearly identical across tokens, leaving a less expressive network. The
other is entropy collapse, where the attention probabilities approach a highly
non-uniform, low-entropy distribution, making the learning dynamics more likely
to be trapped in plateaus. These two failure modes may appear to contradict
each other because rank collapse and entropy collapse are associated with
uniform and non-uniform attention, respectively. To reconcile them, we characterize
the notion of attention localization by the eigenspectrum of query-key
parameter matrices and reveal that a small eigenspectrum variance leads
attention to be localized. Interestingly, the small eigenspectrum variance
prevents both rank and entropy collapse, leading to better model expressivity
and trainability.
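As an illustrative aside (not the authors' code or analysis), the minimal numpy sketch below probes the three quantities the abstract ties together: the eigenspectrum of a query-key parameter matrix, the entropy of the resulting attention rows (entropy collapse), and the effective rank of the attended outputs (rank collapse). The Gaussian token model, the eigenvalue sampling, and helper names such as `qk_matrix` and `attention_stats` are assumptions made for this demo.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qk_matrix(dim, eig_mean, eig_var, rng):
    """Symmetric query-key parameter matrix whose eigenvalues are drawn from
    N(eig_mean, eig_var), so the eigenspectrum mean and variance are controlled."""
    eigvals = rng.normal(eig_mean, np.sqrt(eig_var), size=dim)
    basis, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal basis
    return basis @ np.diag(eigvals) @ basis.T

def attention_stats(X, W):
    """Mean row entropy of softmax attention and effective rank of its output."""
    scores = X @ W @ X.T / np.sqrt(X.shape[1])
    A = softmax(scores, axis=-1)
    entropy = -(A * np.log(A + 1e-12)).sum(axis=-1).mean()
    s = np.linalg.svd(A @ X, compute_uv=False)
    eff_rank = (s.sum() ** 2) / (s ** 2).sum()  # participation-ratio rank proxy
    return entropy, eff_rank

rng = np.random.default_rng(0)
n_tokens, dim = 32, 64
X = rng.normal(size=(n_tokens, dim))  # toy Gaussian token embeddings

for eig_var in [0.0, 1.0, 10.0, 100.0]:
    W = qk_matrix(dim, eig_mean=2.0, eig_var=eig_var, rng=rng)
    entropy, eff_rank = attention_stats(X, W)
    print(f"eigenspectrum variance {eig_var:6.1f}: "
          f"mean attention entropy {entropy:.3f}, "
          f"effective rank of attended outputs {eff_rank:.1f}")
```

Sweeping the eigenvalue variance while holding the mean fixed gives a rough empirical handle on how peaked (low-entropy) the attention rows become and how much diversity survives in the attended outputs, in the spirit of the abstract's claim; it is a probe, not a reproduction of the paper's theory.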
Related papers
- Unveiling and Controlling Anomalous Attention Distribution in Transformers [8.456319173083315]
The waiver phenomenon allows certain elements to absorb excess attention without affecting their contribution to the information that is passed on.
In specific models, owing to differences in positional encoding and attention patterns, the model's selection of waiver elements falls into two categories.
arXiv Detail & Related papers (2024-06-26T11:53:35Z) - On the Role of Attention Masks and LayerNorm in Transformers [55.81177251872377]
Self-attention is the key mechanism of transformers.
Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse as depth grows (a toy sketch after this related-papers list illustrates the effect).
arXiv Detail & Related papers (2024-05-29T05:41:28Z) - A phase transition between positional and semantic learning in a solvable model of dot-product attention [30.96921029675713]
A solvable model of dot-product attention is studied: a non-linear self-attention layer with trainable, low-dimensional query and key matrices.
We show that the layer learns either a positional attention mechanism (with tokens attending to each other based on their respective positions) or a semantic attention mechanism (with tokens attending to each other based on their meaning), and that it transitions from the former to the latter as sample complexity increases.
arXiv Detail & Related papers (2024-02-06T11:13:54Z) - An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models [64.87562101662952]
We show that input tokens are often exchangeable since they already include positional encodings.
We establish the existence of a sufficient and minimal representation of input tokens.
We prove that attention with the desired parameter infers the latent posterior up to an approximation error.
arXiv Detail & Related papers (2022-12-30T17:59:01Z) - Revisiting Attention Weights as Explanations from an Information Theoretic Perspective [4.499369811647602]
We show that attention mechanisms have the potential to function as a shortcut to model explanations when they are carefully combined with other model elements.
arXiv Detail & Related papers (2022-10-31T12:53:20Z) - Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z) - SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z) - Bayesian Attention Modules [65.52970388117923]
We propose a scalable version of attention that is easy to implement and optimize.
Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
arXiv Detail & Related papers (2020-10-20T20:30:55Z) - Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z) - Attention or memory? Neurointerpretable agents in space and time [0.0]
We design a model incorporating a self-attention mechanism that implements task-state representations in semantic feature-space.
To evaluate the agent's selective properties, we add a large volume of task-irrelevant features to observations.
In line with neuroscience predictions, self-attention leads to increased robustness to noise compared to benchmark models.
arXiv Detail & Related papers (2020-07-09T15:04:26Z)
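Several of the papers above turn on rank collapse in pure self-attention stacks. As a hedged illustration (not taken from any of the listed papers), the sketch below iterates softmax self-attention with random, untrained weights and without residual connections or LayerNorm, and tracks how quickly the token representations approach a rank-one matrix; all sizes and helper names are arbitrary choices for the demo.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distance_to_rank_one(X):
    """Relative distance of X to its best rank-one approximation (0 = collapsed)."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt((s[1:] ** 2).sum() / (s ** 2).sum())

rng = np.random.default_rng(0)
n_tokens, dim, depth = 16, 32, 12
X = rng.normal(size=(n_tokens, dim))  # toy token embeddings

for layer in range(1, depth + 1):
    # One "pure" attention layer: random untrained QK matrix, no value projection,
    # no residual connection, no LayerNorm.
    W = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    A = softmax(X @ W @ X.T / np.sqrt(dim), axis=-1)
    X = A @ X
    print(f"layer {layer:2d}: relative distance to rank one = "
          f"{distance_to_rank_one(X):.4f}")
```

Because each attention matrix is row-stochastic, repeatedly averaging the token representations drives them toward a common vector, which is the collapse the printed distance tracks.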