Self-attention Networks Localize When QK-eigenspectrum Concentrates
- URL: http://arxiv.org/abs/2402.02098v1
- Date: Sat, 3 Feb 2024 09:35:53 GMT
- Title: Self-attention Networks Localize When QK-eigenspectrum Concentrates
- Authors: Han Bao, Ryuichiro Hataya, Ryo Karakida
- Abstract summary: The self-attention mechanism prevails in modern machine learning.
Two arguments have connected attention localization to model performance.
We show that a small eigenspectrum variance leads attention to be localized.
- Score: 9.379890125442335
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The self-attention mechanism prevails in modern machine learning. It has an
interesting functionality of adaptively selecting tokens from an input sequence
by modulating the degree of attention localization, which many researchers
speculate is the basis of the powerful model performance but complicates the
underlying mechanism of the learning dynamics. In recent years, two main
arguments have connected attention localization to model performance. One is
rank collapse, where the token embeddings produced by a self-attention block
become very similar across different tokens, leading to a less expressive
network. The other is entropy collapse, where the attention probabilities
become highly non-uniform and thus have low entropy, making the learning
dynamics more likely to be trapped in plateaus. These two failure modes may
appear to contradict each other because rank collapse and entropy collapse are
associated with uniform and non-uniform attention, respectively. To reconcile
them, we characterize the notion of attention localization by the eigenspectrum
of query-key parameter matrices and reveal that a small eigenspectrum variance
leads attention to be localized. Interestingly, a small eigenspectrum variance
prevents both rank and entropy collapse, leading to better model expressivity
and trainability.
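The quantities in the abstract can be made concrete. Below is a minimal sketch, not the authors' code: the single-head setup, the Gaussian toy inputs, and the use of W_Q W_K^T as the combined query-key parameter matrix are assumptions made for illustration. It computes eigenspectrum statistics of the query-key matrix and the per-token entropy of the induced attention distribution, i.e., the quantities the abstract ties to attention localization and entropy collapse.

```python
# Minimal sketch (illustrative assumptions, not the authors' code): inspect the
# eigenspectrum of a combined query-key matrix and the entropy of the attention
# it induces on random token embeddings.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 16                               # embedding dimension, sequence length
X = rng.standard_normal((n, d))             # toy token embeddings (one sequence)
W_Q = rng.standard_normal((d, d)) / np.sqrt(d)
W_K = rng.standard_normal((d, d)) / np.sqrt(d)

# Combined query-key parameter matrix; attention logits are X W X^T / sqrt(d).
W = W_Q @ W_K.T
eigvals = np.linalg.eigvals(W)
print("eigenspectrum mean:", eigvals.mean().real)
print("eigenspectrum variance:", eigvals.var())

# Softmax attention probabilities and their per-row entropy.
# Low entropy corresponds to localized attention; uniformly high entropy is the
# regime the abstract associates with rank collapse.
logits = X @ W @ X.T / np.sqrt(d)
P = np.exp(logits - logits.max(axis=-1, keepdims=True))
P /= P.sum(axis=-1, keepdims=True)
entropy = -(P * np.log(P + 1e-12)).sum(axis=-1)
print("mean attention entropy:", entropy.mean())
```

Tracking these two statistics over training would be one way to probe the abstract's claim that a small eigenspectrum variance goes along with localized, low-entropy attention while avoiding both collapse modes.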
Related papers
- Attention Sinks and Outlier Features: A 'Catch, Tag, and Release' Mechanism for Embeddings [4.30907936718325]
Two prominent features of large language models (LLMs) are the presence of large-norm (outlier) features and the tendency for tokens to attend very strongly to a select few tokens.
We show that attention sinks utilize outlier features to: catch a sequence of tokens, tag the captured tokens by applying a common perturbation, and then release the tokens back into the residual stream.
arXiv Detail & Related papers (2025-02-02T21:15:07Z)
- Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models [49.84163262868945]
Large language models have shown remarkable performance across a wide range of language tasks, owing to their exceptional capabilities in context modeling.
The most commonly used method of context modeling is full self-attention, as seen in standard decoder-only Transformers.
We propose parallel context encoding, which splits the context into sub-pieces and encodes them in parallel.
arXiv Detail & Related papers (2024-12-21T09:04:51Z)
- Unveiling and Controlling Anomalous Attention Distribution in Transformers [8.456319173083315]
The waiver phenomenon allows elements to absorb excess attention without affecting their contribution to information.
In specific models, due to differences in positional encoding and attention patterns, we find that the model's selection of waiver elements falls into two categories.
arXiv Detail & Related papers (2024-06-26T11:53:35Z) - On the Role of Attention Masks and LayerNorm in Transformers [55.81177251872377]
Self-attention is the key mechanism of transformers.
Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse (a minimal diagnostic sketch appears after this list).
arXiv Detail & Related papers (2024-05-29T05:41:28Z)
- A phase transition between positional and semantic learning in a solvable model of dot-product attention [30.96921029675713]
A solvable model of dot-product attention is studied as a non-linear self-attention layer with trainable, low-dimensional query and key matrices.
We show that the model learns either a positional attention mechanism (where tokens attend to each other based on their respective positions) or a semantic attention mechanism (where tokens attend to each other based on their meaning), and that it transitions from the former to the latter with increasing sample complexity.
arXiv Detail & Related papers (2024-02-06T11:13:54Z)
- An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models [64.87562101662952]
We show that input tokens are often exchangeable since they already include positional encodings.
We establish the existence of a sufficient and minimal representation of input tokens.
We prove that attention with the desired parameter infers the latent posterior up to an approximation error.
arXiv Detail & Related papers (2022-12-30T17:59:01Z)
- Revisiting Attention Weights as Explanations from an Information Theoretic Perspective [4.499369811647602]
Our findings indicate that attention mechanisms do have the potential to function as a shortcut to model explanations when they are carefully combined with other model elements.
arXiv Detail & Related papers (2022-10-31T12:53:20Z)
- Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
- Attention or memory? Neurointerpretable agents in space and time [0.0]
We design a model incorporating a self-attention mechanism that implements task-state representations in semantic feature-space.
To evaluate the agent's selective properties, we add a large volume of task-irrelevant features to observations.
In line with neuroscience predictions, self-attention leads to increased robustness to noise compared to benchmark models.
arXiv Detail & Related papers (2020-07-09T15:04:26Z)
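As referenced in the rank-collapse entry above, here is a hypothetical diagnostic sketch, not taken from any of the listed papers; the function and variable names are illustrative. It measures how far a block's token outputs are from a rank-one matrix, which is the degenerate state that rank collapse describes.

```python
# Hypothetical rank-collapse diagnostic (illustrative only): the fraction of the
# token-output matrix's energy left after removing its best rank-one
# approximation. Values near 0 mean the token embeddings have nearly collapsed.
import numpy as np

def rank_collapse_residual(H: np.ndarray) -> float:
    """H: (n_tokens, d) outputs of a self-attention block."""
    s = np.linalg.svd(H, compute_uv=False)          # singular values, descending
    return float(np.sqrt((s[1:] ** 2).sum() / (s ** 2).sum()))

rng = np.random.default_rng(0)
healthy = rng.standard_normal((16, 64))             # diverse token outputs
collapsed = np.outer(np.ones(16), rng.standard_normal(64))  # identical tokens
collapsed += 1e-3 * rng.standard_normal((16, 64))   # plus a tiny perturbation
print(rank_collapse_residual(healthy))              # close to 1: far from rank one
print(rank_collapse_residual(collapsed))            # close to 0: near rank collapse
```

Pairing this residual with the attention-entropy measurement sketched after the main abstract gives one probe per apparent failure mode: the residual tracks rank collapse, while the entropy tracks entropy collapse.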
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.