Attention Sinks: A 'Catch, Tag, Release' Mechanism for Embeddings
- URL: http://arxiv.org/abs/2502.00919v2
- Date: Mon, 22 Sep 2025 16:16:25 GMT
- Title: Attention Sinks: A 'Catch, Tag, Release' Mechanism for Embeddings
- Authors: Stephen Zhang, Mustafa Khan, Vardan Papyan
- Abstract summary: Large language models (LLMs) often concentrate their attention on a few specific tokens referred to as attention sinks. Common examples include the first token, a prompt-independent sink, and punctuation tokens. Despite their ubiquity, the function, semantic role, and origin of attention sinks remain poorly understood.
- Score: 16.950215926321558
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) often concentrate their attention on a few specific tokens referred to as attention sinks. Common examples include the first token, a prompt-independent sink, and punctuation tokens, which are prompt-dependent. While the tokens causing the sinks often lack direct semantic meaning, the presence of the sinks is critical for model performance, particularly under model compression and KV-caching. Despite their ubiquity, the function, semantic role, and origin of attention sinks -- especially those beyond the first token -- remain poorly understood. In this work, we conduct a comprehensive investigation demonstrating that attention sinks catch a sequence of tokens, tag them using a common direction in embedding space, and release them back into the residual stream, where tokens are later retrieved based on the tags they have acquired. Probing experiments reveal these tags carry semantically meaningful information, such as the truth of a statement. These findings extend to reasoning models, where the mechanism spans more heads and explains greater variance in embeddings, and to recent models with query-key normalization, where sinks remain just as prevalent. To encourage future theoretical analysis, we introduce a minimal problem which can be solved through the 'catch, tag, release' mechanism, and where it emerges through training.
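For concreteness, the 'catch, tag, release' claim suggests a simple probing recipe: find a sink by its incoming attention mass, collect the tokens that attend to it, and check whether their residual-stream updates share one dominant direction. The sketch below assumes the Hugging Face transformers API; the model ("gpt2"), layer index, and thresholds are illustrative stand-ins for the paper's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")

inputs = tok("The capital of France is Paris.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True, output_hidden_states=True)

layer = 5                                    # illustrative layer choice
attn = out.attentions[layer][0]              # (heads, queries, keys)
incoming = attn.mean(dim=(0, 1))             # avg. attention each key receives
sink = int(incoming.argmax())                # strongest sink candidate
print("sink position:", sink, "mass:", float(incoming[sink]))

# "Tag" probe: tokens attending strongly to the sink ("caught" tokens)
# should receive residual-stream updates sharing one dominant direction.
caught = attn.mean(0)[:, sink] > 0.3         # illustrative threshold
updates = (out.hidden_states[layer + 1] - out.hidden_states[layer])[0][caught]
if updates.shape[0] > 1:
    _, s, _ = torch.linalg.svd(updates - updates.mean(0))
    print("variance explained by top direction:", float(s[0] ** 2 / (s ** 2).sum()))
```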
Related papers
- The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks [32.60957674853853]
We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance.
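A massive-activation detector can be as simple as a magnitude-outlier scan over one layer's hidden states; the ratio-to-median threshold below is an illustrative assumption, not the paper's criterion.

```python
import torch

def find_massive_activations(h: torch.Tensor, ratio: float = 100.0):
    """Flag (token, channel) pairs whose magnitude dwarfs the typical one.

    h: (seq_len, d_model) hidden states from any transformer layer.
    """
    mag = h.abs()
    hits = (mag > ratio * mag.median()).nonzero()   # outlier coordinates
    return [(int(t), int(c), float(mag[t, c])) for t, c in hits]
```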
arXiv Detail & Related papers (2026-03-05T18:59:04Z)
- A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training [86.64715217940274]
Outliers function jointly with normalization. They serve as rescale factors rather than direct contributors, and can be absorbed into learnable parameters or mitigated via explicit gated rescaling.
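As a rough illustration of "explicit gated rescaling", one could replace the implicit outlier-driven rescale with a learned per-token gate; this is an assumption-laden reading of the summary, not the paper's module.

```python
import torch
import torch.nn as nn

class GatedRescale(nn.Module):
    """Toy explicit rescaling gate: a per-token scalar in (0, 1) that
    scales the hidden state, standing in for the rescaling that outlier
    features would otherwise perform implicitly."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> same shape, rescaled per token
        return torch.sigmoid(self.gate(x)) * x
```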
arXiv Detail & Related papers (2026-01-30T13:29:45Z)
- Context Tokens are Anchors: Understanding the Repetition Curse in dMLLMs from an Information Flow Perspective [40.28551750991027]
Cache mechanisms often introduce undesirable repetitive text generation. We analyze repetition generation through the lens of information flow. We present CoTA, a plug-and-play method for mitigating repetition.
arXiv Detail & Related papers (2026-01-28T11:54:42Z)
- Attention Needs to Focus: A Unified Perspective on Attention Allocation [37.34801068995858]
The Transformer architecture is a cornerstone of modern Large Language Models (LLMs). The standard attention mechanism is plagued by well-documented issues: representational collapse and attention sink. We present a unified perspective, arguing that both can be traced to a common root -- improper attention allocation.
arXiv Detail & Related papers (2026-01-01T08:39:15Z)
- Causal Attention with Lookahead Keys [52.63961482746826]
In standard causal attention, each token's query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism that continually updates each token's keys as the context unfolds.
arXiv Detail & Related papers (2025-09-09T00:15:23Z)
- Artifacts and Attention Sinks: Structured Approximations for Efficient Vision Transformers [8.486148475471271]
Vision transformers have emerged as a powerful tool across a wide range of applications, yet their inner workings remain only partially understood. We examine the phenomenon of massive tokens - tokens with exceptionally high activation norms that act as attention sinks - and artifact tokens that emerge as a byproduct during inference. We introduce Fast Nyström Attention (FNA), a training-free method that approximates self-attention in linear time and space.
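The general Nyström trick behind such linear-time approximations is compact enough to sketch. The landmark choice here (mean-pooled segments) is generic; the paper's FNA construction and its handling of massive/artifact tokens may differ.

```python
import torch

def nystrom_attention(q, k, v, m=16):
    """Nystrom approximation of softmax attention for one head.

    q, k, v: (n, d) tensors with n divisible by m (number of landmarks).
    Runs in O(n*m) rather than O(n^2).
    """
    n, d = q.shape
    q_l = q.reshape(m, n // m, d).mean(1)        # query landmarks
    k_l = k.reshape(m, n // m, d).mean(1)        # key landmarks
    s = d ** -0.5
    f = torch.softmax(q @ k_l.T * s, dim=-1)     # (n, m)
    a = torch.softmax(q_l @ k_l.T * s, dim=-1)   # (m, m)
    b = torch.softmax(q_l @ k.T * s, dim=-1)     # (m, n)
    return f @ torch.linalg.pinv(a) @ (b @ v)    # ~ softmax(qk^T/sqrt(d)) v
```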
arXiv Detail & Related papers (2025-07-21T19:29:03Z)
- Think Clearly: Improving Reasoning via Redundant Token Pruning [57.01254508252785]
We show that deliberately removing redundancy in the reasoning process significantly improves performance. We demonstrate that our method significantly improves overall accuracy across reasoning-intensive benchmarks without any training.
arXiv Detail & Related papers (2025-06-17T06:04:01Z)
- Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. LRMs need to generate long chain-of-thought reasoning in order to think before answering. We introduce Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens.
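A single-level caricature of that idea -- exact logits for a few important keys, one count-corrected centroid for the remainder -- looks like the following; the actual method builds a multi-level (multipole-style) cluster hierarchy.

```python
import torch

def two_level_attention(q, k, v, topk=32):
    """Exact attention on the `topk` most important keys (scored by mean
    query-key similarity); all other keys are summarized by a centroid
    with a log-count correction so it represents its group under softmax.
    q: (nq, d); k, v: (nk, d) with nk > topk."""
    s = k.shape[-1] ** -0.5
    scores = q @ k.T * s                                  # (nq, nk)
    keep = torch.zeros(k.shape[0], dtype=torch.bool)
    keep[scores.mean(0).topk(topk).indices] = True
    k_far = k[~keep].mean(0, keepdim=True)                # far-field centroid
    v_far = v[~keep].mean(0, keepdim=True)
    far = q @ k_far.T * s + torch.log((~keep).sum().float())
    w = torch.softmax(torch.cat([scores[:, keep], far], dim=-1), dim=-1)
    return w @ torch.cat([v[keep], v_far], dim=0)
```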
arXiv Detail & Related papers (2025-06-16T03:00:40Z)
- Vision Transformers Don't Need Trained Registers [17.412430704896455]
A sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens. We create a training-free approach to mitigate these artifacts. Our results suggest that test-time registers effectively take on the role of trained register tokens at test time.
arXiv Detail & Related papers (2025-06-09T17:59:57Z)
- Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [63.89280381800457]
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens.
We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism.
Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
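Dimension-wise quantization itself is straightforward to sketch: each feature dimension is independently snapped to a small uniform codebook, turning a continuous vector into a short sequence of categorical codes. Bin count and value range below are illustrative.

```python
import torch

def dimwise_quantize(z: torch.Tensor, num_bins: int = 16,
                     lo: float = -1.0, hi: float = 1.0):
    """Independently discretize each feature dimension of z.

    Returns integer codes (one small categorical per dimension, suitable
    for a lightweight autoregressive head) and the dequantized vector.
    """
    z = z.clamp(lo, hi)
    idx = ((z - lo) / (hi - lo) * (num_bins - 1)).round().long()
    dequant = lo + idx.float() / (num_bins - 1) * (hi - lo)
    return idx, dequant
```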
arXiv Detail & Related papers (2025-03-20T17:59:59Z)
- Don't Take Things Out of Context: Attention Intervention for Enhancing Chain-of-Thought Reasoning in Large Language Models [32.71672086718058]
Few-shot Chain-of-Thought (CoT) prompting significantly enhances the reasoning capabilities of large language models (LLMs).
We observe that isolated segments, words, or tokens within CoT demonstrations can unexpectedly disrupt the generation process of LLMs.
We propose a Few-shot Attention Intervention method (FAI) that dynamically analyzes the attention patterns of demonstrations to accurately identify these tokens.
arXiv Detail & Related papers (2025-03-14T07:46:33Z)
- Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling. We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z)
- Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study [38.492552119793]
We investigate an alternative attention mechanism based on the stick-breaking process in larger scale settings. We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. When used as a drop-in replacement for current softmax+RoPE attention systems, we find that stick-breaking attention performs competitively with current methods.
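A toy version of the stick-breaking weighting, as we read it: walking from the most recent key backwards, each key takes a sigmoid(logit) fraction of the attention mass left on the "stick". Details such as the treatment of leftover mass may differ from the paper's implementation.

```python
import torch

def stick_breaking_attention(q, k, v):
    """Stick-breaking attention for a single query over its causal keys.

    q: (d,); k, v: (n, d), ordered oldest -> newest. Key j gets weight
    beta_j * prod over newer keys of (1 - beta), with beta = sigmoid(logits).
    """
    logits = k @ q / q.shape[-1] ** 0.5
    beta = torch.sigmoid(logits)
    log_keep = torch.log1p(-beta)                 # log(1 - beta_j)
    # Cumulative log "stick remaining" contributed by keys newer than j.
    rev = torch.flip(torch.cumsum(torch.flip(log_keep, [0]), 0), [0])
    w = beta * torch.exp(rev - log_keep)          # exclude key j itself
    return w @ v
```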
arXiv Detail & Related papers (2024-10-23T15:51:13Z)
- Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
arXiv Detail & Related papers (2024-10-17T17:54:06Z)
- When Attention Sink Emerges in Language Models: An Empirical View [39.36282162213973]
Language Models (LMs) assign significant attention to the first token, even if it is not semantically important. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. We first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models.
arXiv Detail & Related papers (2024-10-14T17:50:28Z)
- A phase transition between positional and semantic learning in a solvable model of dot-product attention [30.96921029675713]
A solvable model of dot-product attention is studied: a non-linear self-attention layer with trainable, low-rank query and key matrices.
We show that the model learns either a positional attention mechanism (with tokens attending to each other based on their respective positions) or a semantic attention mechanism (with tokens attending to each other based on their meaning), and exhibits a phase transition from the former to the latter with increasing sample complexity.
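The two regimes are easy to picture with toy attention matrices: one whose scores depend only on positions, one whose scores depend only on content. This contrast is purely illustrative and is not the paper's solvable model.

```python
import torch

n, d = 8, 16
x = torch.randn(n, d)                           # token embeddings (content)

# Positional mechanism: a fixed "attend to the previous token" pattern
# (row 0 has no predecessor and degenerates to uniform).
pos_scores = torch.full((n, n), -1e9)
pos_scores[torch.arange(1, n), torch.arange(n - 1)] = 0.0
positional = torch.softmax(pos_scores, dim=-1)

# Semantic mechanism: scores depend only on content similarity.
semantic = torch.softmax(x @ x.T / d ** 0.5, dim=-1)
```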
arXiv Detail & Related papers (2024-02-06T11:13:54Z)
- Self-attention Networks Localize When QK-eigenspectrum Concentrates [9.379890125442335]
The self-attention mechanism prevails in modern machine learning.
Two arguments have connected attention localization to model performance.
We show that a small eigenspectrum variance causes attention to localize.
arXiv Detail & Related papers (2024-02-03T09:35:53Z)
- Improving Input-label Mapping with Demonstration Replay for In-context Learning [67.57288926736923]
In-context learning (ICL) is an emerging capability of large autoregressive language models.
We propose a novel ICL method called Repeated Demonstration with Sliding Causal Attention (RdSca).
We show that our method significantly improves the input-label mapping in ICL demonstrations.
arXiv Detail & Related papers (2023-10-30T14:29:41Z)
- Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions [10.621564997491808]
Chain-of-thought (CoT) prompting has been shown to empirically improve the accuracy of large language models.
We investigate whether CoT prompting affects the relative importance LLMs assign to particular input tokens.
Our results indicate that while CoT prompting does not increase the magnitude of saliency scores attributed to semantically relevant tokens in the prompt, it increases the robustness of saliency scores to question perturbations and variations in model output.
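A minimal gradient-times-input saliency probe of the kind this analysis relies on can be written in a few lines; the model ("gpt2") and the specific attribution variant are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Roger has 5 balls and buys 2 more.", return_tensors="pt").input_ids
emb = model.get_input_embeddings()(ids).detach().requires_grad_(True)
logits = model(inputs_embeds=emb).logits

# Saliency of each prompt token w.r.t. the top next-token logit.
logits[0, -1].max().backward()
saliency = (emb.grad * emb).sum(-1).abs()[0]
for token, score in zip(tok.convert_ids_to_tokens(ids[0]), saliency.tolist()):
    print(f"{token:>12s} {score:.3f}")
```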
arXiv Detail & Related papers (2023-07-25T08:51:30Z)
- ChiroDiff: Modelling chirographic data with Diffusion Models [132.5223191478268]
We introduce a powerful model class, namely Denoising Diffusion Probabilistic Models (DDPMs), for chirographic data.
Our model, named ChiroDiff, is non-autoregressive; it learns to capture holistic concepts and therefore remains resilient to higher temporal sampling rates.
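For readers unfamiliar with DDPMs, the training objective is compact: corrupt the data with a known noise schedule and train a (here non-autoregressive) network to predict the injected noise. The schedule and the model(x_t, t) signature below are generic stand-ins for ChiroDiff's actual choices.

```python
import torch

def ddpm_loss(model, x0: torch.Tensor, T: int = 1000) -> torch.Tensor:
    """One DDPM training step: model(x_t, t) predicts the injected noise.

    x0: (batch, ...) clean chirographic sequences; linear beta schedule.
    """
    betas = torch.linspace(1e-4, 0.02, T)
    abar = torch.cumprod(1.0 - betas, dim=0)             # cumulative alpha-bar
    t = torch.randint(0, T, (x0.shape[0],))              # random timestep per sample
    noise = torch.randn_like(x0)
    a = abar[t].view(-1, *([1] * (x0.dim() - 1)))        # broadcast per sample
    xt = a.sqrt() * x0 + (1.0 - a).sqrt() * noise        # forward corruption
    return ((model(xt, t) - noise) ** 2).mean()
```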
arXiv Detail & Related papers (2023-04-07T15:17:48Z)
- Toward Certified Robustness Against Real-World Distribution Shifts [65.66374339500025]
We train a generative model to learn perturbations from data and define specifications with respect to the output of the learned model.
A unique challenge arising from this setting is that existing verifiers cannot tightly approximate sigmoid activations.
We propose a general meta-algorithm for handling sigmoid activations which leverages classical notions of counter-example-guided abstraction refinement.
arXiv Detail & Related papers (2022-06-08T04:09:13Z)
- Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z)
- Identifying and Mitigating Spurious Correlations for Improving Robustness in NLP Models [19.21465581259624]
Many problems can be attributed to models exploiting spurious correlations, or shortcuts between the training data and the task labels.
In this paper, we aim to automatically identify such spurious correlations in NLP models at scale.
We show that our proposed method can effectively and efficiently identify a scalable set of "shortcuts", and that mitigating these leads to more robust models in multiple applications.
arXiv Detail & Related papers (2021-10-14T21:40:03Z)