The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks
- URL: http://arxiv.org/abs/2603.05498v1
- Date: Thu, 05 Mar 2026 18:59:04 GMT
- Title: The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks
- Authors: Shangwen Sun, Alfredo Canziani, Yann LeCun, Jiachen Zhu
- Abstract summary: We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance.
- Score: 32.60957674853853
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
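The two phenomena in the abstract lend themselves to simple diagnostics: attention-sink behavior can be summarized as the mean attention mass all queries place on one token, and massive activations as hidden-state entries that dwarf the typical magnitude. The sketch below is illustrative only, not from the paper; the function names, the 100x-median outlier threshold, and the toy matrices are assumptions chosen for clarity.

```python
import numpy as np

def sink_score(attn, sink_idx=0):
    """Mean attention mass that all query rows place on one key token.

    attn: (num_queries, num_keys) row-stochastic attention matrix.
    A score far above 1/num_keys suggests the token acts as a sink.
    """
    return attn[:, sink_idx].mean()

def massive_activation_mask(hidden, ratio=100.0):
    """Flag entries whose magnitude exceeds `ratio` times the median magnitude.

    The ratio-of-median threshold is an illustrative choice, not the
    paper's definition. hidden: (num_tokens, num_channels).
    """
    mag = np.abs(hidden)
    return mag > ratio * np.median(mag)

# Toy attention matrix: every query row puts 0.6 of its mass on token 0.
attn = np.full((4, 4), 0.4 / 3)
attn[:, 0] = 0.6
print(sink_score(attn))  # 0.6

# Toy hidden states: one extreme outlier channel on one token.
hidden = np.ones((4, 8))
hidden[0, 3] = 500.0
print(massive_activation_mask(hidden).sum())  # 1
```

In a real model one would extract `attn` from a specific head's post-softmax weights and `hidden` from the residual stream at a given layer; the diagnostics themselves are layer- and head-agnostic.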
Related papers
- Krause Synchronization Transformers [63.8469912831803]
Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics.
arXiv Detail & Related papers (2026-02-12T03:47:53Z) - Attention Needs to Focus: A Unified Perspective on Attention Allocation [37.34801068995858]
The Transformer architecture is a cornerstone of modern Large Language Models (LLMs). The standard attention mechanism is plagued by two well-documented issues: representational collapse and attention sink. We present a unified perspective, arguing that both can be traced to a common root: improper attention allocation.
arXiv Detail & Related papers (2026-01-01T08:39:15Z) - Deconstructing Attention: Investigating Design Principles for Effective Language Modeling [37.92951508140559]
Transformer language models owe much of their success to the dot-product attention mechanism. This work systematically deconstructs attention by designing controlled variants that relax its core design principles. Surprisingly, even variants that fail in isolation can achieve robust performance when interleaved with standard attention.
arXiv Detail & Related papers (2025-10-13T16:42:14Z) - Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models [12.112842686827669]
Transformer-based models have shown remarkable capabilities in sequence learning across a wide range of tasks. We investigate the layerwise behavior of Transformers to uncover the mechanisms underlying their multi-task generalization ability. Our explicit constructions align closely with empirical observations, providing theoretical support for the Transformer's effectiveness and efficiency on sequence learning across diverse tasks.
arXiv Detail & Related papers (2025-06-02T17:39:31Z) - On the Emergence of Position Bias in Transformers [59.87743433861665]
This paper presents a graph-theoretic framework for analyzing position biases in multilayer attention. Our framework offers a principled foundation for understanding positional interplay in transformers.
arXiv Detail & Related papers (2025-02-04T02:53:07Z) - Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
arXiv Detail & Related papers (2024-10-17T17:54:06Z) - Self-attention Networks Localize When QK-eigenspectrum Concentrates [9.379890125442335]
The self-attention mechanism prevails in modern machine learning.
Two lines of argument have connected attention localization to model performance.
We show that a small eigenspectrum variance leads attention to be localized.
arXiv Detail & Related papers (2024-02-03T09:35:53Z) - Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations [58.96953392466609]
We take an in-depth look at the causal awareness of modern representations of agent interactions. We show that recent representations are already partially resilient to perturbations of non-causal agents. We introduce a metric learning approach that regularizes latent representations with causal annotations.
arXiv Detail & Related papers (2023-12-07T18:57:03Z) - Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning [98.78136504619539]
Causal Triplet is a causal representation learning benchmark featuring visually more complex scenes.
We show that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts.
arXiv Detail & Related papers (2023-01-12T17:43:38Z) - Outliers Dimensions that Disrupt Transformers Are Driven by Frequency [79.22656609637525]
We show that the token frequency contributes to the outlier phenomenon.
We also find that, surprisingly, the effect of outliers on model performance varies by layer, and that this variation is related to the correlation between outlier magnitude and encoded token frequency.
arXiv Detail & Related papers (2022-05-23T15:19:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.