Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
- URL: http://arxiv.org/abs/2410.13835v2
- Date: Thu, 07 Nov 2024 16:57:02 GMT
- Title: Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
- Authors: Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei
- Abstract summary: Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
- Score: 77.66717051042032
- Abstract: Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.
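For concreteness, the following is a minimal PyTorch sketch (not the authors' code) of two ingredients mentioned in the abstract: a simple measure of how much attention mass queries place on a candidate sink token, taken here to be position 0, and a ReLU variant of the attention normalization in place of softmax, one of the proposed mitigations. The tensor shapes, the sink_mass metric, and the exact ReLU formulation are illustrative assumptions.

```python
# Minimal sketch: sink-mass measurement and a ReLU attention variant.
# Shapes, function names, and the ReLU formulation are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def attention_weights(q, k, use_relu=False):
    """q, k: (seq_len, head_dim). Returns a (seq_len, seq_len) causal attention map."""
    scores = q @ k.T / k.shape[-1] ** 0.5                # scaled dot-product scores
    causal = torch.tril(torch.ones_like(scores)).bool()  # causal mask
    if use_relu:
        # ReLU attention: weights stay non-negative but rows need not sum to 1,
        # so no token is forced to absorb leftover probability mass.
        return F.relu(scores) * causal
    scores = scores.masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1)                      # standard softmax attention

def sink_mass(attn):
    """Average attention weight that queries place on token 0 (a typical sink position)."""
    return attn[:, 0].mean().item()

seq_len, head_dim = 16, 8
q, k = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
print("softmax sink mass:", sink_mass(attention_weights(q, k)))
print("relu    sink mass:", sink_mass(attention_weights(q, k, use_relu=True)))
```

Because ReLU rows are not constrained to sum to one, excess attention need not be dumped onto a single token, which is one intuition for why this substitution can suppress attention sinks.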
Related papers
- Systematic Outliers in Large Language Models [41.2150163753952]
Outliers have been widely observed in Large Language Models (LLMs).
We provide a detailed analysis of the formation process, underlying causes, and functions of outliers in LLMs.
arXiv Detail & Related papers (2025-02-10T12:54:17Z) - On the Emergence of Position Bias in Transformers [59.87743433861665]
This paper introduces a novel graph-theoretic framework to analyze position bias in multi-layer attention.
We quantify how tokens interact with contextual information based on their sequential positions.
Our framework offers a principled foundation for understanding positional biases in transformers.
arXiv Detail & Related papers (2025-02-04T02:53:07Z) - Attention Sinks and Outlier Features: A 'Catch, Tag, and Release' Mechanism for Embeddings [4.30907936718325]
Two prominent features of large language models (LLMs) are the presence of large-norm (outlier) features and the tendency of tokens to attend very strongly to a select few tokens.
We show that attention sinks utilize outlier features to: catch a sequence of tokens, tag the captured tokens by applying a common perturbation, and then release the tokens back into the residual stream.
arXiv Detail & Related papers (2025-02-02T21:15:07Z) - Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing [56.66469232740998]
We show that Structured State Space Models (SSMs) are inherently limited by strong recency bias.
This bias impairs the models' ability to recall distant information and introduces robustness issues.
We propose to polarize two channels of the state-transition matrices in SSMs, setting them to zero and one respectively, which simultaneously addresses recency bias and over-smoothing (see the brief sketch after this list).
arXiv Detail & Related papers (2024-12-31T22:06:39Z) - A phase transition between positional and semantic learning in a solvable model of dot-product attention [30.96921029675713]
A solvable model of dot-product attention is studied: a non-linear self-attention layer with trainable, tied, and low-rank query and key matrices.
We show the emergence of either a positional attention mechanism (with tokens attending to each other based on their respective positions) or a semantic attention mechanism (with tokens attending to each other based on their meaning), and a transition from the former to the latter with increasing sample complexity.
arXiv Detail & Related papers (2024-02-06T11:13:54Z) - Understanding Masked Autoencoders via Hierarchical Latent Variable Models [109.35382136147349]
Masked autoencoder (MAE) has recently achieved prominent success in a variety of vision tasks.
Despite the emergence of intriguing empirical observations on MAE, a theoretically principled understanding is still lacking.
arXiv Detail & Related papers (2023-06-08T03:00:10Z) - Spreading of a local excitation in a Quantum Hierarchical Model [62.997667081978825]
We study the dynamics of the quantum Dyson hierarchical model in its paramagnetic phase.
An initial state made by a local excitation of the paramagnetic ground state is considered.
A localization mechanism is found and the excitation remains close to its initial position at arbitrary times.
arXiv Detail & Related papers (2022-07-14T10:05:20Z) - Realizing a dynamical topological phase in a trapped-ion quantum simulator [0.0]
Nascent platforms for programmable quantum simulation offer unprecedented access to new regimes of far-from-equilibrium quantum many-body dynamics.
We show how to create, protect, and manipulate quantum entanglement that self-corrects against large classes of errors.
Our work paves the way for implementation of more complex dynamical topological orders that would enable error-resilient techniques to manipulate quantum information.
arXiv Detail & Related papers (2021-07-20T18:00:00Z) - Subdiffusion via Disordered Quantum Walks [52.77024349608834]
We experimentally demonstrate the feasibility of using disordered quantum walks to realize a quantum simulator able to model general subdiffusive phenomena.
Our experiment simulates such phenomena by means of a finely controlled insertion of various levels of disorder during the evolution of the walker.
This allows us to explore the full range of subdiffusive behaviors, ranging from anomalous Anderson localization to normal diffusion.
arXiv Detail & Related papers (2020-07-24T13:56:09Z)
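As the sketch referenced in the state-space-model entry above, here is a minimal, hypothetical illustration of the channel-polarization idea, assuming a diagonal state-transition matrix: one channel of A is pinned to 0 (no memory) and one to 1 (perfect memory). Names, shapes, and the scan itself are illustrative, not that paper's implementation.

```python
# Illustrative diagonal SSM scan with two polarized state-transition channels.
import torch

def polarized_ssm_scan(x, A, B):
    """Diagonal linear SSM scan: h_t = A * h_{t-1} + B @ x_t.

    x: (seq_len, d_in); A: (d_state,); B: (d_state, d_in).
    Returns the stacked hidden states, shape (seq_len, d_state).
    """
    A = A.clone()
    # Polarize two channels of the diagonal state-transition matrix:
    # channel 0 forgets everything (zero), channel 1 integrates the full
    # history (one), as a toy rendering of the idea summarized above.
    A[0], A[1] = 0.0, 1.0
    h = torch.zeros(A.shape[0])
    states = []
    for t in range(x.shape[0]):
        h = A * h + B @ x[t]      # element-wise decay plus input projection
        states.append(h.clone())
    return torch.stack(states)

# Toy usage with hypothetical dimensions.
x = torch.randn(10, 4)                                    # 10 four-dim inputs
out = polarized_ssm_scan(x, torch.rand(6), torch.randn(6, 4))
print(out.shape)                                          # torch.Size([10, 6])
```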