Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
- URL: http://arxiv.org/abs/2410.13835v2
- Date: Thu, 07 Nov 2024 16:57:02 GMT
- Title: Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
- Authors: Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei
- Abstract summary: Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
- Score: 77.66717051042032
- Abstract: Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.
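For concreteness, the following is a minimal PyTorch sketch (not the authors' code) of two ingredients mentioned in the abstract: a simple measure of how much attention mass queries place on a candidate sink token, taken here to be position 0, and a ReLU variant of the attention normalization in place of softmax, one of the proposed mitigations. The tensor shapes, the sink_mass metric, and the exact ReLU formulation are illustrative assumptions.

```python
# Minimal sketch: sink-mass measurement and a ReLU attention variant.
# Shapes, function names, and the ReLU formulation are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def attention_weights(q, k, use_relu=False):
    """q, k: (seq_len, head_dim). Returns a (seq_len, seq_len) causal attention map."""
    scores = q @ k.T / k.shape[-1] ** 0.5                # scaled dot-product scores
    causal = torch.tril(torch.ones_like(scores)).bool()  # causal mask
    if use_relu:
        # ReLU attention: weights stay non-negative but rows need not sum to 1,
        # so no token is forced to absorb leftover probability mass.
        return F.relu(scores) * causal
    scores = scores.masked_fill(~causal, float("-inf"))
    return F.softmax(scores, dim=-1)                      # standard softmax attention

def sink_mass(attn):
    """Average attention weight that queries place on token 0 (a typical sink position)."""
    return attn[:, 0].mean().item()

seq_len, head_dim = 16, 8
q, k = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
print("softmax sink mass:", sink_mass(attention_weights(q, k)))
print("relu    sink mass:", sink_mass(attention_weights(q, k, use_relu=True)))
```

Because ReLU rows are not constrained to sum to one, excess attention need not be dumped onto a single token, which is one intuition for why this substitution can suppress attention sinks.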
Related papers
- Systematic Outliers in Large Language Models [41.2150163753952]
Outliers have been widely observed in Large Language Models (LLMs).
We provide a detailed analysis of the formation process, underlying causes, and functions of outliers in LLMs.
arXiv Detail & Related papers (2025-02-10T12:54:17Z) - On the Emergence of Position Bias in Transformers [59.87743433861665]
This paper introduces a novel graph-theoretic framework to analyze position bias in multi-layer attention.
We quantify how tokens interact with contextual information based on their sequential positions.
Our framework offers a principled foundation for understanding positional biases in transformers.
arXiv Detail & Related papers (2025-02-04T02:53:07Z) - Attention Sinks and Outlier Features: A 'Catch, Tag, and Release' Mechanism for Embeddings [4.30907936718325]
Two prominent features of large language models (LLMs) are the presence of large-norm (outlier) features and the tendency of tokens to attend very strongly to a select few tokens.
We show that attention sinks utilize outlier features to: catch a sequence of tokens, tag the captured tokens by applying a common perturbation, and then release the tokens back into the residual stream.
arXiv Detail & Related papers (2025-02-02T21:15:07Z) - Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing [56.66469232740998]
We show that Structured State Space Models (SSMs) are inherently limited by strong recency bias.
This bias impairs the models' ability to recall distant information and introduces robustness issues.
We propose to polarize two channels of the state-transition matrices in SSMs, setting them to zero and one respectively, which simultaneously addresses recency bias and over-smoothing (see the brief sketch after this list).
arXiv Detail & Related papers (2024-12-31T22:06:39Z) - A phase transition between positional and semantic learning in a solvable model of dot-product attention [30.96921029675713]
A solvable model of dot-product attention is studied: a non-linear self-attention layer with trainable, tied, and low-rank query and key matrices.
We show the emergence of either a positional attention mechanism (with tokens attending to each other based on their respective positions) or a semantic attention mechanism (with tokens attending to each other based on their meaning), and a transition from the former to the latter with increasing sample complexity.
arXiv Detail & Related papers (2024-02-06T11:13:54Z) - Understanding Masked Autoencoders via Hierarchical Latent Variable Models [109.35382136147349]
Masked autoencoder (MAE) has recently achieved prominent success in a variety of vision tasks.
Despite the emergence of intriguing empirical observations on MAE, a theoretically principled understanding is still lacking.
arXiv Detail & Related papers (2023-06-08T03:00:10Z) - Spreading of a local excitation in a Quantum Hierarchical Model [62.997667081978825]
We study the dynamics of the quantum Dyson hierarchical model in its paramagnetic phase.
An initial state made by a local excitation of the paramagnetic ground state is considered.
A localization mechanism is found and the excitation remains close to its initial position at arbitrary times.
arXiv Detail & Related papers (2022-07-14T10:05:20Z) - Realizing a dynamical topological phase in a trapped-ion quantum simulator [0.0]
Nascent platforms for programmable quantum simulation offer unprecedented access to new regimes of far-from-equilibrium quantum many-body dynamics.
We show how to create, protect, and manipulate quantum entanglement that self-corrects against large classes of errors.
Our work paves the way for implementation of more complex dynamical topological orders that would enable error-resilient techniques to manipulate quantum information.
arXiv Detail & Related papers (2021-07-20T18:00:00Z) - Subdiffusion via Disordered Quantum Walks [52.77024349608834]
We experimentally demonstrate the feasibility of using disordered quantum walks to realize a quantum simulator able to model general subdiffusive phenomena.
Our experiment simulates such phenomena by means of a finely controlled insertion of various levels of disorder during the evolution of the walker.
This allows us to explore the full range of subdiffusive behaviors, ranging from anomalous Anderson localization to normal diffusion.
arXiv Detail & Related papers (2020-07-24T13:56:09Z)
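As the sketch referenced in the state-space-model entry above, here is a minimal, hypothetical illustration of the channel-polarization idea, assuming a diagonal state-transition matrix: one channel of A is pinned to 0 (no memory) and one to 1 (perfect memory). Names, shapes, and the scan itself are illustrative, not that paper's implementation.

```python
# Illustrative diagonal SSM scan with two polarized state-transition channels.
import torch

def polarized_ssm_scan(x, A, B):
    """Diagonal linear SSM scan: h_t = A * h_{t-1} + B @ x_t.

    x: (seq_len, d_in); A: (d_state,); B: (d_state, d_in).
    Returns the stacked hidden states, shape (seq_len, d_state).
    """
    A = A.clone()
    # Polarize two channels of the diagonal state-transition matrix:
    # channel 0 forgets everything (zero), channel 1 integrates the full
    # history (one), as a toy rendering of the idea summarized above.
    A[0], A[1] = 0.0, 1.0
    h = torch.zeros(A.shape[0])
    states = []
    for t in range(x.shape[0]):
        h = A * h + B @ x[t]      # element-wise decay plus input projection
        states.append(h.clone())
    return torch.stack(states)

# Toy usage with hypothetical dimensions.
x = torch.randn(10, 4)                                    # 10 four-dim inputs
out = polarized_ssm_scan(x, torch.rand(6), torch.randn(6, 4))
print(out.shape)                                          # torch.Size([10, 6])
```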