Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
- URL: http://arxiv.org/abs/2410.13835v2
- Date: Thu, 07 Nov 2024 16:57:02 GMT
- Title: Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
- Authors: Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei
- Abstract summary: Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
- Score: 77.66717051042032
- Abstract: Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability. We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.
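The abstract's link between attention sinks, small value states, and the proposed softmax-to-ReLU swap can be illustrated with a minimal sketch. This is not the authors' code: the logits, the scalar value states, and the division by sequence length in `relu_weights` are made-up assumptions chosen only to show the idea. Softmax weights must sum to 1, so a head that should contribute nothing has to place its attention somewhere, for example on a token whose value state is near zero. ReLU weights carry no such constraint.

```python
import math

def softmax_weights(scores):
    """Softmax attention weights: nonnegative and always summing to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_weights(scores):
    """ReLU attention weights; dividing by length is an assumed normalization."""
    n = len(scores)
    return [max(s, 0.0) / n for s in scores]

def attend(weights, values):
    """Weighted sum of scalar value states."""
    return sum(w * v for w, v in zip(weights, values))

# Toy head: token 0 plays the sink (high logit, near-zero value state).
scores = [4.0, -1.0, -1.0, -1.0]   # made-up attention logits
values = [0.01, 1.0, -2.0, 1.5]    # made-up scalar value states

soft = softmax_weights(scores)
sink_mass = soft[0]                # share of attention on the sink token
soft_out = attend(soft, values)    # small, because the sink's value is tiny

# With every logit negative, ReLU weights are all zero: no sink is needed.
relu = relu_weights([-1.0, -1.0, -1.0, -1.0])
relu_out = attend(relu, values)
```

In this toy setting the softmax head puts almost all of its weight on token 0 and still produces a near-zero output, while the ReLU head reaches zero output without concentrating attention anywhere.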
Related papers
- Mamba-PTQ: Outlier Channels in Recurrent Large Language Models [49.1574468325115]
We show that Mamba models exhibit the same pattern of outlier channels observed in attention-based LLMs.
We show that the difficulty of quantizing SSMs is caused by activation outliers, similar to those observed in transformer-based LLMs.
arXiv Detail & Related papers (2024-07-17T08:21:06Z)
- A phase transition between positional and semantic learning in a solvable model of dot-product attention [30.96921029675713]
Dot-product attention is studied as a non-linear self-attention layer with trainable, low-rank query and key matrices.
We show that the model learns either a positional attention mechanism (with tokens attending to each other based on their respective positions) or a semantic attention mechanism (with tokens attending to each other based on their meaning), with a transition from the former to the latter as sample complexity increases.
arXiv Detail & Related papers (2024-02-06T11:13:54Z)
- Entanglement Dynamics in Monitored Systems and the Role of Quantum Jumps [0.0]
We study the effect of quantum jumps on the entanglement dynamics beyond the no-click limit corresponding to a deterministic non-Hermitian evolution.
We show that significant deviations from the no-click limit arise whenever quantum jumps strongly renormalize the non-Hermitian dynamics.
arXiv Detail & Related papers (2023-12-20T20:44:18Z)
- Understanding Masked Autoencoders via Hierarchical Latent Variable Models [109.35382136147349]
Masked autoencoder (MAE) has recently achieved prominent success in a variety of vision tasks.
Despite the emergence of intriguing empirical observations on MAE, a theoretically principled understanding is still lacking.
arXiv Detail & Related papers (2023-06-08T03:00:10Z)
- Scrambling and operator entanglement in local non-Hermitian quantum systems [0.0]
We study information scrambling and quantum chaos in non-Hermitian variants of paradigmatic local quantum spin-chain models.
We extend operator entanglement based diagnostics from previous works on closed and open quantum systems to the new arena of monitored quantum dynamics.
arXiv Detail & Related papers (2023-05-20T01:35:38Z)
- Universality of critical dynamics with finite entanglement [68.8204255655161]
We study how low-energy dynamics of quantum systems near criticality are modified by finite entanglement.
Our result establishes the precise role played by entanglement in time-dependent critical phenomena.
arXiv Detail & Related papers (2023-01-23T19:23:54Z)
- Strong coupling, weak impact: Phonon coupling versus pure dephasing in the photon statistics of cooperative emitters [0.0]
We show how access to weaker dephasing mechanisms can be obtained for optically active qubits by performing two-photon coincidence measurements.
We focus on the typically dominant deformation-potential coupling to longitudinal acoustic phonons.
Surprisingly, the impact of the strongly coupled phonon environment is weak and leads to long-lived coherences.
arXiv Detail & Related papers (2022-08-30T21:38:27Z)
- Spreading of a local excitation in a Quantum Hierarchical Model [62.997667081978825]
We study the dynamics of the quantum Dyson hierarchical model in its paramagnetic phase.
An initial state made by a local excitation of the paramagnetic ground state is considered.
A localization mechanism is found and the excitation remains close to its initial position at arbitrary times.
arXiv Detail & Related papers (2022-07-14T10:05:20Z)
- Realizing a dynamical topological phase in a trapped-ion quantum simulator [0.0]
Nascent platforms for programmable quantum simulation offer unprecedented access to new regimes of far-from-equilibrium quantum many-body dynamics.
We show how to create, protect, and manipulate quantum entanglement that self-corrects against large classes of errors.
Our work paves the way for implementation of more complex dynamical topological orders that would enable error-resilient techniques to manipulate quantum information.
arXiv Detail & Related papers (2021-07-20T18:00:00Z)
- Subdiffusion via Disordered Quantum Walks [52.77024349608834]
We experimentally prove the feasibility of disordered quantum walks to realize a quantum simulator that is able to model general subdiffusive phenomena.
Our experiment simulates such phenomena by means of a finely controlled insertion of various levels of disorder during the evolution of the walker.
This allows us to explore the full range of subdiffusive behaviors, ranging from anomalous Anderson localization to normal diffusion.
arXiv Detail & Related papers (2020-07-24T13:56:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.