Using Attention Sinks to Identify and Evaluate Dormant Heads in Pretrained LLMs
- URL: http://arxiv.org/abs/2504.03889v1
- Date: Fri, 04 Apr 2025 19:28:23 GMT
- Title: Using Attention Sinks to Identify and Evaluate Dormant Heads in Pretrained LLMs
- Authors: Pedro Sandoval-Segura, Xijun Wang, Ashwinee Panda, Micah Goldblum, Ronen Basri, Tom Goldstein, David Jacobs
- Abstract summary: We propose a new definition for attention heads dominated by attention sinks, known as dormant attention heads. More than 4% of a model's attention heads can be zeroed while maintaining average accuracy. Dormant heads emerge early in pretraining and can transition between dormant and active states during pretraining.
- Score: 77.43913758420948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-head attention is foundational to large language models (LLMs), enabling different heads to have diverse focus on relevant input tokens. However, learned behaviors like attention sinks, where the first token receives most attention despite limited semantic importance, challenge our understanding of multi-head attention. To analyze this phenomenon, we propose a new definition for attention heads dominated by attention sinks, known as dormant attention heads. We compare our definition to prior work in a model intervention study where we test whether dormant heads matter for inference by zeroing out the output of dormant attention heads. Using six pretrained models and five benchmark datasets, we find our definition to be more model and dataset-agnostic. Using our definition on most models, more than 4% of a model's attention heads can be zeroed while maintaining average accuracy, and zeroing more than 14% of a model's attention heads can keep accuracy to within 1% of the pretrained model's average accuracy. Further analysis reveals that dormant heads emerge early in pretraining and can transition between dormant and active states during pretraining. Additionally, we provide evidence that they depend on characteristics of the input text.
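The intervention described in the abstract can be pictured with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the paper's exact definition: it flags heads whose softmax attention mass concentrates on the first (sink) token and zeros those heads' outputs before the output projection. The 0.9 threshold, the averaging over batch and query positions, and the function names are assumptions made for illustration only.

```python
# Hypothetical sketch of sink-dominated ("dormant") head detection and zeroing.
# The dormancy criterion here (mean attention to token 0 > threshold) is an
# assumption for illustration, not the paper's definition.
import torch

def dormant_head_mask(attn_weights: torch.Tensor, sink_threshold: float = 0.9) -> torch.Tensor:
    """attn_weights: (batch, num_heads, query_len, key_len) post-softmax weights.
    Returns a boolean mask of shape (num_heads,) marking heads whose average
    attention on the first (sink) token exceeds `sink_threshold`."""
    # Average attention each head places on key position 0, over batch and queries.
    sink_mass = attn_weights[..., 0].mean(dim=(0, 2))  # (num_heads,)
    return sink_mass > sink_threshold

def zero_dormant_heads(per_head_out: torch.Tensor, dormant: torch.Tensor) -> torch.Tensor:
    """per_head_out: (batch, num_heads, seq_len, head_dim) per-head outputs before
    the output projection; dormant: (num_heads,) boolean mask of heads to zero."""
    keep = (~dormant).to(per_head_out.dtype).view(1, -1, 1, 1)
    return per_head_out * keep

if __name__ == "__main__":
    torch.manual_seed(0)
    attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)
    # Force head 3 to attend almost entirely to token 0 to simulate a sink.
    attn[:, 3] = torch.zeros_like(attn[:, 3])
    attn[:, 3, :, 0] = 1.0
    dormant = dormant_head_mask(attn)
    out = zero_dormant_heads(torch.randn(2, 8, 16, 64), dormant)
    print("dormant heads:", dormant.nonzero(as_tuple=False).flatten().tolist())
```

In practice such a mask would be applied inside the attention module of a pretrained model (e.g., via forward hooks), so that the flagged heads contribute nothing to the residual stream while the remaining heads are left untouched.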
Related papers
- Retrieval Head Mechanistically Explains Long-Context Factuality [56.78951509492645]
We show that a special type of attention heads are largely responsible for retrieving information, which we dub retrieval heads.
We show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context.
We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.
arXiv Detail & Related papers (2024-04-24T00:24:03Z) - Beyond Confidence: Reliable Models Should Also Consider Atypicality [43.012818086415514]
We investigate the relationship between how atypical (rare) a sample or a class is and the reliability of a model's predictions.
We show that predictions for atypical inputs or atypical classes are more overconfident and have lower accuracy.
We propose that models should use not only confidence but also atypicality to improve uncertainty quantification and performance.
arXiv Detail & Related papers (2023-05-29T17:37:09Z) - Revisiting Attention Weights as Explanations from an Information Theoretic Perspective [4.499369811647602]
Our findings indicate that attention mechanisms do have the potential to function as a shortcut to model explanations when they are carefully combined with other model elements.
arXiv Detail & Related papers (2022-10-31T12:53:20Z) - Do Transformer Models Show Similar Attention Patterns to Task-Specific Human Gaze? [0.0]
Self-attention functions in state-of-the-art NLP models often correlate with human attention.
We investigate whether self-attention in large-scale pre-trained language models is as predictive of human eye fixation patterns during task-reading as classical cognitive models of human attention.
arXiv Detail & Related papers (2022-04-25T08:23:13Z) - Your "Attention" Deserves Attention: A Self-Diversified Multi-Channel Attention for Facial Action Analysis [12.544285462327839]
We propose a compact model to enhance the representational and focusing power of neural attention maps.
The proposed method is evaluated on two benchmark databases (BP4D and DISFA) for AU detection and four databases (CK+, MMI, BU-3DFE, and BP4D+) for facial expression recognition.
It achieves superior performance compared to the state-of-the-art methods.
arXiv Detail & Related papers (2022-03-23T17:29:51Z) - Attention cannot be an Explanation [99.37090317971312]
We ask how effective attention-based explanations are in increasing human trust and reliance in the underlying models.
We perform extensive human study experiments that aim to qualitatively and quantitatively assess the degree to which attention based explanations are suitable.
Our experiment results show that attention cannot be used as an explanation.
arXiv Detail & Related papers (2022-01-26T21:34:05Z) - Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z) - The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT [18.13834903235249]
Multi-headed attention heads are a mainstay in transformer-based models.
Different methods have been proposed to classify the role of each attention head based on the relations between tokens that have high pair-wise attention.
We formalize a simple yet effective score that generalizes to all the roles of attention heads and employs hypothesis testing on this score for robust inference.
arXiv Detail & Related papers (2021-01-22T14:10:59Z) - How Well Do Self-Supervised Models Transfer? [92.16372657233394]
We evaluate the transfer performance of 13 top self-supervised models on 40 downstream tasks.
We find ImageNet Top-1 accuracy to be highly correlated with transfer to many-shot recognition.
No single self-supervised method dominates overall, suggesting that universal pre-training is still unsolved.
arXiv Detail & Related papers (2020-11-26T16:38:39Z)