Identifying and Evaluating Inactive Heads in Pretrained LLMs
- URL: http://arxiv.org/abs/2504.03889v3
- Date: Wed, 08 Oct 2025 19:58:05 GMT
- Title: Identifying and Evaluating Inactive Heads in Pretrained LLMs
- Authors: Pedro Sandoval-Segura, Xijun Wang, Ashwinee Panda, Micah Goldblum, Ronen Basri, Tom Goldstein, David Jacobs
- Abstract summary: We propose a taxonomy of 13 score functions that measure different ways a head can be inactive. More than 12% of attention heads are inactive on average, and can be ablated in specific contexts. We show how measuring score distributions can provide insights into attention behavior.
- Score: 74.93559410792646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention is foundational to large language models (LLMs), enabling different heads to have diverse focus on relevant input tokens. However, learned behaviors like attention sinks, where the first token receives the most attention despite limited semantic importance, suggest some heads may be inactive, and point to a significant source of computational redundancy. To analyze this phenomenon, we propose a taxonomy of 13 score functions that measure different ways a head can be inactive. Thresholding these scores allows us to analyze different sets of potentially inactive attention heads. We evaluate whether identified heads are inactive through model interventions, finding that more than 12% of attention heads are inactive on average, and can be ablated in specific contexts while maintaining MMLU accuracy to within 1% of the pretrained LLM. Across 3 model families, our score functions that measure the average norm of a head's output consistently identify inactive heads that would not have been found by score functions that rely solely on attention weights. We establish that relying on a score function that measures a first token attention sink would underestimate the prevalence of inactive heads, failing to identify more than 7% of inactive heads on average. We also show how measuring score distributions can provide insights into attention behavior. For instance, we find evidence that finetuning causes little to no change in attention behavior, and that even within the same model family, large model scales present markedly different attention behaviors.
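As a rough illustration of one attention-weight-based score from the abstract, the sketch below computes a first-token "attention sink" score for every head of a small pretrained model and flags heads above a threshold. The model choice, the 0.9 threshold, and this exact score definition are illustrative assumptions, not the paper's released code or its full taxonomy of 13 score functions.

```python
# Minimal sketch: flag heads whose attention mass piles onto the first token.
# Assumptions (not from the paper): gpt2 as a stand-in model, a 0.9 threshold,
# and this particular averaging as the sink-score definition.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper analyzes larger pretrained LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Attention sinks concentrate most of the attention mass on the first token."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, shape [batch, num_heads, seq_len, seq_len].
for layer_idx, attn in enumerate(outputs.attentions):
    # First-token sink score: attention each head places on token 0,
    # averaged over all query positions in the sequence.
    sink_score = attn[0, :, :, 0].mean(dim=-1)  # shape: [num_heads]

    # Heads above the threshold put nearly all of their mass on the first token
    # and are candidates for being "inactive" under this score function.
    sink_heads = (sink_score > 0.9).nonzero(as_tuple=True)[0].tolist()
    if sink_heads:
        print(f"layer {layer_idx}: sink-like heads {sink_heads}")
```

Note that this sink score is exactly the kind of attention-weight-only criterion the abstract argues is insufficient: the norm-based scores it describes instead read each head's output vector (e.g., via forward hooks before the attention output projection), and verifying inactivity means ablating the flagged heads and checking that MMLU accuracy stays within about 1% of the pretrained model.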
Related papers
- Knocking-Heads Attention [36.56180929159062]
Multi-head attention (MHA) has become the cornerstone of modern large language models. Increasing the number of heads inherently weakens individual head capacity. We propose knocking-heads attention (KHA) which enables attention heads to "knock" on each other.
arXiv Detail & Related papers (2025-10-27T06:28:58Z)
- Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs [129.79394562739705]
Large language models (LLMs) exhibit impressive fluency, but often produce critical errors known as "hallucinations". We propose RAUQ (Recurrent Attention-based Uncertainty Quantification), an unsupervised approach that leverages intrinsic attention patterns in transformers to detect hallucinations efficiently. Experiments across 4 LLMs and 12 question answering, summarization, and translation tasks demonstrate that RAUQ yields excellent results.
arXiv Detail & Related papers (2025-05-26T14:28:37Z)
- AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection [9.555955025064895]
We propose AttentionInfluence to identify reasoning-intensive pretraining data. Our approach enables a small pretrained language model to act as a strong data selector through a simple attention head masking operation. Our experimental results demonstrate substantial improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive and reasoning-heavy benchmarks.
arXiv Detail & Related papers (2025-05-12T07:25:51Z)
- Don't Take Things Out of Context: Attention Intervention for Enhancing Chain-of-Thought Reasoning in Large Language Models [32.71672086718058]
Few-shot Chain-of-Thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs). We observe that isolated segments, words, or tokens within CoT demonstrations can unexpectedly disrupt the generation process of LLMs. We propose a Few-shot Attention Intervention method (FAI) that dynamically analyzes the attention patterns of demonstrations to accurately identify these tokens.
arXiv Detail & Related papers (2025-03-14T07:46:33Z)
- Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration [15.36841874118801]
We aim to provide a more profound understanding of the existence of attention sinks within large language models (LLMs).
We propose a training-free Attention Calibration Technique (ACT) that automatically optimizes the attention distributions on the fly during inference in an input-adaptive manner.
ACT achieves an average improvement of up to 7.30% in accuracy across different datasets when applied to Llama-30B.
arXiv Detail & Related papers (2024-06-22T07:00:43Z)
- Retrieval Head Mechanistically Explains Long-Context Factuality [56.78951509492645]
We show that a special type of attention head, which we dub retrieval heads, is largely responsible for retrieving information.
We show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context.
We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.
arXiv Detail & Related papers (2024-04-24T00:24:03Z)
- Beyond Confidence: Reliable Models Should Also Consider Atypicality [43.012818086415514]
We investigate the relationship between how atypical (rare) a sample or a class is and the reliability of a model's predictions.
We show that predictions for atypical inputs or atypical classes are more overconfident and have lower accuracy.
We propose that models should use not only confidence but also atypicality to improve uncertainty quantification and performance.
arXiv Detail & Related papers (2023-05-29T17:37:09Z)
- Revisiting Attention Weights as Explanations from an Information Theoretic Perspective [4.499369811647602]
Our findings indicate that attention mechanisms do have the potential to function as a shortcut to model explanations when they are carefully combined with other model elements.
arXiv Detail & Related papers (2022-10-31T12:53:20Z)
- Do Transformer Models Show Similar Attention Patterns to Task-Specific Human Gaze? [0.0]
Self-attention functions in state-of-the-art NLP models often correlate with human attention.
We investigate whether self-attention in large-scale pre-trained language models is as predictive of human eye fixation patterns during task-reading as classical cognitive models of human attention.
arXiv Detail & Related papers (2022-04-25T08:23:13Z)
- Your "Attention" Deserves Attention: A Self-Diversified Multi-Channel Attention for Facial Action Analysis [12.544285462327839]
We propose a compact model to enhance the representational and focusing power of neural attention maps.
The proposed method is evaluated on two benchmark databases (BP4D and DISFA) for AU detection and four databases (CK+, MMI, BU-3DFE, and BP4D+) for facial expression recognition.
It achieves superior performance compared to the state-of-the-art methods.
arXiv Detail & Related papers (2022-03-23T17:29:51Z)
- Attention cannot be an Explanation [99.37090317971312]
We ask: how effective are attention-based explanations in increasing human trust and reliance on the underlying models?
We perform extensive human study experiments that aim to qualitatively and quantitatively assess the degree to which attention-based explanations are suitable.
Our experimental results show that attention cannot be used as an explanation.
arXiv Detail & Related papers (2022-01-26T21:34:05Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any models with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
arXiv Detail & Related papers (2021-10-25T00:54:57Z)
- Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification [101.49122450005869]
We present a counterfactual attention learning method to learn more effective attention based on causal inference.
Specifically, we analyze the effect of the learned visual attention on network prediction.
We evaluate our method on a wide range of fine-grained recognition tasks.
arXiv Detail & Related papers (2021-08-19T14:53:40Z)
- Is Sparse Attention more Interpretable? [52.85910570651047]
We investigate how sparsity affects our ability to use attention as an explainability tool.
We find that only a weak relationship exists between inputs and co-indexed intermediate representations under sparse attention.
We observe in this setting that inducing sparsity may make it less plausible that attention can be used as a tool for understanding model behavior.
arXiv Detail & Related papers (2021-06-02T11:42:56Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT [18.13834903235249]
Multi-headed attention heads are a mainstay in transformer-based models.
Different methods have been proposed to classify the role of each attention head based on the relations between tokens which have high pair-wise attention.
We formalize a simple yet effective score that generalizes to all the roles of attention heads, and employ hypothesis testing on this score for robust inference.
arXiv Detail & Related papers (2021-01-22T14:10:59Z)
- How Well Do Self-Supervised Models Transfer? [92.16372657233394]
We evaluate the transfer performance of 13 top self-supervised models on 40 downstream tasks.
We find ImageNet Top-1 accuracy to be highly correlated with transfer to many-shot recognition.
No single self-supervised method dominates overall, suggesting that universal pre-training is still unsolved.
arXiv Detail & Related papers (2020-11-26T16:38:39Z)