Related papers: Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers

URL: http://arxiv.org/abs/2505.13737v2
Date: Thu, 23 Oct 2025 23:48:30 GMT
Title: Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers
Authors: Andrew Nam, Henry Conklin, Yukang Yang, Thomas Griffiths, Jonathan Cohen, Sarah-Jane Leslie
Abstract summary: We present causal head gating (CHG), a scalable method for interpreting the functional roles of attention heads in transformer models. CHG learns soft gates over heads and assigns them a causal taxonomy based on their impact on task performance. We show that CHG scores yield causal, not merely correlational, insight validated via ablation and causal mediation analyses.
Score: 3.9274867826451323
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present causal head gating (CHG), a scalable method for interpreting the functional roles of attention heads in transformer models. CHG learns soft gates over heads and assigns them a causal taxonomy - facilitating, interfering, or irrelevant - based on their impact on task performance. Unlike prior approaches in mechanistic interpretability, which are hypothesis-driven and require prompt templates or target labels, CHG applies directly to any dataset using standard next-token prediction. We evaluate CHG across multiple large language models (LLMs) in the Llama 3 model family and diverse tasks, including syntax, commonsense, and mathematical reasoning, and show that CHG scores yield causal, not merely correlational, insight validated via ablation and causal mediation analyses. We also introduce contrastive CHG, a variant that isolates sub-circuits for specific task components. Our findings reveal that LLMs contain multiple sparse task-sufficient sub-circuits, that individual head roles depend on interactions with others (low modularity), and that instruction following and in-context learning rely on separable mechanisms.
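As a rough illustration of the idea in the abstract, the sketch below shows soft gating of per-head outputs and one possible way to map fitted gate values onto the facilitating / interfering / irrelevant taxonomy. The abstract does not spell out the fitting procedure or the labelling rule, so everything here is an assumption made for illustration: the function names `gated_head_sum` and `classify_head`, the idea of comparing two fits (one that encourages gates to stay open, one that encourages them to close), and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Map an unconstrained gate logit to a soft gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_head_sum(head_outputs, gate_logits):
    """Scale each head's output vector by its soft gate, then sum over heads.

    head_outputs: list of per-head output vectors (lists of floats, same length).
    gate_logits:  one learnable logit per head (hypothetical parameterisation).
    """
    gates = [sigmoid(g) for g in gate_logits]
    dim = len(head_outputs[0])
    return [sum(g * h[i] for g, h in zip(gates, head_outputs)) for i in range(dim)]


def classify_head(gate_open_fit, gate_closed_fit, threshold=0.5):
    """Hypothetical labelling rule, assumed for illustration only.

    gate_open_fit:   gate value from a fit that encourages gates to stay open.
    gate_closed_fit: gate value from a fit that encourages gates to close.
    """
    if gate_open_fit >= threshold and gate_closed_fit >= threshold:
        return "facilitating"  # stays open even when pushed to close
    if gate_open_fit < threshold and gate_closed_fit < threshold:
        return "interfering"   # closes even when pushed to stay open
    return "irrelevant"        # no consistent task-driven gate setting


# Two heads with 2-dimensional outputs; logits of 0 give gates of 0.5.
mixed = gated_head_sum([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
labels = [classify_head(0.9, 0.8), classify_head(0.1, 0.05), classify_head(0.9, 0.1)]
```

In a real model the gate logits would be trained by gradient descent on the next-token prediction loss with the base model's weights frozen; that training loop is omitted here.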

Related papers

Reasoning-Driven Multimodal LLM for Domain Generalization [72.00754603114187]
We study the role of reasoning in domain generalization using the DomainBed-Reasoning dataset. We propose RD-MLDG, a framework with two components: MTCT (Multi-Task Cross-Training) and SARR (Self-Aligned Reasoning Regularization). Experiments on standard DomainBed datasets demonstrate that RD-MLDG achieves complementary state-of-the-art performances.
arXiv Detail & Related papers (2026-02-27T08:10:06Z)
Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units [34.05875226612676]
We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. We causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads.
arXiv Detail & Related papers (2026-01-29T17:06:54Z)
Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models [66.36240676392502]
Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. Recent studies reveal a sharp performance drop in reasoning hop generalization scenarios. We propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process.
arXiv Detail & Related papers (2026-01-29T03:24:32Z)
Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective [60.45433515408158]
We show that long Chain-of-Thought (CoT) serves as a decisive decision-maker for the top option but fails to function as a granular distribution calibrator for ambiguous tasks. We observe a distinct "decoupled mechanism": while CoT improves distributional alignment, final accuracy is dictated by CoT content.
arXiv Detail & Related papers (2026-01-06T16:26:40Z)
Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers [0.10152838128195467]
We train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification task. A single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking residuals and normalization layers. A two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions.
arXiv Detail & Related papers (2025-10-28T22:25:19Z)
DEAL: Disentangling Transformer Head Activations for LLM Steering [19.770342907146965]
We propose a principled causal-attribution framework for identifying behavior-relevant attention heads in transformers. For each head, we train a vector-quantized autoencoder (VQ-AE) on its attention activations. We assess the behavioral relevance of each head by the separability of VQ-AE encodings for behavior-aligned versus behavior-violating responses.
arXiv Detail & Related papers (2025-06-10T02:16:50Z)
Don't Take Things Out of Context: Attention Intervention for Enhancing Chain-of-Thought Reasoning in Large Language Models [32.71672086718058]
Few-shot Chain-of-Thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs). We observe that isolated segments, words, or tokens within CoT demonstrations can unexpectedly disrupt the generation process of LLMs. We propose a Few-shot Attention Intervention method (FAI) that dynamically analyzes the attention patterns of demonstrations to accurately identify these tokens.
arXiv Detail & Related papers (2025-03-14T07:46:33Z)
A Cooperative Multi-Agent Framework for Zero-Shot Named Entity Recognition [71.61103962200666]
Zero-shot named entity recognition (NER) aims to develop entity recognition systems from unannotated text corpora. Recent work has adapted large language models (LLMs) for zero-shot NER by crafting specialized prompt templates. We introduce the cooperative multi-agent system (CMAS), a novel framework for zero-shot NER.
arXiv Detail & Related papers (2025-02-25T23:30:43Z)
Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps [3.8936716676293917]
This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data. We identify a critical parameter threshold (1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning.
arXiv Detail & Related papers (2025-02-21T00:48:32Z)
CausalGym: Benchmarking causal interpretability methods on linguistic tasks [52.61917615039112]
We use CausalGym to benchmark the ability of interpretability methods to causally affect model behaviour. We study the pythia models (14M--6.9B) and assess the causal efficacy of a wide range of interpretability methods. We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena.
arXiv Detail & Related papers (2024-02-19T21:35:56Z)
A Unified Causal View of Instruction Tuning [76.1000380429553]
We develop a meta Structural Causal Model (meta-SCM) to integrate different NLP tasks under a single causal structure of the data. The key idea is to learn task-required causal factors and only use those to make predictions for a given task.
arXiv Detail & Related papers (2024-02-09T07:12:56Z)
A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis [128.0532113800092]
We present a mechanistic interpretation of Transformer-based LMs on arithmetic questions. This provides insights into how information related to arithmetic is processed by LMs.
arXiv Detail & Related papers (2023-05-24T11:43:47Z)
Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application [21.161850569358776]
Self-attention mechanisms have achieved great success in many fields such as computer vision and natural language processing. Many existing vision transformer (ViT) works simply inherit transformer designs from NLP to adapt to vision tasks. This paper introduces a typical image processing technique, which maps low-level representations into mid-level spaces, and annotates extensive discrete keypoints with semantically rich information.
arXiv Detail & Related papers (2022-11-13T15:18:31Z)
ER: Equivariance Regularizer for Knowledge Graph Completion [107.51609402963072]
We propose a new regularizer, namely, Equivariance Regularizer (ER). ER can enhance the generalization ability of the model by employing the semantic equivariance between the head and tail entities. The experimental results indicate a clear and substantial improvement over the state-of-the-art relation prediction methods.
arXiv Detail & Related papers (2022-06-24T08:18:05Z)
Towards Unsupervised Content Disentanglement in Sentence Representations via Syntactic Roles [0.9582466286528458]
We develop an Attention-Driven Variational Autoencoder (ADVAE). We show that it is possible to obtain representations of sentences where different syntactic roles correspond to clearly identified latent variables. Our work constitutes a first step towards unsupervised controllable content generation.
arXiv Detail & Related papers (2022-06-22T15:50:01Z)
Effect Identification in Cluster Causal Diagrams [51.42809552422494]
We introduce a new type of graphical model called cluster causal diagrams (for short, C-DAGs). C-DAGs allow for the partial specification of relationships among variables based on limited prior knowledge. We develop the foundations and machinery for valid causal inferences over C-DAGs.
arXiv Detail & Related papers (2022-02-22T21:27:31Z)
The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT [18.13834903235249]
Multi-headed attention heads are a mainstay in transformer-based models. Different methods have been proposed to classify the role of each attention head based on the relations between tokens which have high pair-wise attention. We formalize a simple yet effective score that generalizes to all the roles of attention heads and employs hypothesis testing on this score for robust inference.
arXiv Detail & Related papers (2021-01-22T14:10:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.