Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis
- URL: http://arxiv.org/abs/2510.03366v1
- Date: Fri, 03 Oct 2025 04:13:06 GMT
- Title: Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis
- Authors: Harshwardhan Fartale, Ashish Kattamuri, Rahul Raja, Arpita Vats, Ishita Prasad, Akshata Kishore Moharir
- Abstract summary: Distinguishing recall from reasoning is crucial for predicting model generalization. We use controlled datasets of synthetic linguistic puzzles to probe transformer models at the layer, head, and neuron level. Our results provide the first causal evidence that recall and reasoning rely on separable but interacting circuits in transformer models.
- Score: 3.1526281887627587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based language models excel at both recall (retrieving memorized facts) and reasoning (performing multi-step inference), but whether these abilities rely on distinct internal mechanisms remains unclear. Distinguishing recall from reasoning is crucial for predicting model generalization, designing targeted evaluations, and building safer interventions that affect one ability without disrupting the other. We approach this question through mechanistic interpretability, using controlled datasets of synthetic linguistic puzzles to probe transformer models at the layer, head, and neuron level. Our pipeline combines activation patching and structured ablations to causally measure component contributions to each task type. Across two model families (Qwen and LLaMA), we find that interventions on distinct layers and attention heads lead to selective impairments: disabling identified "recall circuits" reduces fact-retrieval accuracy by up to 15% while leaving reasoning intact, whereas disabling "reasoning circuits" reduces multi-step inference by a comparable margin. At the neuron level, we observe task-specific firing patterns, though these effects are less robust, consistent with neuronal polysemanticity. Our results provide the first causal evidence that recall and reasoning rely on separable but interacting circuits in transformer models. These findings advance mechanistic interpretability by linking circuit-level structure to functional specialization and demonstrate how controlled datasets and causal interventions can yield mechanistic insights into model cognition, informing safer deployment of large language models.
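The pipeline the abstract describes, activation patching plus structured ablations of attention heads, can be sketched compactly. The example below zero-ablates one head in GPT-2 with a forward pre-hook and compares the log-probability of a factual completion before and after. GPT-2, the layer and head indices, and the prompt are illustrative stand-ins (the paper works with Qwen and LLaMA), and this shows only the ablation half of the pipeline:

```python
# Hedged sketch of a structured head ablation. Model, layer/head choice,
# and prompt are illustrative; the paper's pipeline also includes
# activation patching between clean and corrupted runs.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, HEAD = 5, 7                            # hypothetical "recall head"
head_dim = model.config.n_embd // model.config.n_head

def ablate_head(module, args):
    # c_proj's input is the concatenation of all head outputs, so zeroing
    # one contiguous slice removes exactly that head's contribution.
    hidden = args[0].clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0.0
    return (hidden,)

def next_token_logprob(prompt, target):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.log_softmax(logits, -1)[tok.encode(target)[0]].item()

prompt, answer = "The Eiffel Tower is located in", " Paris"
clean = next_token_logprob(prompt, answer)
hook = model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(ablate_head)
ablated = next_token_logprob(prompt, answer)
hook.remove()
print(f"log p of answer: clean={clean:.3f}, head-ablated={ablated:.3f}")
```

Sweeping this over all layers and heads, on matched recall and reasoning prompt sets, is the kind of scan that would surface selectively impaired components.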
Related papers
- Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units [34.05875226612676]
We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. We causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads.
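To first order, influence can be approximated as an inner product of parameter gradients (a TracIn-style simplification; the MDA paper's exact estimator may differ). In this hedged sketch, `unit_loss` is any scalar probe of an interpretable unit, such as a head's activation norm on a diagnostic input, and `train_loss` is the loss on one training sample:

```python
# TracIn-style first-order influence: the inner product of parameter
# gradients. An illustrative simplification, not MDA's exact estimator.
import torch

def influence_score(model, unit_loss, train_loss):
    """Approximate how much one training sample's gradient step would
    move a unit-level objective."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_unit = torch.autograd.grad(unit_loss, params,
                                 retain_graph=True, allow_unused=True)
    g_train = torch.autograd.grad(train_loss, params, allow_unused=True)
    return sum((gu * gt).sum()
               for gu, gt in zip(g_unit, g_train)
               if gu is not None and gt is not None).item()
```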
arXiv Detail & Related papers (2026-01-29T17:06:54Z)
- Context Copying Modulation: The Role of Entropy Neurons in Managing Parametric and Contextual Knowledge Conflicts [16.645800301676996]
We show that entropy neurons are responsible for suppressing context copying across a range of Large Language Models. These results enhance our understanding of the internal dynamics of LLMs when handling conflicting information.
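One concrete way to flag candidate entropy neurons, offered here as a hedged illustration rather than the paper's exact method, is a weight-space screen: a final-MLP neuron whose output direction lies mostly in the effective null space of the unembedding can raise output entropy without promoting any particular token.

```python
# Weight-space screen for candidate "entropy neurons" in GPT-2 (a stand-in
# model). The null-space dimension k is an illustrative choice.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
W_U = model.lm_head.weight.detach()                         # (vocab, d_model)
w_out = model.transformer.h[-1].mlp.c_proj.weight.detach()  # one row per MLP neuron

_, _, Vh = torch.linalg.svd(W_U, full_matrices=False)  # right singular vectors
k = 20                                   # "effective null space" size (assumption)
null_basis = Vh[-k:]                     # directions the unembedding barely reads

frac = (w_out @ null_basis.T).norm(dim=1) / w_out.norm(dim=1)
print("candidate entropy neurons:", frac.topk(5).indices.tolist())
```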
arXiv Detail & Related papers (2025-09-12T19:42:16Z)
- Selective Induction Heads: How Transformers Select Causal Structures In Context [50.09964990342878]
We introduce a novel framework that showcases transformers' ability to handle causal structures. Our framework varies the causal structure through interleaved Markov chains with different lags while keeping the transition probabilities fixed. This setting unveils the formation of Selective Induction Heads, a new circuit that endows transformers with the ability to select the correct causal structure in-context.
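That setup reduces to a compact data generator: a lag-k Markov chain in which token t is drawn from a fixed transition matrix conditioned on token t-k, so only the lag (the causal structure) differs across sequences. A minimal sketch with an illustrative vocabulary size:

```python
# Lag-k Markov chain generator: one shared transition matrix T, so the
# causal structure is determined purely by the lag k. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
V = 8                                     # vocabulary size (assumption)
T = rng.dirichlet(np.ones(V), size=V)     # fixed row-stochastic transitions

def sample_lag_k_chain(k, length):
    seq = list(rng.integers(0, V, size=k))          # k independent seed tokens
    for t in range(k, length):
        seq.append(rng.choice(V, p=T[seq[t - k]]))  # token t depends on token t-k
    return seq

print(sample_lag_k_chain(k=1, length=12))
print(sample_lag_k_chain(k=3, length=12))   # same T, different causal lag
```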
arXiv Detail & Related papers (2025-09-09T23:13:41Z)
- Causal Intervention Framework for Variational Auto Encoder Mechanistic Interpretability [0.0]
This paper introduces a comprehensive causal intervention framework for mechanistic interpretability of Variational Autoencoders (VAEs). We develop techniques to identify and analyze "circuit motifs" in VAEs, examining how semantic factors are encoded, processed, and disentangled through the network layers. Results show that our interventions can successfully isolate functional circuits, map computational graphs to causal graphs of semantic factors, and distinguish between polysemantic and monosemantic units.
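In the spirit of that framework, the basic operation is a do-style intervention on a latent coordinate. A minimal sketch, where `encoder` and `decoder` are assumed user-supplied VAE modules with the standard (mu, logvar) interface; the paper's actual intervention machinery is richer than this:

```python
# Minimal do(z[dim] = value) intervention on a VAE latent. `encoder` and
# `decoder` are assumed modules; this is not the paper's full framework.
import torch

@torch.no_grad()
def intervene_latent(encoder, decoder, x, dim, value):
    mu, logvar = encoder(x)    # approximate posterior parameters
    z = mu.clone()             # deterministic pass through the posterior mean
    z[:, dim] = value          # fix one latent coordinate
    return decoder(z)          # decode under the intervention
```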
arXiv Detail & Related papers (2025-05-06T13:40:59Z)
- Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning [9.795934690403374]
It is still unclear which multi-step reasoning mechanisms are used by language models to solve such tasks. We employ circuit analysis and self-influence functions to evaluate the changing importance of each token throughout the reasoning process. We demonstrate that the underlying circuits reveal a human-interpretable reasoning process used by the model.
arXiv Detail & Related papers (2025-02-13T07:19:05Z)
- Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
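An affine recurrence is x_{t+1} = a * x_t + b; the ICL task presents a prefix of such a sequence in context and asks the model to predict the next term. A toy generator (coefficient ranges and lengths are illustrative):

```python
# Toy affine-recurrence ICL data: continue x_{t+1} = a * x_t + b.
# Coefficient ranges and sequence length are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def affine_example(length=8):
    a, b = rng.uniform(-1, 1, size=2)
    x = [rng.uniform(-1, 1)]
    for _ in range(length - 1):
        x.append(a * x[-1] + b)
    return np.array(x[:-1]), x[-1]        # in-context prefix, next-term target

ctx, target = affine_example()
print(ctx, "->", target)
```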
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
- Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers [16.26331213222281]
We analyze the solutions simple transformer blocks implement when tackling the histogram task. This task reveals a complex interplay between predictive performance, vocabulary and embedding sizes, token-mixing mechanisms, and feed-forward layer capacity.
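The histogram task is simple to state as a generator: for each position, the label is how many times that position's token occurs in the whole sequence. Sizes below are illustrative:

```python
# Histogram task data: label each token with its multiplicity in the sequence.
import numpy as np

rng = np.random.default_rng(0)

def histogram_example(seq_len=10, vocab=5):
    x = rng.integers(0, vocab, size=seq_len)
    y = np.bincount(x, minlength=vocab)[x]   # per-position occurrence counts
    return x, y

x, y = histogram_example()
print("tokens:", x)
print("counts:", y)
```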
arXiv Detail & Related papers (2024-07-16T09:48:10Z)
- Understanding the Language Model to Solve the Symbolic Multi-Step Reasoning Problem from the Perspective of Buffer Mechanism [68.05754701230039]
We construct a symbolic multi-step reasoning task to investigate the information propagation mechanisms in Transformer models. We propose a random matrix-based algorithm to enhance the model's reasoning ability.
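One plausible instantiation, offered as a hedged guess since the paper's exact construction may differ, is a chain of symbol mappings the model must compose over several hops:

```python
# Hypothetical symbolic multi-hop task: compose shuffled facts A->B, B->C, ...
# to answer what the start symbol eventually maps to.
import random

def chain_example(num_hops=3, symbols="ABCDEFGH"):
    chain = random.sample(symbols, num_hops + 1)
    facts = [f"{a}->{b}" for a, b in zip(chain, chain[1:])]
    random.shuffle(facts)                 # facts are given out of order
    prompt = " ".join(facts) + f" Start: {chain[0]} End:"
    return prompt, chain[-1]              # model must answer the final symbol

print(chain_example())
```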
arXiv Detail & Related papers (2024-05-24T07:41:26Z)
- Estimating the Causal Effects of Natural Logic Features in Transformer-Based NLI Models [16.328341121232484]
We apply causal effect estimation strategies to measure the effect of context interventions.
We investigate Transformers' robustness to irrelevant changes and sensitivity to impactful changes.
arXiv Detail & Related papers (2024-04-03T10:22:35Z)
- Neural-Logic Human-Object Interaction Detection [67.4993347702353]
We present LOGICHOI, a new HOI detector that leverages neural-logic reasoning and Transformers to infer feasible interactions between entities.
Specifically, we modify the self-attention mechanism in the vanilla Transformer, enabling it to reason over the ⟨human, action, object⟩ triplet and constitute novel interactions.
We formulate these two properties in first-order logic and ground them into continuous space to constrain the learning process of our approach, leading to improved performance and zero-shot generalization capabilities.
arXiv Detail & Related papers (2023-11-16T11:47:53Z)
- Interpretable Imitation Learning with Dynamic Causal Relations [65.18456572421702]
We propose to expose captured knowledge in the form of a directed acyclic causal graph.
We also design this causal discovery process to be state-dependent, enabling it to model the dynamics in latent causal graphs.
The proposed framework is composed of three parts: a dynamic causal discovery module, a causality encoding module, and a prediction module, and is trained in an end-to-end manner.
arXiv Detail & Related papers (2023-09-30T20:59:42Z)
- Causal Analysis for Robust Interpretability of Neural Networks [0.2519906683279152]
We develop a robust intervention-based method to capture cause-effect mechanisms in pre-trained neural networks.
We apply our method to vision models trained on classification tasks.
arXiv Detail & Related papers (2023-05-15T18:37:24Z)
- Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective [72.55093886515824]
We introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables.
We devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a causal graph.
Experiment results on synthetic and real datasets show that our three proposed components significantly improve the robustness and reusability of the learned motion representations.
arXiv Detail & Related papers (2021-11-29T18:59:09Z)
- A Critical View of the Structural Causal Model [89.43277111586258]
We show that one can identify the cause and the effect without considering their interaction at all.
We propose a new adversarial training method that mimics the disentangled structure of the causal model.
Our multidimensional method outperforms existing methods on both synthetic and real-world datasets.
arXiv Detail & Related papers (2020-02-23T22:52:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.