Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits
- URL: http://arxiv.org/abs/2511.20273v1
- Date: Tue, 25 Nov 2025 12:59:15 GMT
- Title: Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits
- Authors: Areeb Ahmad, Abhinav Joshi, Ashutosh Modi
- Abstract summary: Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing interpretability methods treat attention heads and multilayer perceptron layers (MLPs) as indivisible units, overlooking possibilities of functional substructure learned within them. We introduce a more fine-grained perspective that decomposes these components into singular directions, revealing superposed and independent computations within a single head or MLP.
- Score: 22.333229451408414
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron layers (MLPs), the building blocks of a transformer architecture, as indivisible units, overlooking the possibility of functional substructure learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate our perspective on widely used standard tasks such as Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple overlapping subfunctions aligned with distinct singular directions. Nodes in the computational graph that were previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that transformer computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.
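As a concrete illustration of the decomposition the abstract describes, the sketch below SVD-factors a single head's OV matrix into orthogonal singular directions and scores a residual-stream vector against each one. Shapes follow GPT-2 conventions, the weights are random stand-ins, and this is a generic probe under those assumptions, not the paper's exact procedure.

```python
# Minimal sketch: decompose one attention head's OV matrix into singular
# directions and measure how strongly an activation aligns with each one.
# Random stand-in weights; GPT-2-style shapes are assumed for illustration.
import numpy as np

d_model, d_head = 768, 64
rng = np.random.default_rng(0)

# Stand-ins for a trained head's value and output projections.
W_V = rng.normal(scale=0.02, size=(d_model, d_head))
W_O = rng.normal(scale=0.02, size=(d_head, d_model))

# The head's end-to-end OV map has rank at most d_head.
W_OV = W_V @ W_O                      # (d_model, d_model)

# Orthogonal singular directions: each (u_i, s_i, v_i) triple is an
# independent rank-1 computation superposed inside the same head.
U, S, Vt = np.linalg.svd(W_OV)

resid = rng.normal(size=d_model)      # a residual-stream activation
# Activation strength along each input singular direction, scaled by s_i.
scores = S[:d_head] * (Vt[:d_head] @ resid)
top = np.argsort(-np.abs(scores))[:5]
print("strongest singular directions:", top, scores[top])
```

In a real model, `W_V` and `W_O` would be loaded from a specific layer and head, and `resid` taken from a forward pass on a task prompt.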
Related papers
- Transformers converge to invariant algorithmic cores [0.0]
GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number across scales. Mechanistic interpretability could benefit from targeting such invariants -- the computational essence -- rather than implementation-specific details.
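The single-axis claim suggests a simple intervention: reflect the residual stream across the number axis. A minimal sketch, with a random stand-in for the axis the paper would extract from a trained model:

```python
# Hedged sketch of a single-axis flip: reflect a residual-stream vector
# across a hypothesized "grammatical number" axis. The axis below is random
# for illustration; the paper identifies it from trained GPT-2 models.
import numpy as np

rng = np.random.default_rng(1)
d_model = 768
axis = rng.normal(size=d_model)
axis /= np.linalg.norm(axis)          # unit vector for the number axis

def flip_along(v, u):
    """Flip v's component along unit vector u, leaving the rest unchanged."""
    return v - 2.0 * (v @ u) * u

resid = rng.normal(size=d_model)
flipped = flip_along(resid, axis)
print("component before/after:", resid @ axis, flipped @ axis)  # sign inverted
```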
arXiv Detail & Related papers (2026-02-26T04:09:11Z) - TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors [53.891337639229285]
We introduce TensorLens, a novel formulation that captures the entire transformer as a single, input-dependent linear operator expressed through a high-order attention-interaction tensor. Our experiments demonstrate that the attention tensor can serve as a powerful foundation for developing tools aimed at interpretability and model understanding.
arXiv Detail & Related papers (2026-01-25T19:21:25Z) - On the Emergence of Induction Heads for In-Context Learning [121.64612469118464]
We study the emergence of induction heads, a previously identified mechanism in two-layer transformers. We explain the origin of this structure using a minimal ICL task formulation and a modified transformer architecture.
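The induction-head mechanism itself reduces to a match-and-copy rule over repeated tokens; a minimal sketch of that rule, independent of the paper's training analysis:

```python
# Minimal sketch of the induction-head rule: on a repeated sequence
# ...[A][B]...[A], attend to the earlier [A] and predict the token that
# followed it ([B]). Pure pattern matching, no learned weights.
def induction_predict(tokens):
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan for a previous match
        if tokens[i] == last:
            return tokens[i + 1]               # copy its successor
    return None

print(induction_predict(["the", "cat", "sat", "the"]))  # -> "cat"
```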
arXiv Detail & Related papers (2025-11-02T18:12:06Z) - Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers [0.10152838128195467]
We train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification task. A single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking residual connections and normalization layers. A two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions.
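A symbolic IOI example can be generated from a simple template: two names appear, one is repeated as the subject, and the target is the other. The name pool and phrasing below are illustrative, not the paper's exact dataset.

```python
# Sketch of an IOI example generator: the answer is the non-repeated name
# (the indirect object). Template and names are illustrative stand-ins.
import random

NAMES = ["Mary", "John", "Alice", "Bob"]

def make_ioi_example(rng):
    a, b = rng.sample(NAMES, 2)
    subject = rng.choice([a, b])
    target = b if subject == a else a          # the non-repeated name
    prompt = f"When {a} and {b} went to the store, {subject} gave a drink to"
    return prompt, target

rng = random.Random(0)
print(make_ioi_example(rng))
```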
arXiv Detail & Related papers (2025-10-28T22:25:19Z) - Unlocking Out-of-Distribution Generalization in Transformers via Recursive Latent Space Reasoning [50.99796659680724]
This work investigates out-of-distribution (OOD) generalization in Transformer networks, using a GSM8K-style task of modular arithmetic on computational graphs as a testbed. We introduce and explore a set of four architectural mechanisms aimed at enhancing OOD generalization. We complement these empirical results with a detailed mechanistic interpretability analysis that reveals how these mechanisms give rise to robust OOD generalization abilities.
arXiv Detail & Related papers (2025-10-15T21:03:59Z) - On the Existence of Universal Simulators of Attention [17.01811978811789]
We present solutions to identically replicate attention outputs and the underlying elementary matrix and activation operations via RASP. Our proofs show, for the first time, the existence of an algorithmically achievable, data-agnostic solution, previously known to be approximated only by learning.
arXiv Detail & Related papers (2025-06-23T15:15:25Z) - RiemannFormer: A Framework for Attention in Curved Spaces [0.43512163406552]
This research endeavors to offer insights into unlocking the further potential of transformer-based architectures. One of the primary motivations is to offer a geometric interpretation for the attention mechanism in transformers.
arXiv Detail & Related papers (2025-06-09T03:56:18Z) - ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention [9.470124763460904]
This paper introduces ComplexFormer, featuring Complex Multi-Head Attention (CMHA). CMHA empowers each head to independently model semantic and positional differences unified within the complex plane. Tests show ComplexFormer achieves superior performance, significantly lower generation perplexity, and improved long-context coherence.
arXiv Detail & Related papers (2025-05-15T12:30:33Z) - Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
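An affine-recurrence ICL instance can be sketched as a sequence generator: the model sees the prefix of a sequence obeying x_{t+1} = A x_t + b and must predict the next term. Dimensions and scales below are illustrative.

```python
# Sketch of an affine-recurrence ICL task: A and b are sampled fresh per
# sequence, so the rule must be inferred from the context alone.
import numpy as np

def affine_sequence(rng, d=4, T=8):
    A = rng.normal(scale=0.5, size=(d, d))
    b = rng.normal(size=d)
    xs = [rng.normal(size=d)]
    for _ in range(T - 1):
        xs.append(A @ xs[-1] + b)
    return np.stack(xs)          # context = xs[:-1], target = xs[-1]

rng = np.random.default_rng(0)
print(affine_sequence(rng).shape)  # (8, 4)
```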
arXiv Detail & Related papers (2024-10-22T21:30:01Z) - Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
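The training data in this setting can be sketched as sequences drawn from freshly sampled n-gram Markov chains; vocabulary size and chain order below are illustrative choices.

```python
# Sketch of n-gram Markov-chain ICL data: each sequence uses its own
# transition table conditioned on the previous (n-1) tokens.
import numpy as np

def markov_sequence(rng, vocab=3, n=2, T=16):
    # Transition probabilities conditioned on the previous n-1 tokens.
    P = rng.dirichlet(np.ones(vocab), size=(vocab,) * (n - 1))
    seq = list(rng.integers(vocab, size=n - 1))
    for _ in range(T - (n - 1)):
        ctx = tuple(seq[-(n - 1):])
        seq.append(rng.choice(vocab, p=P[ctx]))
    return seq

rng = np.random.default_rng(0)
print(markov_sequence(rng))
```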
arXiv Detail & Related papers (2024-09-09T18:10:26Z) - How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size.
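Such a task can be sketched as y = ⟨w, φ(x)⟩ with a fixed nonlinear representation φ shared across prompts and a fresh linear head w per prompt; the MLP-style φ below is an arbitrary stand-in for the paper's representation function.

```python
# Sketch of the compositional ICL setup: labels depend on inputs through a
# fixed representation phi, composed with a task-specific linear head w.
import numpy as np

rng = np.random.default_rng(0)
d, k, n_ctx = 8, 16, 32
W1 = rng.normal(size=(k, d))           # parameters of the fixed phi

def phi(x):
    return np.tanh(W1 @ x)             # fixed, possibly complex, representation

w = rng.normal(size=k)                 # fresh linear head per prompt
X = rng.normal(size=(n_ctx, d))
y = np.array([w @ phi(x) for x in X])  # in-context examples (X, y)
print(X.shape, y.shape)
```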
arXiv Detail & Related papers (2023-10-16T17:40:49Z) - How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure."
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z) - Inductive Biases and Variable Creation in Self-Attention Mechanisms [25.79946667926312]
This work provides a theoretical analysis of the inductive biases of self-attention modules.
Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent.
Our main result shows that bounded-norm Transformer layers create sparse variables.
arXiv Detail & Related papers (2021-10-19T16:36:19Z)