Singular Vectors of Attention Heads Align with Features
- URL: http://arxiv.org/abs/2602.13524v1
- Date: Fri, 13 Feb 2026 23:30:02 GMT
- Title: Singular Vectors of Attention Heads Align with Features
- Authors: Gabriel Franco, Carson Loughridge, Mark Crovella,
- Abstract summary: We show that singular vectors robustly align with features in a model where features can be directly observed. We then show theoretically that such alignment is expected under a range of conditions. We close by asking how, operationally, alignment may be recognized in real models where feature representations are not directly observable.
- Score: 5.2088687180672375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made an implicit assumption that feature representations can be inferred in some cases from singular vectors of attention matrices. However, sound justification for this assumption is lacking. In this paper we address that question, asking: why and when do singular vectors align with features? First, we demonstrate that singular vectors robustly align with features in a model where features can be directly observed. We then show theoretically that such alignment is expected under a range of conditions. We close by asking how, operationally, alignment may be recognized in real models where feature representations are not directly observable. We identify sparse attention decomposition as a testable prediction of alignment, and show evidence that it emerges in a manner consistent with predictions in real models. Together these results suggest that alignment of singular vectors with features can be a sound and theoretically justified basis for feature identification in language models.
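To make the claim concrete, here is a minimal, self-contained sketch (not code from the paper): a toy interaction matrix of the kind an attention head's query and key weights define (written W_QK here, an assumed setup) is assembled from a handful of known orthonormal "feature" directions, and its singular vectors are checked against them. Every dimension, weight, and feature below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 4

# Hypothetical ground-truth features: random orthonormal directions.
features, _ = np.linalg.qr(rng.standard_normal((d_model, n_features)))

# A toy head whose query-key interaction reads each feature against the next one.
weights = [3.0, 2.0, 1.5, 1.0]  # assumed interaction strengths
W_QK = sum(w * np.outer(features[:, i], features[:, (i + 1) % n_features])
           for i, w in enumerate(weights))

U, S, Vt = np.linalg.svd(W_QK)

# Alignment check: cosine similarity between singular vectors and planted features.
for k in range(n_features):
    cos_left = np.abs(features.T @ U[:, k]).max()
    cos_right = np.abs(features.T @ Vt[k]).max()
    print(f"sigma_{k} = {S[k]:.2f}: max |cos| left = {cos_left:.3f}, right = {cos_right:.3f}")
```

In this synthetic setting the maximum absolute cosine similarities come out at 1.0 for the leading singular vectors; this is the kind of alignment the paper argues should also arise, under the conditions it spells out, in trained models.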
Related papers
- From Black-box to Causal-box: Towards Building More Interpretable Models [57.23201263629627]
We introduce the notion of causal interpretability, which formalizes when counterfactual queries can be evaluated from a specific class of models. We derive a complete graphical criterion that determines whether a given model architecture supports a given counterfactual query.
arXiv Detail & Related papers (2025-10-24T20:03:18Z)
- Emergence of Quantised Representations Isolated to Anisotropic Functions [0.0]
This paper presents a novel methodology for determining representational structure, which builds upon the existing Spotlight Resonance method. It shows how discrete representations can emerge and organise in autoencoder models through a controlled ablation study in which only the activation function is altered. Using this technique, the paper assesses whether function-driven symmetries can act as implicit inductive biases on representations.
arXiv Detail & Related papers (2025-07-16T09:27:54Z)
- The Origins of Representation Manifolds in Large Language Models [52.68554895844062]
We show that cosine similarity in representation space may encode the intrinsic geometry of a feature through shortest, on-manifold paths. The critical assumptions and predictions of the theory are validated on text embeddings and token activations of large language models.
arXiv Detail & Related papers (2025-05-23T13:31:22Z)
- Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors [61.92704516732144]
We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. We propose two methods that leverage causal mechanisms to predict the correctness of model outputs.
arXiv Detail & Related papers (2025-05-17T00:31:39Z)
- Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations. We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level.
arXiv Detail & Related papers (2025-01-02T22:26:54Z)
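A minimal sketch of the recipe summarized above, using entirely synthetic data: the hypothetical features stand in for the probabilities a black-box model would assign to a fixed set of responses to follow-up prompts, and a linear model is fit on them to predict instance-level correctness.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_instances, n_followups = 500, 8

# Simulated features: one response probability per follow-up prompt.
X = rng.beta(2, 2, size=(n_instances, n_followups))
# Simulated correctness labels, loosely tied to those features.
scores = X @ rng.standard_normal(n_followups)
y = (scores + 0.3 * rng.standard_normal(n_instances) > np.median(scores)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
probe = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy of the linear correctness predictor:", probe.score(X_test, y_test))
```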
- Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct [0.0]
We find that the Llama3-8b-Instruct chat model can reliably distinguish its own outputs from those of humans. We identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment. We show that the vector can be used to control both the model's behavior and its perception.
arXiv Detail & Related papers (2024-10-02T22:26:21Z)
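As a hedged illustration of the kind of analysis this summary describes (not the paper's actual procedure), the sketch below estimates a "differentially activated" direction as a difference of class-mean residual-stream activations over synthetic data, then shifts one activation along it to mimic steering.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 128

# A planted direction standing in for the unknown self-recognition feature.
planted = rng.standard_normal(d_model)
planted /= np.linalg.norm(planted)

# Simulated residual-stream activations for self-generated vs. human-written text.
self_acts = rng.standard_normal((200, d_model)) + 1.5 * planted
human_acts = rng.standard_normal((200, d_model)) - 1.5 * planted

# Estimate the direction as a difference of class means.
direction = self_acts.mean(axis=0) - human_acts.mean(axis=0)
direction /= np.linalg.norm(direction)
print("cosine with the planted direction:", float(planted @ direction))

# "Steering": shift one activation along the vector to flip its projection.
x = human_acts[0]
x_steered = x + 4.0 * direction
print("projection before / after steering:", float(x @ direction), float(x_steered @ direction))
```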
- Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction [4.887047578768969]
We introduce complexity measures of the local topology of the latent space of a contextual language model.
Our work continues a line of research that explores the manifold hypothesis for word embeddings.
arXiv Detail & Related papers (2024-08-07T11:44:32Z)
- Rationalizing Predictions by Adversarial Information Calibration [65.19407304154177]
We train two models jointly: one is a typical neural model that solves the task at hand in an accurate but black-box manner, and the other is a selector-predictor model that additionally produces a rationale for its prediction.
We use an adversarial technique to calibrate the information extracted by the two models such that the difference between them is an indicator of the missed or over-selected features.
arXiv Detail & Related papers (2023-01-15T03:13:09Z)
- A simple probabilistic neural network for machine understanding [0.0]
We discuss probabilistic neural networks with a fixed internal representation as models for machine understanding.
We derive the internal representation by requiring that it satisfies the principles of maximal relevance and of maximal ignorance about how different features are combined.
We argue that learning machines with this architecture enjoy a number of interesting properties, like the continuity of the representation with respect to changes in parameters and data.
arXiv Detail & Related papers (2022-10-24T13:00:15Z)
- Interpreting Language Models with Contrastive Explanations [99.7035899290924]
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics.
Existing explanation methods conflate evidence for all these features into a single explanation, which is less interpretable for human understanding.
We show that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena.
arXiv Detail & Related papers (2022-02-21T18:32:24Z)
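A toy sketch of the contrastive idea (not the paper's implementation): attribute the gap between a target token's score and a foil token's score to the input positions of a tiny, made-up linear "language model" via input-times-gradient saliency.

```python
import torch

torch.manual_seed(0)
vocab_size, d_model, seq_len = 50, 16, 6

embed = torch.nn.Embedding(vocab_size, d_model)
head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (seq_len,))
x = embed(tokens).detach().requires_grad_(True)  # keep gradients on the input embeddings
logits = head(x.mean(dim=0))                     # toy bag-of-embeddings next-token scores

target, foil = 7, 13                             # hypothetical competing tokens
(logits[target] - logits[foil]).backward()

# Contrastive saliency per input position: how much each token pushed the model
# toward the target rather than the foil (input x gradient attribution).
saliency = (x.grad * x.detach()).sum(dim=-1)
print(saliency)
```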
- The Irrationality of Neural Rationale Models [6.159428088113691]
We argue to the contrary, with both philosophical perspectives and empirical evidence suggesting that rationale models are, perhaps, less rational and interpretable than expected.
We call for more rigorous and comprehensive evaluations of these models to ensure desired properties of interpretability are indeed achieved.
arXiv Detail & Related papers (2021-10-14T17:22:10Z)