Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints
- URL: http://arxiv.org/abs/2602.09783v1
- Date: Tue, 10 Feb 2026 13:42:55 GMT
- Title: Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints
- Authors: Andres Saurez, Yousung Lee, Dongsoo Har
- Abstract summary: We show that linear probes and sparse autoencoders consistently recover meaningful structure from transformer representations. We formalize this as the \emph{Invariant Subspace Necessity} theorem and derive the \emph{Self-Reference Property}: tokens directly provide the geometric direction for their associated features.
- Score: 5.104181562775778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linear probes and sparse autoencoders consistently recover meaningful structure from transformer representations -- yet why should such simple methods succeed in deep, nonlinear systems? We show this is not merely an empirical regularity but a consequence of architectural necessity: transformers communicate information through linear interfaces (attention OV circuits, unembedding matrices), and any semantic feature decoded through such an interface must occupy a context-invariant linear subspace. We formalize this as the \emph{Invariant Subspace Necessity} theorem and derive the \emph{Self-Reference Property}: tokens directly provide the geometric direction for their associated features, enabling zero-shot identification of semantic structure without labeled data or learned probes. Empirical validation across eight classification tasks and four model families confirms the alignment between class tokens and semantically related instances. Our framework provides \textbf{a principled architectural explanation} for why linear interpretability methods work, unifying linear probes and sparse autoencoders.
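The Self-Reference Property lends itself to a compact illustration. The sketch below is a hypothetical toy construction in NumPy, not the paper's implementation: the class-token directions and instance representations are synthetic, and the instances are generated by hand to lie near their class subspace. It shows the mechanism the abstract describes: a class token's own embedding serves as the probe direction for its class, so instances can be scored zero-shot by cosine similarity, with no labeled data and no learned probe.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden dimension (arbitrary for the sketch)

# Hypothetical class-token directions, standing in for e.g. rows of the
# unembedding matrix for two class tokens.
class_dirs = rng.normal(size=(2, d))

def zero_shot_scores(hidden, class_dirs):
    """Cosine similarity between one hidden state and each class direction."""
    h = hidden / np.linalg.norm(hidden)
    c = class_dirs / np.linalg.norm(class_dirs, axis=1, keepdims=True)
    return c @ h

# Synthetic "instance" representations: each lies mostly along its class
# direction, plus context noise (the context-invariant-subspace picture).
inst0 = 3.0 * class_dirs[0] + rng.normal(size=d)
inst1 = 3.0 * class_dirs[1] + rng.normal(size=d)

print(np.argmax(zero_shot_scores(inst0, class_dirs)))  # prints 0
print(np.argmax(zero_shot_scores(inst1, class_dirs)))  # prints 1
```

Because each instance concentrates its mass in its class's subspace, the argmax over cosine scores recovers the class without any training, which is the zero-shot identification the abstract claims.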
Related papers
- Structural Disentanglement in Bilinear MLPs via Architectural Inductive Bias [0.0]
We argue that failures arise from how models structure their internal representations during training. We show analytically that bilinear parameterizations possess a "non-mixing" property under gradient flow conditions. Unlike pointwise nonlinear networks, multiplicative architectures are able to recover true operators aligned with the underlying algebraic structure.
arXiv Detail & Related papers (2026-02-05T13:14:01Z) - Learning Eigenstructures of Unstructured Data Manifolds [47.81117132002129]
We introduce a novel framework that learns a spectral basis for shape and manifold analysis from unstructured data. By replacing the traditional operator selection, construction, and eigendecomposition with a learning-based approach, our framework offers a principled, data-driven alternative to conventional pipelines.
arXiv Detail & Related papers (2025-11-30T22:06:49Z) - Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations [1.0152838128195467]
We construct Transformer models where the embedding layer is entirely frozen. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer. Despite the absence of trainable, semantically meaningful embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings.
arXiv Detail & Related papers (2025-07-07T11:17:32Z) - Self-Attention as a Parametric Endofunctor: A Categorical Framework for Transformer Architectures [0.0]
We develop a category-theoretic framework focusing on the linear components of self-attention. We show that the query, key, and value maps naturally define a parametric 1-morphism in the 2-category $\mathbf{Para}(\mathbf{Vect})$. Stacking multiple self-attention layers corresponds to constructing the free monad on this endofunctor.
arXiv Detail & Related papers (2025-01-06T11:14:18Z) - Hitting "Probe"rty with Non-Linearity, and More [2.1756081703276]
We reformulate the design of non-linear structural probes, making them simpler yet effective.
We qualitatively assess how strongly two words in a sentence are connected in the predicted dependency tree.
We find that the radial basis function (RBF) is an effective non-linear probe for the BERT model.
arXiv Detail & Related papers (2024-02-25T18:33:25Z) - Householder Projector for Unsupervised Latent Semantics Discovery [58.92485745195358]
Householder Projector helps StyleGANs to discover more disentangled and precise semantic attributes without sacrificing image fidelity.
We integrate our projector into pre-trained StyleGAN2/StyleGAN3 and evaluate the models on several benchmarks.
arXiv Detail & Related papers (2023-07-16T11:43:04Z) - How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure".
We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z) - Semi-Supervised Manifold Learning with Complexity Decoupled Chart Autoencoders [45.29194877564103]
This work introduces a chart autoencoder with an asymmetric encoding-decoding process that can incorporate additional semi-supervised information such as class labels.
We discuss the approximation power of such networks and derive a bound that essentially depends on the intrinsic dimension of the data manifold rather than the dimension of ambient space.
arXiv Detail & Related papers (2022-08-22T19:58:03Z) - Frame Averaging for Equivariant Shape Space Learning [85.42901997467754]
A natural way to incorporate symmetries in shape space learning is to ask that the mapping to the shape space (encoder) and mapping from the shape space (decoder) are equivariant to the relevant symmetries.
We present a framework for incorporating equivariance in encoders and decoders by introducing two contributions.
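The averaging idea behind frame averaging can be sketched in a few lines. The toy below is an illustration, not the paper's framework: it shows the simplest case, where symmetrizing an arbitrary map over a small finite group makes the result invariant to that group's action (equivariant maps are obtained by additionally transforming the output, which the sketch omits). The map `f` and the 2-element permutation group are invented for the example.

```python
import numpy as np

def f(x):
    """An arbitrary map with no built-in symmetry."""
    return x[0] - 2 * x[1]

def group_average(f, x, group):
    """Symmetrize f by averaging it over all group actions applied to x."""
    return sum(f(g @ x) for g in group) / len(group)

# Symmetry group: identity and coordinate swap (a 2-element permutation group).
I = np.eye(2)
S = np.array([[0.0, 1.0], [1.0, 0.0]])
group = [I, S]

x = np.array([3.0, 5.0])
print(group_average(f, x, group))      # prints -4.0
print(group_average(f, S @ x, group))  # also -4.0: invariant under the swap
```

The averaged map returns the same value for `x` and its swapped version, even though `f` itself does not, which is the basic mechanism that frame averaging scales up to richer symmetry groups.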
arXiv Detail & Related papers (2021-12-03T06:41:19Z) - A Non-Linear Structural Probe [43.50268085775569]
We study the case of a structural probe, which aims to investigate the encoding of syntactic structure in contextual representations.
By observing that the structural probe learns a metric, we are able to kernelize it and develop a novel non-linear variant.
We test on 6 languages and find that the radial-basis function (RBF) kernel, in conjunction with regularization, achieves a statistically significant improvement.
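The kernelization step admits a short sketch. The code below is a minimal illustration of the general idea, not the authors' probe: since the linear structural probe's learned metric $d_B(h_i, h_j)^2 = \lVert B(h_i - h_j)\rVert^2$ is an inner-product distance, it can be replaced by the distance induced by an RBF kernel in feature space, $d_K(x, y)^2 = K(x,x) - 2K(x,y) + K(y,y)$. The random vectors stand in for contextual embeddings; in the actual probe these feature-space distances would be fit to tree distances in gold dependency parses.

```python
import numpy as np

def rbf(x, y, gamma=0.1):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_dist_sq(x, y, gamma=0.1):
    """Squared distance between x and y in the RBF feature space:
    K(x, x) - 2 K(x, y) + K(y, y)."""
    return rbf(x, x, gamma) - 2 * rbf(x, y, gamma) + rbf(y, y, gamma)

rng = np.random.default_rng(1)
h1, h2 = rng.normal(size=(2, 16))  # stand-ins for contextual embeddings

print(kernel_dist_sq(h1, h1))  # prints 0.0: identical points, zero distance
print(0.0 <= kernel_dist_sq(h1, h2) < 2.0)  # prints True: d_K^2 is bounded
```

Note the bound: $d_K^2 = 2 - 2\exp(-\gamma\lVert x-y\rVert^2) \in [0, 2)$, so unlike the linear metric, the RBF distance saturates for far-apart points, one plausible reason a non-linear probe can behave differently from a linear one.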
arXiv Detail & Related papers (2021-05-21T07:53:10Z) - Unsupervised Distillation of Syntactic Information from Contextualized Word Representations [62.230491683411536]
We tackle the task of unsupervised disentanglement between semantics and structure in neural language representations.
To this end, we automatically generate groups of sentences which are structurally similar but semantically different.
We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics.
arXiv Detail & Related papers (2020-10-11T15:13:18Z) - Deep Hough Transform for Semantic Line Detection [70.28969017874587]
We focus on a fundamental task of detecting meaningful line structures, a.k.a. semantic lines, in natural scenes.
Previous methods neglect the inherent characteristics of lines, leading to sub-optimal performance.
We propose a one-shot end-to-end learning framework for line detection.
arXiv Detail & Related papers (2020-03-10T13:08:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.