Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry
- URL: http://arxiv.org/abs/2503.01822v1
- Date: Mon, 03 Mar 2025 18:47:40 GMT
- Title: Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry
- Authors: Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba
- Abstract summary: We introduce a unified framework that recasts SAEs as solutions to a bilevel optimization problem. We show that SAEs fail to recover concepts when their geometric properties (heterogeneous intrinsic dimensionality and nonlinear separability) are ignored. Our findings challenge the idea of a universal SAE and underscore the need for architecture-specific choices in model interpretability.
- Score: 11.968306791864034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward certain kinds of concepts? We introduce a unified framework that recasts SAEs as solutions to a bilevel optimization problem, revealing a fundamental challenge: each SAE imposes structural assumptions about how concepts are encoded in model representations, which in turn shapes what it can and cannot detect. This means different SAEs are not interchangeable -- switching architectures can expose entirely new concepts or obscure existing ones. To systematically probe this effect, we evaluate SAEs across a spectrum of settings: from controlled toy models that isolate key variables, to semi-synthetic experiments on real model activations, and finally to large-scale, naturalistic datasets. Across this progression, we examine two fundamental properties that real-world concepts often exhibit: heterogeneity in intrinsic dimensionality (some concepts are inherently low-dimensional, others are not) and nonlinear separability. We show that SAEs fail to recover concepts when these properties are ignored, and we design a new SAE that explicitly incorporates both, enabling the discovery of previously hidden concepts and reinforcing our theoretical insights. Our findings challenge the idea of a universal SAE and underscore the need for architecture-specific choices in model interpretability. Overall, we argue that an SAE does not just reveal concepts -- it determines what can be seen at all.
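To make this concrete, here is a minimal sketch of the kind of vanilla SAE the abstract discusses: a linear encoder with a ReLU nonlinearity and an L1 sparsity penalty. All names and hyperparameters are illustrative assumptions, not taken from the paper; the point is that the encoder's functional form is itself a structural assumption, since a single linear-plus-ReLU map favors concepts that are one-dimensional and linearly separable in the representation space.

```python
import torch
import torch.nn as nn


class VanillaSAE(nn.Module):
    """Minimal sparse autoencoder sketch (illustrative, not the paper's architecture).

    The encoder is a single affine map followed by ReLU: this hard-codes the
    assumption that each concept is a one-dimensional, linearly accessible direction.
    """

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))  # sparse codes ("concept" activations)
        x_hat = self.decoder(z)          # linear reconstruction from dictionary atoms
        return x_hat, z


def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction term plus an L1 sparsity penalty; the penalty is another
    # architectural assumption (it prefers few active, axis-aligned codes).
    return ((x - x_hat) ** 2).mean() + l1_coeff * z.abs().mean()
```

Swapping this encoder for, say, a top-k or sequential variant changes which codes can be active together, which is exactly the sense in which different SAE architectures are not interchangeable.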
Related papers
- From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit [16.996218963146788]
We show that MP-SAE unrolls its encoder into a sequence of residual-guided steps, allowing it to capture hierarchical and nonlinearly accessible features. We also show that the sequential encoder principle of MP-SAE affords an additional benefit of adaptive sparsity at inference time. (A schematic matching-pursuit sketch appears after this list.)
arXiv Detail & Related papers (2025-06-03T17:24:55Z) - Towards Better Generalization and Interpretability in Unsupervised Concept-Based Models [9.340843984411137]
This paper introduces a novel unsupervised concept-based model for image classification, named Learnable Concept-Based Model (LCBM). We demonstrate that LCBM surpasses existing unsupervised concept-based models in generalization capability and nearly matches the performance of black-box models. Despite the use of concept embeddings, we maintain model interpretability by means of a local linear combination of concepts.
arXiv Detail & Related papers (2025-06-02T16:26:41Z) - Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations [23.993903128858832]
We develop an evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. We find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios. Overall, our results suggest that SAE concept representations are fragile and may be ill-suited for applications in model monitoring and oversight.
arXiv Detail & Related papers (2025-05-21T20:42:05Z) - I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z) - Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models [16.894375498353092]
Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability. Existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries. We present Archetypal SAEs, wherein dictionary atoms are constrained to the convex hull of data. (A schematic sketch of this constraint appears after this list.)
arXiv Detail & Related papers (2025-02-18T14:29:11Z) - Sample-efficient Learning of Concepts with Theoretical Guarantees: from Data to Concepts without Interventions [7.3784937557132855]
Concept-based models (CBMs) learn interpretable concepts from high-dimensional data, e.g. images, which are used to predict labels. An important issue in CBMs is concept leakage, i.e., spurious information in the learned concepts, which effectively leads to learning "wrong" concepts. We describe a framework that provides theoretical guarantees on the correctness of the learned concepts and on the number of required labels.
arXiv Detail & Related papers (2025-02-10T15:01:56Z) - Explaining Explainability: Recommendations for Effective Use of Concept Activation Vectors [35.37586279472797]
Concept Activation Vectors (CAVs) are learnt using a probe dataset of concept exemplars. We investigate three properties of CAVs: (1) inconsistency across layers, (2) entanglement with other concepts, and (3) spatial dependency. We introduce tools designed to detect the presence of these properties, provide insight into how each property can lead to misleading explanations, and provide recommendations to mitigate their impact.
arXiv Detail & Related papers (2024-04-04T17:46:20Z) - Hierarchical Invariance for Robust and Interpretable Vision Tasks at Larger Scales [54.78115855552886]
We show how to construct over-complete invariants with a Convolutional Neural Network (CNN)-like hierarchical architecture.
Thanks to this over-completeness, discriminative features for the task can be adaptively formed in a Neural Architecture Search (NAS)-like manner.
For robust and interpretable vision tasks at larger scales, the hierarchical invariant representation can be considered an effective alternative to traditional CNNs and invariants.
arXiv Detail & Related papers (2024-02-23T16:50:07Z) - Do Concept Bottleneck Models Respect Localities? [14.77558378567965]
Concept-based methods explain model predictions using human-understandable concepts.
"Localities" involve using only relevant features when predicting a concept's value.
CBMs may not capture localities, even when independent concepts are localised to non-overlapping feature subsets.
arXiv Detail & Related papers (2024-01-02T16:05:23Z) - Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z) - Implicit Concept Removal of Diffusion Models [92.55152501707995]
Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images.
We present Geom-Erasing, a novel concept removal method based on geometric-driven control.
arXiv Detail & Related papers (2023-10-09T17:13:10Z) - Multi-dimensional concept discovery (MCD): A unifying framework with completeness guarantees [1.9465727478912072]
We propose Multi-dimensional Concept Discovery (MCD) as an extension of previous approaches that fulfills a completeness relation on the level of concepts.
We empirically demonstrate the superiority of MCD over more constrained concept definitions.
arXiv Detail & Related papers (2023-01-27T18:53:19Z) - Concept Gradient: Concept-based Interpretation Without Linear Assumption [77.96338722483226]
Concept Activation Vector (CAV) relies on learning a linear relation between some latent representation of a given model and concepts.
We propose Concept Gradient (CG), extending concept-based interpretation beyond linear concept functions.
We demonstrate that CG outperforms CAV in both toy examples and real-world datasets.
arXiv Detail & Related papers (2022-08-31T17:06:46Z) - Translational Concept Embedding for Generalized Compositional Zero-shot Learning [73.60639796305415]
Generalized compositional zero-shot learning means learning composed concepts of attribute-object pairs in a zero-shot fashion.
This paper introduces a new approach, termed translational concept embedding, to solve these two difficulties in a unified framework.
arXiv Detail & Related papers (2021-12-20T21:27:51Z) - A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts.
In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images.
We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z) - CURI: A Benchmark for Productive Concept Learning Under Uncertainty [33.83721664338612]
We introduce a new few-shot, meta-learning benchmark, Compositional Reasoning Under Uncertainty (CURI).
CURI evaluates different aspects of productive and systematic generalization, including abstract understandings of disentangling, productive generalization, learning operations, variable binding, etc.
It also defines a model-independent "compositionality gap" to evaluate the difficulty of generalizing out-of-distribution along each of these axes.
arXiv Detail & Related papers (2020-10-06T16:23:17Z)
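As referenced in the Matching Pursuit entry above, the following is a schematic sketch of a residual-guided, greedy sequential encoder in the spirit of matching pursuit. It is an illustrative assumption of how such an encoder can operate, not the MP-SAE implementation; the function name and step count are hypothetical.

```python
import torch


def matching_pursuit_encode(x: torch.Tensor, dictionary: torch.Tensor, n_steps: int = 8):
    """Greedy, residual-guided sparse encoding (schematic, not the MP-SAE code).

    x:          (d_model,) activation vector
    dictionary: (n_atoms, d_model), rows assumed unit-norm
    Returns sparse codes of shape (n_atoms,).
    """
    codes = torch.zeros(dictionary.shape[0])
    residual = x.clone()
    for _ in range(n_steps):
        scores = dictionary @ residual                  # correlation of each atom with the residual
        idx = torch.argmax(scores.abs())                # pick the best-matching atom
        coeff = scores[idx]
        codes[idx] += coeff                             # accumulate its coefficient
        residual = residual - coeff * dictionary[idx]   # explain away that atom's contribution
    return codes
```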
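Similarly, the Archetypal SAE entry above describes dictionary atoms constrained to the convex hull of the data. One common way to parameterize such a constraint is to form each atom as a convex (softmax-weighted) combination of a fixed set of data anchors; the sketch below illustrates that idea under this assumption and is not the paper's implementation.

```python
import torch
import torch.nn as nn


class ArchetypalDictionary(nn.Module):
    """Dictionary whose atoms are convex combinations of data anchors (schematic)."""

    def __init__(self, anchors: torch.Tensor, n_atoms: int):
        super().__init__()
        # anchors: (n_anchors, d_model) fixed data points defining the hull
        self.register_buffer("anchors", anchors)
        self.logits = nn.Parameter(torch.randn(n_atoms, anchors.shape[0]))

    def atoms(self) -> torch.Tensor:
        weights = torch.softmax(self.logits, dim=-1)  # rows are convex weights (non-negative, sum to 1)
        return weights @ self.anchors                 # each atom lies in the convex hull of the anchors
```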