On the Theoretical Foundation of Sparse Dictionary Learning in Mechanistic Interpretability
- URL: http://arxiv.org/abs/2512.05534v1
- Date: Fri, 05 Dec 2025 08:47:19 GMT
- Title: On the Theoretical Foundation of Sparse Dictionary Learning in Mechanistic Interpretability
- Authors: Yiming Tang, Harshvardhan Saini, Yizhen Liao, Dianbo Liu
- Abstract summary: We develop the first unified theoretical framework that treats sparse dictionary learning (SDL) as a single optimization problem. We provide the first theoretical explanations for several empirically observed phenomena, including feature absorption, dead neurons, and the neuron resampling technique.
- Score: 5.009082958329585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they process information has become increasingly important for both scientific progress and trustworthy deployment. Recent works in mechanistic interpretability have shown that neural networks represent meaningful concepts as directions in their representation spaces and often encode many concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into interpretable features. These methods have demonstrated remarkable empirical success, but their theoretical understanding remains limited. Existing theoretical work covers only sparse autoencoders with tied-weight constraints, leaving the broader family of SDL methods without formal grounding. In this work, we develop the first unified theoretical framework that treats SDL as a single optimization problem. We demonstrate how diverse methods instantiate this framework and provide a rigorous analysis of the optimization landscape. We provide the first theoretical explanations for several empirically observed phenomena, including feature absorption, dead neurons, and the neuron resampling technique. We further design controlled experiments to validate our theoretical results.
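To make the shared structure concrete, the minimal sketch below instantiates the reconstruction-plus-sparsity objective that the abstract describes, using a sparse autoencoder with an L1 penalty. This is an illustrative assumption, not the paper's implementation: the module names, dimensions, penalty coefficient, and PyTorch framing are all chosen for the example, and transcoders or crosscoders would change what is reconstructed rather than this objective structure.

```python
# A minimal sketch (not the authors' implementation) of the shared SDL
# objective: an auxiliary model reconstructs activations from a sparse,
# overcomplete code. A sparse autoencoder with an L1 penalty is shown;
# all names, dimensions, and the penalty coefficient are illustrative.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: d_dict >> d_model.
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        # Non-negative sparse feature activations f(x).
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f


def sdl_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Unified SDL objective: reconstruction error plus a sparsity penalty.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity


# Toy usage on random "activations"; in practice x would be residual-stream
# or MLP activations collected from the model being interpreted.
sae = SparseAutoencoder(d_model=512, d_dict=4096)
x = torch.randn(64, 512)
x_hat, f = sae(x)
loss = sdl_loss(x, x_hat, f)
loss.backward()

# "Dead" features in this sketch: dictionary entries that never activate
# on the batch (the phenomenon the paper analyzes theoretically).
dead = (f > 0).any(dim=0).logical_not()
print(f"loss={loss.item():.4f}, dead features={int(dead.sum())}")
```

The point of the sketch is only the shared reconstruction-plus-sparsity structure that the unified framework abstracts over; the paper's analysis, not this code, is what explains feature absorption, dead neurons, and resampling.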
Related papers
- Deep Unfolding: Recent Developments, Theory, and Design Guidelines [99.63555420898554]
This article provides a tutorial-style overview of deep unfolding, a framework that transforms optimization algorithms into structured, trainable ML architectures. We review the foundations of optimization for inference and for learning, introduce four representative design paradigms for deep unfolding, and discuss the distinctive training schemes that arise from their iterative nature.
arXiv Detail & Related papers (2025-12-03T13:16:35Z) - Fourier Neural Operators Explained: A Practical Perspective [75.12291469255794]
The Fourier Neural Operator (FNO) has become the most influential and widely adopted neural operator due to its elegant spectral formulation. This guide aims to establish a clear and reliable framework for applying FNOs effectively across diverse scientific and engineering fields.
arXiv Detail & Related papers (2025-12-01T08:56:21Z) - How LLMs Learn to Reason: A Complex Network Perspective [14.638878448692493]
Training large language models with Reinforcement Learning from Verifiable Rewards exhibits a set of puzzling behaviors. We propose that these seemingly disparate phenomena can be explained using a single unifying theory. Our work provides a new physical intuition for engineering the emergent reasoning capabilities of future AI systems.
arXiv Detail & Related papers (2025-09-28T04:10:37Z) - Large Language Models as Computable Approximations to Solomonoff Induction [11.811838796672369]
We establish the first formal connection between large language models (LLMs) and Algorithmic Information Theory (AIT). We leverage AIT to provide a unified theoretical explanation for in-context learning, few-shot learning, and scaling laws. Our framework bridges the gap between theoretical foundations and practical LLM behaviors, providing both explanatory power and actionable insights for future model development.
arXiv Detail & Related papers (2025-05-21T17:35:08Z) - I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z) - Sample-efficient Learning of Concepts with Theoretical Guarantees: from Data to Concepts without Interventions [13.877511370053794]
Concept Bottleneck Models (CBMs) address some of these challenges by learning interpretable concepts from high-dimensional data. We describe a framework that provides theoretical guarantees on the correctness of the learned concepts and on the number of required labels. We evaluate our framework in synthetic and image benchmarks, showing that the learned concepts have fewer impurities and are often more accurate than other CBMs.
arXiv Detail & Related papers (2025-02-10T15:01:56Z) - Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning [53.685764040547625]
Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergence capabilities. This work provides a fine mathematical analysis to show how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities.
arXiv Detail & Related papers (2024-11-04T15:54:32Z) - Coding for Intelligence from the Perspective of Category [66.14012258680992]
Coding targets compressing and reconstructing data, while intelligence centers on model learning and prediction.
Recent trends demonstrate the potential homogeneity of these two fields.
We propose a novel problem of Coding for Intelligence from the category theory view.
arXiv Detail & Related papers (2024-07-01T07:05:44Z) - Interpretable Neural-Symbolic Concept Reasoning [7.1904050674791185]
Concept-based models aim to address the opacity of deep learning models by learning tasks based on a set of human-understandable concepts.
We propose the Deep Concept Reasoner (DCR), the first interpretable concept-based model that builds upon concept embeddings.
arXiv Detail & Related papers (2023-04-27T09:58:15Z) - From Attribution Maps to Human-Understandable Explanations through Concept Relevance Propagation [16.783836191022445]
The field of eXplainable Artificial Intelligence (XAI) aims to bring transparency to today's powerful but opaque deep learning models.
While local XAI methods explain individual predictions in the form of attribution maps, global explanation techniques visualize what concepts a model has generally learned to encode.
arXiv Detail & Related papers (2022-06-07T12:05:58Z) - A Chain Graph Interpretation of Real-World Neural Networks [58.78692706974121]
We propose an alternative interpretation that identifies NNs as chain graphs (CGs) and feed-forward as an approximate inference procedure.
The CG interpretation specifies the nature of each NN component within the rich theoretical framework of probabilistic graphical models.
We demonstrate with concrete examples that the CG interpretation can provide novel theoretical support and insights for various NN techniques.
arXiv Detail & Related papers (2020-06-30T14:46:08Z)