Learning Multi-Level Features with Matryoshka Sparse Autoencoders
- URL: http://arxiv.org/abs/2503.17547v1
- Date: Fri, 21 Mar 2025 21:43:28 GMT
- Title: Learning Multi-Level Features with Matryoshka Sparse Autoencoders
- Authors: Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda
- Abstract summary: Matryoshka SAEs are a novel SAE variant that trains multiple nested dictionaries of increasing size. We train Matryoshka SAEs on Gemma-2-2B and TinyStories. We find superior performance on sparse probing and targeted concept erasure tasks.
- Score: 2.039341938086125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting neural networks by extracting the concepts represented in their activations. However, choosing the size of the SAE dictionary (i.e. number of learned concepts) creates a tension: as dictionary size increases to capture more relevant concepts, sparsity incentivizes features to be split or absorbed into more specific features, leaving high-level features missing or warped. We introduce Matryoshka SAEs, a novel variant that addresses these issues by simultaneously training multiple nested dictionaries of increasing size, forcing the smaller dictionaries to independently reconstruct the inputs without using the larger dictionaries. This organizes features hierarchically - the smaller dictionaries learn general concepts, while the larger dictionaries learn more specific concepts, without incentive to absorb the high-level features. We train Matryoshka SAEs on Gemma-2-2B and TinyStories and find superior performance on sparse probing and targeted concept erasure tasks, more disentangled concept representations, and reduced feature absorption. While there is a minor tradeoff with reconstruction performance, we believe Matryoshka SAEs are a superior alternative for practical tasks, as they enable training arbitrarily large SAEs while retaining interpretable features at different levels of abstraction.
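As a concrete illustration of the nested-dictionary training described in the abstract, here is a minimal PyTorch sketch of a Matryoshka SAE objective: each prefix of the dictionary must reconstruct the input on its own. The prefix sizes, the ReLU encoder, the L1 sparsity penalty, and all hyperparameters below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatryoshkaSAE(nn.Module):
    """Sketch of an SAE whose nested dictionary prefixes each reconstruct the input."""

    def __init__(self, d_model: int, prefix_sizes=(1024, 4096, 16384)):
        super().__init__()
        dict_size = prefix_sizes[-1]          # full dictionary = largest nested level
        self.prefix_sizes = prefix_sizes
        self.encoder = nn.Linear(d_model, dict_size)
        self.decoder = nn.Linear(dict_size, d_model)

    def loss(self, x: torch.Tensor, l1_coeff: float = 1e-3) -> torch.Tensor:
        f = F.relu(self.encoder(x))           # sparse feature activations
        recon_losses = []
        for m in self.prefix_sizes:
            f_m = torch.zeros_like(f)
            f_m[..., :m] = f[..., :m]         # keep only the first m features
            x_hat = self.decoder(f_m)         # each prefix reconstructs independently
            recon_losses.append(F.mse_loss(x_hat, x))
        # Smaller dictionaries cannot lean on the later, more specific features,
        # so general concepts are pushed into the early (shared) features.
        return torch.stack(recon_losses).mean() + l1_coeff * f.abs().mean()
```

Calling `loss` on a batch of model activations of shape `[batch, d_model]` yields the combined objective to backpropagate; the first `m` decoder columns act as the `m`-sized nested dictionary.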
Related papers
- Empirical Evaluation of Progressive Coding for Sparse Autoencoders [45.94517951918044]
We show that dictionary importance in vanilla SAEs follows a power law.
We show Matryoshka SAEs exhibit lower reconstruction loss and recaptured language modeling loss.
arXiv Detail & Related papers (2025-04-30T21:08:32Z) - Multi-Sense Embeddings for Language Models and Knowledge Distillation [17.559171180573664]
Transformer-based large language models (LLMs) rely on contextual embeddings which generate different representations for the same token depending on its surrounding context.
We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of their uses in a language.
arXiv Detail & Related papers (2025-04-08T13:36:36Z) - Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models [16.894375498353092]
Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability.
Existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries.
We present Archetypal SAEs, wherein dictionary atoms are constrained to the convex hull of data.
arXiv Detail & Related papers (2025-02-18T14:29:11Z) - Large Concept Models: Language Modeling in a Sentence Representation Space [62.73366944266477]
We present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept.
Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow.
We show that our model exhibits impressive zero-shot generalization performance to many languages.
arXiv Detail & Related papers (2024-12-11T23:36:20Z) - Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks.
We present one of the first applications of SAEs to dense text embeddings from large language models.
We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
arXiv Detail & Related papers (2024-08-01T15:46:22Z) - Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning [0.9374652839580183]
Identifying the features learned by neural networks is a core challenge in mechanistic interpretability.
We propose end-to-end sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important.
We explore geometric and qualitative differences between e2e SAE features and standard SAE features.
arXiv Detail & Related papers (2024-05-17T17:03:46Z) - Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT [59.245414547751636]
We propose a circuit discovery framework alternative to activation patching.
Our framework suffers less from out-of-distribution issues and proves to be more efficient in terms of complexity.
We dig into a small transformer trained on a synthetic task named Othello and find a number of human-understandable fine-grained circuits inside it.
arXiv Detail & Related papers (2024-02-19T15:04:53Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Meta-Learning with Variational Semantic Memory for Word Sense Disambiguation [56.830395467247016]
We propose a model of semantic memory for WSD in a meta-learning setting.
Our model is based on hierarchical variational inference and incorporates an adaptive memory update rule via a hypernetwork.
We show that our model advances the state of the art in few-shot WSD and supports effective learning in extremely data-scarce scenarios.
arXiv Detail & Related papers (2021-06-05T20:40:01Z) - Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix.
On text classification, language modeling, and movie recommendation benchmarks, we show that ANT is particularly suitable for large vocabulary sizes (a minimal sketch of the anchor-plus-sparse-transform idea follows this list).
arXiv Detail & Related papers (2020-03-18T13:07:51Z)
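To make the Anchor & Transform entry above more concrete, below is a minimal PyTorch sketch of the idea as summarized here: token embeddings are built from a small set of learned anchor embeddings combined through a sparse transformation matrix. The class name, dimensions, and L1 penalty are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class AnchorTransformEmbedding(nn.Module):
    """Sketch: token embeddings = sparse transformation matrix @ small anchor set."""

    def __init__(self, vocab_size: int, num_anchors: int, d_model: int):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(num_anchors, d_model) * 0.02)
        # One row per token; an L1 penalty keeps each row sparse so every
        # token mixes only a handful of anchors.
        self.transform = nn.Parameter(torch.zeros(vocab_size, num_anchors))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embeddings = self.transform @ self.anchors   # (vocab_size, d_model)
        return embeddings[token_ids]

    def l1_penalty(self) -> torch.Tensor:
        return self.transform.abs().mean()
```

The `l1_penalty` term would be added to the task loss during training to encourage sparsity in the transformation matrix.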