Empirical Evaluation of Progressive Coding for Sparse Autoencoders
- URL: http://arxiv.org/abs/2505.00190v1
- Date: Wed, 30 Apr 2025 21:08:32 GMT
- Title: Empirical Evaluation of Progressive Coding for Sparse Autoencoders
- Authors: Hans Peter, Anders Søgaard
- Abstract summary: We show that dictionary importance in vanilla SAEs follows a power law. We show Matryoshka SAEs exhibit lower reconstruction loss and recaptured language modeling loss.
- Score: 45.94517951918044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders (SAEs) (Bricken et al., 2023; Gao et al., 2024) rely on dictionary learning to extract interpretable features from neural networks at scale in an unsupervised manner, with applications to representation engineering and information retrieval. SAEs are, however, computationally expensive (Lieberum et al., 2024), especially when multiple SAEs of different sizes are needed. We show that dictionary importance in vanilla SAEs follows a power law. We compare progressive coding based on subset pruning of SAEs with jointly trained nested SAEs, so-called Matryoshka SAEs (Bussmann et al., 2024; Nabeshima, 2024), on a language modeling task. We show Matryoshka SAEs exhibit lower reconstruction loss and recaptured language modeling loss, as well as higher representational similarity. Pruned vanilla SAEs are more interpretable, however. We discuss the origins and implications of this trade-off.
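The pruning baseline described in the abstract can be made concrete: take a trained vanilla SAE, rank its dictionary atoms by an importance score, and evaluate reconstruction using only nested prefixes of the ranked dictionary. Below is a minimal PyTorch sketch under assumed conventions (a ReLU SAE exposing W_enc, b_enc, W_dec, b_dec); the importance proxy (mean activation times decoder-row norm) and all names are illustrative assumptions, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def prefix_reconstruction_losses(sae, acts, sizes):
    """Progressive coding by subset pruning of a vanilla SAE (illustrative sketch).

    sae  : object exposing W_enc (d_model, d_sae), b_enc (d_sae,),
           W_dec (d_sae, d_model), b_dec (d_model,) for a ReLU SAE.
    acts : (n_tokens, d_model) activations to reconstruct.
    sizes: nested dictionary sizes to evaluate, e.g. [1024, 2048, 4096].
    """
    # Encode once with the full dictionary.
    z = torch.relu((acts - sae.b_dec) @ sae.W_enc + sae.b_enc)      # (n, d_sae)

    # One plausible importance proxy per atom: mean activation times decoder norm.
    importance = z.mean(dim=0) * sae.W_dec.norm(dim=1)              # (d_sae,)
    order = importance.argsort(descending=True)

    losses = {}
    for k in sizes:
        keep = order[:k]                                            # top-k most important atoms
        recon = z[:, keep] @ sae.W_dec[keep] + sae.b_dec            # decode with the pruned subset
        losses[k] = torch.mean((recon - acts) ** 2).item()
    return losses
```

Plotting the sorted importance scores on log-log axes is one way to see the power-law decay the abstract refers to.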
Related papers
- Learning Multi-Level Features with Matryoshka Sparse Autoencoders [2.039341938086125]
Matryoshka SAEs are a novel variant of the SAE dictionary. We train Matryoshka SAEs on Gemma-2-2B and TinyStories, and find superior performance on sparse probing and targeted concept erasure tasks. A minimal sketch of the nested training objective follows this entry.
arXiv Detail & Related papers (2025-03-21T21:43:28Z)
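For contrast with pruning, here is a minimal sketch of a jointly trained nested ("Matryoshka") objective, assuming the same SAE layout as above: the reconstruction loss is summed over several dictionary prefixes, so early features must carry reconstruction on their own. The prefix sizes and penalty form are illustrative assumptions, not the exact recipe of either Matryoshka SAE paper.

```python
import torch

def matryoshka_sae_loss(sae, acts, prefix_sizes=(1024, 4096, 16384), l1_coeff=1e-3):
    """Joint reconstruction loss over nested dictionary prefixes (illustrative sketch)."""
    z = torch.relu((acts - sae.b_dec) @ sae.W_enc + sae.b_enc)       # (n, d_sae)
    loss = l1_coeff * z.abs().sum(dim=-1).mean()                     # shared sparsity penalty
    for k in prefix_sizes:
        recon_k = z[:, :k] @ sae.W_dec[:k] + sae.b_dec               # decode with the first k atoms only
        loss = loss + torch.mean((recon_k - acts) ** 2)              # every prefix must reconstruct
    return loss
```

In practice the largest prefix is typically the full dictionary, so the full SAE is trained alongside its nested sub-dictionaries.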
- Tokenized SAEs: Disentangling SAE Reconstructions [0.9821874476902969]
We show that RES-JB SAE features predominantly correspond to simple input statistics.
We propose a method that disentangles token reconstruction from feature reconstruction.
arXiv Detail & Related papers (2025-02-24T17:04:24Z)
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
- Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs [0.0]
We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms.
We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity. A rough sketch of one such description-length score follows this entry.
arXiv Detail & Related papers (2024-10-15T01:38:03Z)
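One way to read the MDL framing above: score an SAE code by an approximate description length (bits to name which latents are active plus bits to transmit their values) instead of the raw L0 count. The sketch below is a loose illustration under that assumption and is not the metric defined in the paper.

```python
import math
import torch

def description_length_bits(z, value_bits=8):
    """Rough MDL-style cost of a batch of SAE codes, in bits per token (illustrative sketch).

    z          : (n, d_sae) non-negative latent activations.
    value_bits : assumed precision for transmitting each active value.
    """
    n, d_sae = z.shape
    l0 = (z > 0).float().sum(dim=-1).mean().item()            # average number of active latents
    # Bits to name the active latents: log2(C(d_sae, l0)) ~= l0 * log2(d_sae / l0) for small l0.
    index_bits = l0 * math.log2(d_sae / max(l0, 1e-9))
    # Bits to send the active values at fixed precision.
    value_cost = l0 * value_bits
    return index_bits + value_cost
```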
- Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks.
We present one of the first applications of SAEs to dense text embeddings from large language models.
We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
arXiv Detail & Related papers (2024-08-01T15:46:22Z)
- Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models [18.77400885091398]
We propose to measure progress in interpretable dictionary learning by working in the setting of LMs trained on chess and Othello transcripts.
We introduce a new SAE training technique, p-annealing, which improves performance on prior unsupervised metrics as well as our new metrics. A hedged sketch of p-annealing follows this entry.
arXiv Detail & Related papers (2024-07-31T18:45:13Z)
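p-annealing is only named in the entry above. A common reading, assumed here, is an L_p sparsity penalty whose exponent is annealed from 1 toward a smaller value during SAE training, so the penalty starts out convex and gradually approximates an L0 count; treat the schedule and constants below as illustrative assumptions rather than the paper's exact setup.

```python
import torch

def p_annealed_sparsity_penalty(z, step, total_steps, p_start=1.0, p_end=0.2, eps=1e-8):
    """L_p sparsity penalty with an annealed exponent (illustrative sketch).

    z: (n, d_sae) non-negative SAE activations. The exponent moves linearly from
    p_start toward p_end, so the penalty begins as a convex L1 term and gradually
    approximates an L0 count of active latents.
    """
    t = min(step / max(total_steps, 1), 1.0)
    p = p_start + t * (p_end - p_start)
    return ((z.abs() + eps) ** p).sum(dim=-1).mean()
```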
- What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary [68.77983831618685]
We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space.
We show that the resulting projections contain rich semantic information, and draw connections between them and sparse retrieval. A minimal sketch of such a vocabulary projection follows this entry.
arXiv Detail & Related papers (2022-12-20T16:03:25Z)
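The projection described above can be sketched by scoring a dense query or passage vector against a token embedding matrix and normalising, which yields a distribution over the vocabulary. The helper below is a hypothetical illustration using a generic embedding matrix rather than any specific model's head.

```python
import torch

def project_to_vocab(dense_vec, token_embeddings, top_k=10):
    """Read a dual-encoder vector as a distribution over the vocabulary (illustrative sketch).

    dense_vec        : (d_model,) query or passage representation.
    token_embeddings : (vocab_size, d_model) output embedding matrix of the encoder.
    """
    logits = token_embeddings @ dense_vec          # score every vocabulary item
    probs = torch.softmax(logits, dim=-1)          # distribution over the vocabulary
    top = probs.topk(top_k)                        # tokens most associated with the vector
    return top.indices, top.values
```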
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
A self-distilled masked autoencoder network, SdAE, is proposed in this paper.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.