Tokenized SAEs: Disentangling SAE Reconstructions
- URL: http://arxiv.org/abs/2502.17332v1
- Date: Mon, 24 Feb 2025 17:04:24 GMT
- Title: Tokenized SAEs: Disentangling SAE Reconstructions
- Authors: Thomas Dooms, Daniel Wilhelm,
- Abstract summary: We show that RES-JB SAE features predominantly correspond to simple input statistics.<n>We propose a method that disentangles token reconstruction from feature reconstruction.
- Score: 0.9821874476902969
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse auto-encoders (SAEs) have become a prevalent tool for interpreting language models' inner workings. However, it is unknown how tightly SAE features correspond to computationally important directions in the model. This work empirically shows that many RES-JB SAE features predominantly correspond to simple input statistics. We hypothesize this is caused by a large class imbalance in training data combined with a lack of complex error signals. To reduce this behavior, we propose a method that disentangles token reconstruction from feature reconstruction. This improvement is achieved by introducing a per-token bias, which provides an enhanced baseline for interesting reconstruction. As a result, significantly more interesting features and improved reconstruction in sparse regimes are learned.
Related papers
- Empirical Evaluation of Progressive Coding for Sparse Autoencoders [45.94517951918044]
We show that dictionary importance in vanilla SAEs follows a power law.
We show Matryoshka SAEs exhibit lower reconstruction loss and recaptured language modeling loss.
arXiv Detail & Related papers (2025-04-30T21:08:32Z) - Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need [0.0]
Sparse autoencoders (SAEs) are widely used for interpreting language model activations.
Recent work introduced training SAEs directly with a combination of KL divergence and MSE.
We propose a brief KL+MSE fine-tuning step applied only to the final 25M training tokens.
arXiv Detail & Related papers (2025-03-21T16:15:49Z) - Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders [8.003244901104111]
We propose a regularization technique for improving feature learning by encouraging SAEs trained in parallel to learn similar features.
textscMFR can improve the reconstruction loss of SAEs by up to 21.21% on GPT-2 Small, and 6.67% on EEG data.
arXiv Detail & Related papers (2024-11-02T11:42:23Z) - Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs [0.0]
We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms.
We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity.
arXiv Detail & Related papers (2024-10-15T01:38:03Z) - Rethinking Pruning Large Language Models: Benefits and Pitfalls of Reconstruction Error Minimization [18.24882084542254]
We present an array of reconstruction techniques that can significantly reduce this error by more than $90%$.
We find out that a strategy of self-generating calibration data can mitigate this trade-off between reconstruction and generalization.
arXiv Detail & Related papers (2024-06-21T05:13:34Z) - Deep Generative Symbolic Regression [83.04219479605801]
Symbolic regression aims to discover concise closed-form mathematical equations from data.
Existing methods, ranging from search to reinforcement learning, fail to scale with the number of input variables.
We propose an instantiation of our framework, Deep Generative Symbolic Regression.
arXiv Detail & Related papers (2023-12-30T17:05:31Z) - REBAR: Retrieval-Based Reconstruction for Time-series Contrastive Learning [64.08293076551601]
We propose a novel method of using a learned measure for identifying positive pairs.
Our Retrieval-Based Reconstruction measure measures the similarity between two sequences.
We show that the REBAR error is a predictor of mutual class membership.
arXiv Detail & Related papers (2023-11-01T13:44:45Z) - Understanding Augmentation-based Self-Supervised Representation Learning
via RKHS Approximation and Regression [53.15502562048627]
Recent work has built the connection between self-supervised learning and the approximation of the top eigenspace of a graph Laplacian operator.
This work delves into a statistical analysis of augmentation-based pretraining.
arXiv Detail & Related papers (2023-06-01T15:18:55Z) - Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-re (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z) - Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner-workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z) - Learning an Invertible Output Mapping Can Mitigate Simplicity Bias in
Neural Networks [66.76034024335833]
We investigate why diverse/ complex features are learned by the backbone, and their brittleness is due to the linear classification head relying primarily on the simplest features.
We propose Feature Reconstruction Regularizer (FRR) to ensure that the learned features can be reconstructed back from the logits.
We demonstrate up to 15% gains in OOD accuracy on the recently introduced semi-synthetic datasets with extreme distribution shifts.
arXiv Detail & Related papers (2022-10-04T04:01:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.