Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need
- URL: http://arxiv.org/abs/2503.17272v2
- Date: Sat, 29 Mar 2025 17:42:21 GMT
- Title: Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need
- Authors: Adam Karvonen
- Abstract summary: Sparse autoencoders (SAEs) are widely used for interpreting language model activations. Recent work introduced training SAEs directly with a combination of KL divergence and MSE. We propose a brief KL+MSE fine-tuning step applied only to the final 25M training tokens.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders (SAEs) are widely used for interpreting language model activations. A key evaluation metric is the increase in cross-entropy loss between the original model logits and the reconstructed model logits when replacing model activations with SAE reconstructions. Typically, SAEs are trained solely on mean squared error (MSE) when reconstructing precomputed, shuffled activations. Recent work introduced training SAEs directly with a combination of KL divergence and MSE ("end-to-end" SAEs), significantly improving reconstruction accuracy at the cost of substantially increased computation, which has limited their widespread adoption. We propose a brief KL+MSE fine-tuning step applied only to the final 25M training tokens (just a few percent of typical training budgets) that achieves comparable improvements, reducing the cross-entropy loss gap by 20-50%, while incurring minimal additional computational cost. We further find that multiple fine-tuning methods (KL fine-tuning, LoRA adapters, linear adapters) yield similar, non-additive cross-entropy improvements, suggesting a common, easily correctable error source in MSE-trained SAEs. We demonstrate a straightforward method for effectively transferring hyperparameters and sparsity penalties between training phases despite scale differences between KL and MSE losses. While both ReLU and TopK SAEs see significant cross-entropy loss improvements, evaluations on supervised SAEBench metrics yield mixed results, with improvements on some metrics and decreases on others, depending on both the SAE architecture and downstream task. Nonetheless, our method may offer meaningful improvements in interpretability applications such as circuit analysis with minor additional cost.
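The core of the method is a short second training phase on a combined objective: MSE between the SAE reconstruction and the original activations, plus KL divergence between the model's original logits and its logits when the reconstruction is spliced back into the forward pass. Below is a minimal PyTorch sketch of one such step, assuming a frozen `lm` that maps token ids to logits, a `layer_module` whose output the SAE reconstructs, and a `kl_weight` hyperparameter; these are illustrative names, not the authors' code.

```python
import torch
import torch.nn.functional as F

def kl_mse_finetune_step(lm, sae, tokens, layer_module, kl_weight, optimizer):
    """One fine-tuning step on the combined KL+MSE objective.
    `lm` maps token ids to logits and stays frozen; only `sae` is updated."""
    cache = {}

    def cache_hook(mod, inp, out):
        cache["acts"] = out.detach()  # activations the SAE must reconstruct

    def patch_hook(mod, inp, out):
        return sae(out)  # splice the SAE reconstruction into the forward pass

    # Clean pass: record the original logits and activations.
    handle = layer_module.register_forward_hook(cache_hook)
    with torch.no_grad():
        clean_logits = lm(tokens)
    handle.remove()

    # Patched pass: run the model with the layer's output replaced by sae(out).
    handle = layer_module.register_forward_hook(patch_hook)
    patched_logits = lm(tokens)
    handle.remove()

    mse = F.mse_loss(sae(cache["acts"]), cache["acts"])
    kl = F.kl_div(
        F.log_softmax(patched_logits, dim=-1),
        F.log_softmax(clean_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    loss = kl_weight * kl + mse

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the paper's setup, a step like this would run only over the final ~25M tokens of an otherwise standard MSE-only training run.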
Related papers
- Tokenized SAEs: Disentangling SAE Reconstructions [0.9821874476902969]
We show that RES-JB SAE features predominantly correspond to simple input statistics.
We propose a method that disentangles token reconstruction from feature reconstruction.
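One way to realize this disentanglement, sketched here under assumptions (the paper's exact architecture may differ), is to give the SAE decoder a per-token bias from a lookup table, so simple token-identity statistics are reconstructed by the table rather than by learned features:

```python
import torch
import torch.nn as nn

class TokenizedSAE(nn.Module):
    """Sketch: an SAE whose decoder adds a per-token bias from a lookup
    table, freeing the learned dictionary from reconstructing token
    identity. Illustrative layout, not the authors' exact architecture."""

    def __init__(self, d_model: int, d_sae: int, vocab_size: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model, bias=False)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.token_bias = nn.Embedding(vocab_size, d_model)  # token reconstruction path

    def forward(self, x: torch.Tensor, token_ids: torch.Tensor):
        f = torch.relu(self.enc(x - self.b_dec))  # sparse feature activations
        recon = self.dec(f) + self.b_dec + self.token_bias(token_ids)
        return recon, f
```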
arXiv Detail & Related papers (2025-02-24T17:04:24Z)
- Low-Rank Adapting Models for Sparse Autoencoders [6.932760557251821]
We use low-rank adaptation (LoRA) to finetune the language model itself around a previously trained SAE. We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs.
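For context, a minimal LoRA wrapper (illustrative names, not the authors' code): the base weight is frozen and only a low-rank update is trained, which is what makes adapting the LM around a frozen SAE cheap:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: a frozen base linear layer plus a trainable
    low-rank update x @ A @ B. Only A and B receive gradients."""

    def __init__(self, base: nn.Linear, rank: int, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))  # zero init: no-op at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A @ self.B) * self.scale
```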
arXiv Detail & Related papers (2025-01-31T18:59:16Z)
- The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
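The proposed law keeps the Chinchilla form L = E + A/N^a + B/D^b but substitutes the average parameter count over the run for N. A small sketch of computing that average, assuming a linear pruning schedule for illustration (the paper's exact trajectory may differ):

```python
def average_param_count(n_dense: float, n_final: float, total_steps: int,
                        start_frac: float = 0.25, end_frac: float = 0.75) -> float:
    """Average parameter count over pre-training for a run that stays dense
    until `start_frac` of training, prunes down to `n_final` by `end_frac`
    (linearly, as an illustrative assumption), then stays sparse. This
    average replaces N in the Chinchilla-style loss L = E + A/N^a + B/D^b."""
    total = 0.0
    for step in range(total_steps):
        t = step / total_steps
        if t < start_frac:
            n = n_dense          # dense phase
        elif t < end_frac:
            frac = (t - start_frac) / (end_frac - start_frac)
            n = n_dense + frac * (n_final - n_dense)  # linear pruning phase
        else:
            n = n_final          # sparse phase
        total += n
    return total / total_steps

# e.g. a 1B-parameter model pruned to 250M over the middle half of training:
# average_param_count(1e9, 2.5e8, 10_000)
```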
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
- A deeper look at depth pruning of LLMs [49.30061112976263]
Large Language Models (LLMs) are resource-intensive to train but even more costly to deploy in production.
Recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance.
We show that adaptive metrics exhibit a trade-off in performance between tasks.
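One common cheap proxy of this kind scores a transformer block by how little it changes the residual stream; blocks with near-identical input and output are pruning candidates. This is a generic sketch of such a proxy, not necessarily the exact metric the paper studies:

```python
import torch
import torch.nn.functional as F

def block_influence(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Cheap block-importance proxy: 1 - cosine similarity between a
    block's input and output hidden states, averaged over tokens.
    Low scores mean the block barely transforms the residual stream."""
    cos = F.cosine_similarity(hidden_in, hidden_out, dim=-1)
    return (1.0 - cos).mean().item()
```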
arXiv Detail & Related papers (2024-07-23T08:40:27Z)
- Improving Dictionary Learning with Gated Sparse Autoencoders [8.3037652157611]
The Gated Sparse Autoencoder (Gated SAE) is a technique for unsupervised discovery of interpretable features in language model (LM) activations.
In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage.
In training SAEs on LMs of up to 7B parameters, Gated SAEs resolve shrinkage and require half as many firing features to achieve comparable reconstruction fidelity.
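The gate/magnitude split can be sketched as follows: a binary gate decides which features fire while a separate magnitude path decides how strongly, so the sparsity penalty (applied to the gate pre-activations) no longer shrinks active feature values. The weight sharing and auxiliary loss of the published architecture are simplified here:

```python
import torch
import torch.nn as nn

class GatedSAE(nn.Module):
    """Simplified Gated SAE sketch: gate path selects which features are
    active, magnitude path sets their strength. Not the full published
    architecture (auxiliary reconstruction loss omitted)."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.r_mag = nn.Parameter(torch.zeros(d_sae))  # per-feature rescale (weight sharing)
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x: torch.Tensor):
        x_c = x - self.dec.bias
        pi_gate = x_c @ self.W_enc + self.b_gate
        gate = (pi_gate > 0).float()                 # binary: which features are on
        mag = torch.relu(x_c @ (self.W_enc * self.r_mag.exp()) + self.b_mag)
        f = gate * mag                               # sparse feature activations
        l1 = torch.relu(pi_gate).sum(-1).mean()      # sparsity penalty on the gate path only
        return self.dec(f), f, l1
```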
arXiv Detail & Related papers (2024-04-24T17:47:22Z)
- Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple logits retargeting approach (LORT) that requires no prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
- CR-SAM: Curvature Regularized Sharpness-Aware Minimization [8.248964912483912]
Sharpness-Aware Minimization (SAM) aims to enhance generalization by minimizing the worst-case loss, using one-step gradient ascent as an approximation.
In this paper, we introduce a normalized Hessian trace to accurately measure the curvature of the loss landscape on both training and test sets.
In particular, to counter excessive non-linearity of the loss landscape, we propose Curvature Regularized SAM (CR-SAM).
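The key quantity is the trace of the loss Hessian, which is tractable via Hutchinson's estimator with Hessian-vector products. A generic sketch of that estimator follows; CR-SAM's normalization and weighting are paper details not reproduced here:

```python
import torch

def hessian_trace_hutchinson(loss, params, n_samples: int = 8):
    """Hutchinson estimate of tr(H), the Hessian trace of `loss` w.r.t.
    `params`: E[v^T H v] over random Rademacher vectors v, using
    Hessian-vector products via double backprop."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    trace = torch.zeros((), device=grads[0].device)
    for _ in range(n_samples):
        vs = [torch.randint_like(p, 2).mul_(2.0).sub_(1.0) for p in params]  # +/-1 entries
        hv = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        trace = trace + sum((v * h).sum() for v, h in zip(vs, hv))
    return trace / n_samples
```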
arXiv Detail & Related papers (2023-12-21T03:46:29Z)
- Systematic Investigation of Sparse Perturbed Sharpness-Aware Minimization Optimizer [158.2634766682187]
Deep neural networks often suffer from poor generalization due to complex, non-convex loss landscapes.
Sharpness-Aware Minimization (SAM) is a popular solution that smooths the loss landscape by minimizing its change when a perturbation is added to the weights.
In this paper, we propose Sparse SAM (SSAM), an efficient and effective training scheme that achieves perturbation by a binary mask.
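Concretely, SAM perturbs the weights with eps = rho * g / ||g|| before the descent step; SSAM applies that perturbation only where a binary mask is 1. A sketch assuming the masks are given (constructing them, e.g. from Fisher information or gradient magnitude, is the paper's contribution):

```python
import torch

def ssam_perturbation(params, rho: float, masks):
    """Sparse-SAM-style ascent step: the usual SAM perturbation
    eps = rho * g / ||g||, applied only where the binary mask is 1,
    leaving most weights unperturbed. Call after loss.backward()."""
    grads = [p.grad for p in params]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    eps = []
    for p, g, m in zip(params, grads, masks):
        e = rho * g / (norm + 1e-12) * m  # masked ascent direction
        p.data.add_(e)                    # move to the perturbed point
        eps.append(e)
    return eps  # subtract these later to restore the original weights
```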
arXiv Detail & Related papers (2023-06-30T09:33:41Z)
- Decoupled Kullback-Leibler Divergence Loss [90.54331083430597]
We prove that the Kullback-Leibler (KL) Divergence loss is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss.
We introduce class-wise global information into KL/DKL to reduce bias from individual samples.
The proposed approach achieves new state-of-the-art adversarial robustness on the public leaderboard.
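For reference, the standard (coupled) KL divergence loss between two sets of logits, as used in knowledge distillation and adversarial training, is shown below; the paper's contribution is an equivalent decoupled re-expression of it. Note that PyTorch's `kl_div` expects log-probabilities as input:

```python
import torch
import torch.nn.functional as F

def kl_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
            temperature: float = 1.0) -> torch.Tensor:
    """Standard KL divergence loss between two sets of logits, the
    starting point that the DKL paper decomposes and re-weights."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.log_softmax(teacher_logits / t, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (t * t)  # rescale so gradients are temperature-invariant
```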
arXiv Detail & Related papers (2023-05-23T11:17:45Z)
- SAFE: Machine Unlearning With Shard Graphs [100.12621304361288]
We present Synergy Aware Forgetting Ensemble (SAFE), a method to adapt large models on a diverse collection of data.
SAFE uses a lightweight system of adapters which can be trained while reusing most of the computations.
This allows SAFE to be trained on shards an order of magnitude smaller than those used by current state-of-the-art methods.
arXiv Detail & Related papers (2023-04-25T22:02:09Z)
- A Simple Adaptive Unfolding Network for Hyperspectral Image Reconstruction [33.53825801739728]
We present a simple, efficient, and scalable unfolding network, SAUNet, to simplify the network design.
SAUNet can be scaled to a non-trivial 13 stages with continuous improvement.
We set new records on CAVE and KAIST HSI reconstruction benchmarks.
arXiv Detail & Related papers (2023-01-24T18:28:21Z)
- Sharpness-Aware Training for Free [163.1248341911413]
Sharpness-Aware Minimization (SAM) has shown that minimizing a sharpness measure, which reflects the geometry of the loss landscape, can significantly reduce the generalization error.
Sharpness-Aware Training for Free (SAF) mitigates sharp landscapes at almost zero additional computational cost over the base optimizer.
SAF ensures convergence to a flat minimum with improved generalization capabilities.
arXiv Detail & Related papers (2022-05-27T16:32:43Z)