Efficient Dictionary Learning with Switch Sparse Autoencoders
- URL: http://arxiv.org/abs/2410.08201v1
- Date: Thu, 10 Oct 2024 17:59:11 GMT
- Title: Efficient Dictionary Learning with Switch Sparse Autoencoders
- Authors: Anish Mudide, Joshua Engels, Eric J. Michaud, Max Tegmark, Christian Schroeder de Witt
- Abstract summary: We introduce Switch Sparse Autoencoders, a novel SAE architecture aimed at reducing the compute cost of training SAEs.
Inspired by sparse mixture of experts models, Switch SAEs route activation vectors between smaller "expert" SAEs.
We find that Switch SAEs deliver a substantial improvement in the reconstruction vs. sparsity frontier for a given fixed training compute budget.
- Score: 8.577217344304072
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse autoencoders (SAEs) are a recent technique for decomposing neural network activations into human-interpretable features. However, in order for SAEs to identify all features represented in frontier models, it will be necessary to scale them up to very high width, posing a computational challenge. In this work, we introduce Switch Sparse Autoencoders, a novel SAE architecture aimed at reducing the compute cost of training SAEs. Inspired by sparse mixture of experts models, Switch SAEs route activation vectors between smaller "expert" SAEs, enabling SAEs to efficiently scale to many more features. We present experiments comparing Switch SAEs with other SAE architectures, and find that Switch SAEs deliver a substantial Pareto improvement in the reconstruction vs. sparsity frontier for a given fixed training compute budget. We also study the geometry of features across experts, analyze features duplicated across experts, and verify that Switch SAE features are as interpretable as features found by other SAE architectures.
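To make the routing idea concrete, below is a minimal sketch of a Switch-style SAE in PyTorch: a linear router assigns each activation vector to a single small TopK expert SAE, and only that expert encodes and decodes the vector. The dimensions, the TopK activation, and the router-probability scaling are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a Switch-style sparse autoencoder (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchSAE(nn.Module):
    def __init__(self, d_model=768, n_experts=8, d_expert=2048, k=32):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)          # picks one expert per activation
        self.enc = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.01)
        self.dec = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.01)
        self.bias = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):                                    # x: (batch, d_model)
        probs = F.softmax(self.router(x), dim=-1)            # (batch, n_experts)
        expert = probs.argmax(dim=-1)                        # hard top-1 routing
        recon = torch.zeros_like(x)
        for e in range(self.enc.shape[0]):
            mask = expert == e
            if not mask.any():
                continue
            pre = (x[mask] - self.bias) @ self.enc[e]        # encode with expert e only
            topk = torch.topk(pre, self.k, dim=-1)
            feats = torch.zeros_like(pre).scatter_(-1, topk.indices, F.relu(topk.values))
            # scale by the router probability so the router receives gradient (Switch-Transformer style)
            recon[mask] = probs[mask, e:e+1] * (feats @ self.dec[e]) + self.bias
        return recon


x = torch.randn(16, 768)
model = SwitchSAE()
loss = F.mse_loss(model(x), x)                               # reconstruction objective
loss.backward()
```

Because only one expert's weights touch each activation, the encoder/decoder FLOPs per input scale with the expert width rather than the total number of features, which is where the compute savings come from.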
Related papers
- Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders [8.003244901104111]
We propose a regularization technique for improving feature learning by encouraging SAEs trained in parallel to learn similar features.
MFR can improve the reconstruction loss of SAEs by up to 21.21% on GPT-2 Small, and by 6.67% on EEG data.
arXiv Detail & Related papers (2024-11-02T11:42:23Z)
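As a rough illustration of encouraging parallel SAEs to learn similar features, the snippet below penalises decoder directions in one SAE that have no close counterpart in a second SAE trained alongside it. This is a hedged sketch of one plausible alignment penalty, not the paper's MFR objective; the function name and matching rule are assumptions.

```python
# Hypothetical cross-SAE feature-alignment penalty (illustrative only).
import torch
import torch.nn.functional as F

def alignment_penalty(dec_a: torch.Tensor, dec_b: torch.Tensor) -> torch.Tensor:
    """dec_a, dec_b: decoder weight matrices of shape (n_features, d_model)."""
    a = F.normalize(dec_a, dim=-1)
    b = F.normalize(dec_b, dim=-1)
    sims = a @ b.T                       # pairwise cosine similarities between features
    best = sims.max(dim=-1).values       # best match in the other SAE for each feature
    return (1.0 - best).mean()           # small when every feature has a near-duplicate

penalty = alignment_penalty(torch.randn(4096, 768), torch.randn(4096, 768))
```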
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
- Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs [0.0]
We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms.
We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity.
arXiv Detail & Related papers (2024-10-15T01:38:03Z)
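A toy version of the compression view: score an SAE code by a two-part description length, bits to name the active features and their values plus bits for the residual under a Gaussian model. The coding scheme below is an assumption chosen for illustration, not the MDL-SAE formulation.

```python
# Toy two-part description length for an SAE code (illustrative coding scheme, not the paper's).
import math
import torch

def description_length(feats: torch.Tensor, residual: torch.Tensor,
                       bits_per_value: float = 8.0) -> float:
    """feats: (batch, n_features) sparse codes; residual: (batch, d_model) reconstruction error."""
    n_features = feats.shape[1]
    active = (feats != 0).sum().item()
    index_bits = active * math.log2(n_features)       # which features fired
    value_bits = active * bits_per_value               # their magnitudes at fixed precision
    var = residual.var().clamp_min(1e-8)               # Gaussian fit to the residual
    residual_bits = 0.5 * residual.numel() * torch.log2(2 * math.pi * math.e * var).item()
    return index_bits + value_bits + residual_bits

feats = torch.relu(torch.randn(8, 4096)) * (torch.rand(8, 4096) > 0.99)
dl = description_length(feats, torch.randn(8, 768) * 0.1)
```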
- Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small [6.306964287762374]
We evaluate whether SAEs trained on hidden representations of GPT-2 small have sets of features that mediate knowledge of which country a city is in and which continent it is in.
Our results show that SAEs struggle to reach the neuron baseline, and none come close to the DAS skyline.
arXiv Detail & Related papers (2024-09-05T18:00:37Z)
- Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning [0.9374652839580183]
Identifying the features learned by neural networks is a core challenge in mechanistic interpretability.
We propose end-to-end sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important.
We explore geometric and qualitative differences between e2e SAE features and standard SAE features.
arXiv Detail & Related papers (2024-05-17T17:03:46Z)
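The functional-importance idea can be written as a loss term: patch the SAE reconstruction back into the model and score how much the output distribution shifts, rather than scoring the activation reconstruction alone. The sketch below assumes a KL-divergence objective on logits; the callable standing in for the rest of the model and all names are placeholders.

```python
# Hedged sketch of an end-to-end SAE loss: match downstream behaviour, not just activations.
import torch
import torch.nn.functional as F

def e2e_loss(model_logits_fn, acts: torch.Tensor, sae: torch.nn.Module) -> torch.Tensor:
    """model_logits_fn: callable that runs the rest of the model from the SAE's hook point.
    acts: (batch, d_model) activations at that hook point."""
    logits_clean = model_logits_fn(acts)                    # original behaviour
    logits_patched = model_logits_fn(sae(acts))             # behaviour with the SAE reconstruction
    return F.kl_div(F.log_softmax(logits_patched, dim=-1),
                    F.softmax(logits_clean, dim=-1),
                    reduction="batchmean")

head = torch.nn.Linear(768, 50257)      # stand-in for "the rest of the model"
sae = torch.nn.Sequential(torch.nn.Linear(768, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 768))
loss = e2e_loss(head, torch.randn(4, 768), sae)
```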
- Demystify Transformers & Convolutions in Modern Image Deep Networks [82.32018252867277]
This paper aims to identify the real gains of popular convolution and attention operators through a detailed study.
We find that the key difference among these feature transformation modules, such as attention or convolution, lies in their spatial feature aggregation approach.
Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs.
arXiv Detail & Related papers (2022-11-10T18:59:43Z)
- An Overview of Advances in Signal Processing Techniques for Classical and Quantum Wideband Synthetic Apertures [67.73886953504947]
Synthetic aperture (SA) systems generate a larger aperture with greater angular resolution than is inherently possible from the physical dimensions of a single sensor alone.
We provide a brief overview of emerging signal processing trends in such spatially and spectrally wideband SA systems.
In particular, we cover the theoretical framework and practical underpinnings of wideband SA radar, channel sounding, sonar, radiometry, and optical applications.
arXiv Detail & Related papers (2022-05-11T16:19:04Z)
- Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction [138.04956118993934]
We propose a novel Transformer-based method, coarse-to-fine sparse Transformer (CST), which embeds HSI sparsity into deep learning for HSI reconstruction.
In particular, CST uses our proposed spectra-aware screening mechanism (SASM) for coarse patch selection. The selected patches are then fed into our customized spectra-aggregation hashing multi-head self-attention (SAH-MSA) for fine pixel clustering and capturing self-similarity.
arXiv Detail & Related papers (2022-03-09T16:17:47Z)
- Taming Sparsely Activated Transformer with Stochastic Experts [76.0711573018493]
Sparsely activated models (SAMs) can easily scale to outrageously large numbers of parameters without a significant increase in computational cost.
In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts).
Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference.
arXiv Detail & Related papers (2021-10-08T17:15:47Z)
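The stochastic-routing idea is simple to state in code: instead of a learned router, each forward pass draws an expert uniformly at random, during both training and inference. The layer below is a minimal sketch of that behaviour (THOR's consistency regulariser between two sampled experts is omitted); shapes and names are illustrative.

```python
# Minimal sketch of stochastic expert routing (illustrative, not the THOR implementation).
import torch
import torch.nn as nn

class StochasticExpertFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                        # x: (batch, seq, d_model)
        expert = torch.randint(len(self.experts), (1,)).item()   # random expert, train and test
        return self.experts[expert](x)

y = StochasticExpertFFN()(torch.randn(2, 10, 512))
```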
- Lightweight Single-Image Super-Resolution Network with Attentive Auxiliary Feature Learning [73.75457731689858]
We develop a computationally efficient yet accurate network based on the proposed attentive auxiliary features (A²F) for SISR.
Experimental results on a large-scale dataset demonstrate the effectiveness of the proposed model against state-of-the-art (SOTA) SR methods.
arXiv Detail & Related papers (2020-11-13T06:01:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.