Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
- URL: http://arxiv.org/abs/2410.11179v1
- Date: Tue, 15 Oct 2024 01:38:03 GMT
- Title: Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
- Authors: Kola Ayonrinde, Michael T. Pearce, Lee Sharkey
- Abstract summary: We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms.
We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity.
- Abstract: Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs for reconstruction loss and sparsity results in a preference for SAEs that are extremely wide and sparse. We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms for communicating explanations of neural activations. We appeal to the Minimal Description Length (MDL) principle to motivate explanations of activations which are both accurate and concise. We further argue that interpretable SAEs require an additional property, "independent additivity": features should be able to be understood separately. We demonstrate an example of applying our MDL-inspired framework by training SAEs on MNIST handwritten digits and find that SAE features representing significant line segments are optimal, as opposed to SAEs with features for memorised digits from the dataset or small digit fragments. We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity such as undesirable feature splitting and that this framework naturally suggests new hierarchical SAE architectures which provide more concise explanations.
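To make the description-length framing concrete, the sketch below estimates the cost of communicating an SAE's code for a batch of activations as the bits needed to name the firing latents plus the bits for their quantised coefficients. It is a minimal illustration under assumed encoding choices (uniform index codes, fixed-width coefficients, hypothetical dictionary sizes), not the authors' implementation.

```python
import numpy as np

def description_length_bits(codes: np.ndarray, dict_size: int,
                            bits_per_coeff: int = 8) -> float:
    """Average bits to communicate one activation's SAE code.

    Illustrative assumptions: each firing latent costs log2(dict_size) bits
    to name plus `bits_per_coeff` bits for a quantised coefficient. The
    paper's exact coding scheme may differ; this is only a sketch.
    """
    active = (codes != 0).sum(axis=1)            # L0 per activation
    index_bits = active * np.log2(dict_size)     # which latents fire
    coeff_bits = active * bits_per_coeff         # how strongly they fire
    return float((index_bits + coeff_bits).mean())

# Toy comparison on random codes with hypothetical sizes: a narrow SAE with
# roughly 10 active latents out of 512 vs. a very wide SAE with roughly 8
# active latents out of 65,536.
rng = np.random.default_rng(0)
narrow = rng.random((128, 512)) * (rng.random((128, 512)) < 10 / 512)
wide = rng.random((128, 65536)) * (rng.random((128, 65536)) < 8 / 65536)

print(description_length_bits(narrow, dict_size=512))    # ~10 * (9 + 8)  = ~170 bits
print(description_length_bits(wide, dict_size=65536))    # ~8  * (16 + 8) = ~192 bits
```

Under a pure sparsity (L0) criterion the wider code looks strictly better, while the description-length view also charges for naming latents in a larger dictionary, which is the trade-off the abstract argues should govern SAE size.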
Related papers
- Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders [8.003244901104111]
We propose a regularization technique for improving feature learning by encouraging SAEs trained in parallel to learn similar features.
MFR can improve the reconstruction loss of SAEs by up to 21.21% on GPT-2 Small and 6.67% on EEG data.
arXiv Detail & Related papers (2024-11-02T11:42:23Z) - Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models [26.748765050034876]
Specialized Sparse Autoencoders (SSAEs) illuminate elusive dark matter features by focusing on specific subdomains.
We show that SSAEs effectively capture subdomain tail concepts, exceeding the capabilities of general-purpose SAEs.
We showcase the practical utility of SSAEs in a case study on the Bias in Bios dataset, where SSAEs achieve a 12.5% increase in worst-group classification accuracy when applied to remove spurious gender information.
arXiv Detail & Related papers (2024-11-01T17:09:34Z) - Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z) - Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
Sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space.
We build an open-source pipeline to generate and evaluate natural language explanations for SAE features.
Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z) - Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks.
We present one of the first applications of SAEs to dense text embeddings from large language models.
We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
arXiv Detail & Related papers (2024-08-01T15:46:22Z) - Semantic Loss Functions for Neuro-Symbolic Structured Prediction [74.18322585177832]
We discuss the semantic loss, which injects knowledge about such structure, defined symbolically, into training.
It is agnostic to the arrangement of the symbols, and depends only on the semantics expressed thereby.
It can be combined with both discriminative and generative neural models.
arXiv Detail & Related papers (2024-05-12T22:18:25Z) - Improving Dictionary Learning with Gated Sparse Autoencoders [8.3037652157611]
The Gated Sparse Autoencoder (Gated SAE) is a technique for unsupervised discovery of interpretable features in language models' (LMs) activations.
In standard SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage (a minimal sketch of the usual L1-penalised objective appears after this list).
In training SAEs on LMs of up to 7B parameters, Gated SAEs resolve shrinkage and require half as many firing features to achieve comparable reconstruction fidelity.
arXiv Detail & Related papers (2024-04-24T17:47:22Z) - Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code.
At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes.
We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z) - Improving Self-Supervised Learning by Characterizing Idealized Representations [155.1457170539049]
We prove necessary and sufficient conditions on representations for solving any task invariant to the given data augmentations.
For contrastive learning, our framework prescribes simple but significant improvements to previous methods.
For non-contrastive learning, we use our framework to derive a simple and novel objective.
arXiv Detail & Related papers (2022-09-13T18:01:03Z) - Self-Supervised Learning Disentangled Group Representation as Feature [82.07737719232972]
We show that existing Self-Supervised Learning (SSL) only disentangles simple augmentation features such as rotation and colorization.
We propose an iterative SSL algorithm: Iterative Partition-based Invariant Risk Minimization (IP-IRM)
We prove that IP-IRM converges to a fully disentangled representation and show its effectiveness on various benchmarks.
arXiv Detail & Related papers (2021-10-28T16:12:33Z) - Discovering "Semantics" in Super-Resolution Networks [54.45509260681529]
Super-resolution (SR) is a fundamental and representative task in low-level vision.
It is generally thought that the features extracted from the SR network have no specific semantic information.
Can we find any "semantics" in SR networks?
arXiv Detail & Related papers (2021-08-01T09:12:44Z)
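For context on the L1-induced shrinkage mentioned in the Gated SAE entry above, here is a minimal sketch of the standard sparse-autoencoder objective that work of this kind modifies. The module, dimensions, and sparsity coefficient are illustrative assumptions, not any particular paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VanillaSAE(nn.Module):
    """Plain SAE: ReLU encoder, linear decoder, trained with MSE + L1."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        codes = F.relu(self.encoder(x))   # sparse, non-negative feature activations
        recon = self.decoder(codes)       # reconstruction of the input activation
        return recon, codes

def sae_loss(x, recon, codes, l1_coeff: float = 1e-3) -> torch.Tensor:
    # The L1 term rewards shrinking every coefficient toward zero, which is the
    # systematic underestimation ("shrinkage") that Gated SAEs aim to remove.
    return F.mse_loss(recon, x) + l1_coeff * codes.abs().sum(dim=-1).mean()

# Toy usage on random vectors standing in for a language model's activations.
sae = VanillaSAE(d_model=256, d_dict=4096)
x = torch.randn(32, 256)
recon, codes = sae(x)
sae_loss(x, recon, codes).backward()
```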