Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
- URL: http://arxiv.org/abs/2410.11179v1
- Date: Tue, 15 Oct 2024 01:38:03 GMT
- Title: Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
- Authors: Kola Ayonrinde, Michael T. Pearce, Lee Sharkey
- Abstract summary: We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms.
We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity.
- Abstract: Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs for reconstruction loss and sparsity results in a preference for SAEs that are extremely wide and sparse. We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms for communicating explanations of neural activations. We appeal to the Minimal Description Length (MDL) principle to motivate explanations of activations which are both accurate and concise. We further argue that interpretable SAEs require an additional property, "independent additivity": features should be able to be understood separately. We demonstrate an example of applying our MDL-inspired framework by training SAEs on MNIST handwritten digits and find that SAE features representing significant line segments are optimal, as opposed to SAEs with features for memorised digits from the dataset or small digit fragments. We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity such as undesirable feature splitting and that this framework naturally suggests new hierarchical SAE architectures which provide more concise explanations.
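To make the description-length framing concrete, the sketch below estimates the cost of communicating an SAE's code for a batch of activations as the bits needed to name the firing latents plus the bits for their quantised coefficients. It is a minimal illustration under assumed encoding choices (uniform index codes, fixed-width coefficients, hypothetical dictionary sizes), not the authors' implementation.

```python
import numpy as np

def description_length_bits(codes: np.ndarray, dict_size: int,
                            bits_per_coeff: int = 8) -> float:
    """Average bits to communicate one activation's SAE code.

    Illustrative assumptions: each firing latent costs log2(dict_size) bits
    to name plus `bits_per_coeff` bits for a quantised coefficient. The
    paper's exact coding scheme may differ; this is only a sketch.
    """
    active = (codes != 0).sum(axis=1)            # L0 per activation
    index_bits = active * np.log2(dict_size)     # which latents fire
    coeff_bits = active * bits_per_coeff         # how strongly they fire
    return float((index_bits + coeff_bits).mean())

# Toy comparison on random codes with hypothetical sizes: a narrow SAE with
# roughly 10 active latents out of 512 vs. a very wide SAE with roughly 8
# active latents out of 65,536.
rng = np.random.default_rng(0)
narrow = rng.random((128, 512)) * (rng.random((128, 512)) < 10 / 512)
wide = rng.random((128, 65536)) * (rng.random((128, 65536)) < 8 / 65536)

print(description_length_bits(narrow, dict_size=512))    # ~10 * (9 + 8)  = ~170 bits
print(description_length_bits(wide, dict_size=65536))    # ~8  * (16 + 8) = ~192 bits
```

Under a pure sparsity (L0) criterion the wider code looks strictly better, while the description-length view also charges for naming latents in a larger dictionary, which is the trade-off the abstract argues should govern SAE size.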
Related papers
- Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders [8.003244901104111]
We propose a regularization technique for improving feature learning by encouraging SAEs trained in parallel to learn similar features.
MFR can improve the reconstruction loss of SAEs by up to 21.21% on GPT-2 Small and 6.67% on EEG data.
arXiv Detail & Related papers (2024-11-02T11:42:23Z) - Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models [26.748765050034876]
Specialized Sparse Autoencoders (SSAEs) illuminate elusive dark matter features by focusing on specific subdomains.
We show that SSAEs effectively capture subdomain tail concepts, exceeding the capabilities of general-purpose SAEs.
We showcase the practical utility of SSAEs in a case study on the Bias in Bios dataset, where SSAEs achieve a 12.5% increase in worst-group classification accuracy when applied to remove spurious gender information.
arXiv Detail & Related papers (2024-11-01T17:09:34Z) - Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z) - Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
Sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space.
We build an open-source pipeline to generate and evaluate natural language explanations for SAE features.
Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z) - Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks.
We present one of the first applications of SAEs to dense text embeddings from large language models.
We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
arXiv Detail & Related papers (2024-08-01T15:46:22Z) - Semantic Loss Functions for Neuro-Symbolic Structured Prediction [74.18322585177832]
We discuss the semantic loss, which injects knowledge about such structure, defined symbolically, into training.
It is agnostic to the arrangement of the symbols, and depends only on the semantics expressed thereby.
It can be combined with both discriminative and generative neural models.
arXiv Detail & Related papers (2024-05-12T22:18:25Z) - Improving Dictionary Learning with Gated Sparse Autoencoders [8.3037652157611]
The Gated Sparse Autoencoder (Gated SAE) is a technique for unsupervised discovery of interpretable features in language models' (LMs) activations.
In standard SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage (a minimal sketch of the usual L1-penalised objective appears after this list).
In training SAEs on LMs of up to 7B parameters, Gated SAEs resolve shrinkage and require half as many firing features to achieve comparable reconstruction fidelity.
arXiv Detail & Related papers (2024-04-24T17:47:22Z) - Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code.
At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes.
We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z) - Improving Self-Supervised Learning by Characterizing Idealized Representations [155.1457170539049]
We prove necessary and sufficient conditions on representations for solving any task invariant to the given data augmentations.
For contrastive learning, our framework prescribes simple but significant improvements to previous methods.
For non-contrastive learning, we use our framework to derive a simple and novel objective.
arXiv Detail & Related papers (2022-09-13T18:01:03Z) - Self-Supervised Learning Disentangled Group Representation as Feature [82.07737719232972]
We show that existing Self-Supervised Learning (SSL) only disentangles simple augmentation features such as rotation and colorization.
We propose an iterative SSL algorithm: Iterative Partition-based Invariant Risk Minimization (IP-IRM)
We prove that IP-IRM converges to a fully disentangled representation and show its effectiveness on various benchmarks.
arXiv Detail & Related papers (2021-10-28T16:12:33Z) - Discovering "Semantics" in Super-Resolution Networks [54.45509260681529]
Super-resolution (SR) is a fundamental and representative task in low-level vision.
It is generally thought that the features extracted from the SR network have no specific semantic information.
Can we find any "semantics" in SR networks?
arXiv Detail & Related papers (2021-08-01T09:12:44Z)
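For context on the L1-induced shrinkage mentioned in the Gated SAE entry above, here is a minimal sketch of the standard sparse-autoencoder objective that work of this kind modifies. The module, dimensions, and sparsity coefficient are illustrative assumptions, not any particular paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VanillaSAE(nn.Module):
    """Plain SAE: ReLU encoder, linear decoder, trained with MSE + L1."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        codes = F.relu(self.encoder(x))   # sparse, non-negative feature activations
        recon = self.decoder(codes)       # reconstruction of the input activation
        return recon, codes

def sae_loss(x, recon, codes, l1_coeff: float = 1e-3) -> torch.Tensor:
    # The L1 term rewards shrinking every coefficient toward zero, which is the
    # systematic underestimation ("shrinkage") that Gated SAEs aim to remove.
    return F.mse_loss(recon, x) + l1_coeff * codes.abs().sum(dim=-1).mean()

# Toy usage on random vectors standing in for a language model's activations.
sae = VanillaSAE(d_model=256, d_dict=4096)
x = torch.randn(32, 256)
recon, codes = sae(x)
sae_loss(x, recon, codes).backward()
```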