Related papers: Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders

URL: http://arxiv.org/abs/2411.01220v2
Date: Wed, 06 Nov 2024 08:42:09 GMT
Title: Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
Authors: Luke Marks, Alasdair Paren, David Krueger, Fazl Barez,
Abstract summary: We propose a regularization technique for improving feature learning by encouraging SAEs trained in parallel to learn similar features. textscMFR can improve the reconstruction loss of SAEs by up to 21.21% on GPT-2 Small, and 6.67% on EEG data.
Score: 8.003244901104111
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse Autoencoders (SAEs) have shown promise in improving the interpretability of neural network activations, but can learn features that are not features of the input, limiting their effectiveness. We propose \textsc{Mutual Feature Regularization} \textbf{(MFR)}, a regularization technique for improving feature learning by encouraging SAEs trained in parallel to learn similar features. We motivate \textsc{MFR} by showing that features learned by multiple SAEs are more likely to correlate with features of the input. By training on synthetic data with known features of the input, we show that \textsc{MFR} can help SAEs learn those features, as we can directly compare the features learned by the SAE with the input features for the synthetic data. We then scale \textsc{MFR} to SAEs that are trained to denoise electroencephalography (EEG) data and SAEs that are trained to reconstruct GPT-2 Small activations. We show that \textsc{MFR} can improve the reconstruction loss of SAEs by up to 21.21\% on GPT-2 Small, and 6.67\% on EEG data. Our results suggest that the similarity between features learned by different SAEs can be leveraged to improve SAE training, thereby enhancing performance and the usefulness of SAEs for model interpretability.

Related papers

Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning [53.398270878295754]
Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs)<n>We suggest categorizing tokens within each corpus into two parts -- positive and negative tokens -- based on whether they are useful to improve model performance.<n>We conduct experiments on well-established benchmarks, finding that this forgetting mechanism not only improves overall model performance and also facilitate more diverse model responses.
arXiv Detail & Related papers (2025-08-06T11:22:23Z)
FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies [3.709351921096894]
We propose FaithfulSAE, a method that trains SAEs on the model's own synthetic dataset.<n>We demonstrate that training SAEs on less-OOD instruction datasets results in SAEs being more stable across seeds.
arXiv Detail & Related papers (2025-06-21T10:18:25Z)
Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders [50.52694757593443]
Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations.<n>We first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability.<n>We introduce a new SAE training algorithm based on bias adaptation'', a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity.
arXiv Detail & Related papers (2025-06-16T20:58:05Z)
Ensembling Sparse Autoencoders [10.81463830315253]
Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features.<n>We propose to ensemble multiple SAEs through naive bagging and boosting.<n>Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability.
arXiv Detail & Related papers (2025-05-21T23:31:21Z)
Tokenized SAEs: Disentangling SAE Reconstructions [0.9821874476902969]
We show that RES-JB SAE features predominantly correspond to simple input statistics.<n>We propose a method that disentangles token reconstruction from feature reconstruction.
arXiv Detail & Related papers (2025-02-24T17:04:24Z)
Disentangling CLIP Features for Enhanced Localized Understanding [58.73850193789384]
We propose Unmix-CLIP, a novel framework designed to reduce mutual feature information (MFI) and improve feature disentanglement. For the COCO- 14 dataset, Unmix-CLIP reduces feature similarity by 24.9%.
arXiv Detail & Related papers (2025-02-05T08:20:31Z)
Training Strategies for Isolated Sign Language Recognition [72.27323884094953]
This paper introduces a comprehensive model training pipeline for Isolated Sign Language Recognition. The constructed pipeline incorporates carefully selected image and video augmentations to tackle the challenges of low data quality and varying sign speeds. We achieve a state-of-the-art result on the WLASL and Slovo benchmarks with 1.63% and 14.12% improvements compared to the previous best solution.
arXiv Detail & Related papers (2024-12-16T08:37:58Z)
Understanding the Role of Equivariance in Self-supervised Learning [51.56331245499712]
equivariant self-supervised learning (E-SSL) learns features to be augmentation-aware. We identify a critical explaining-away effect in E-SSL that creates a synergy between the equivariant and classification tasks. We reveal several principles for practical designs of E-SSL.
arXiv Detail & Related papers (2024-11-10T16:09:47Z)
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space. We build an open-source pipeline to generate and evaluate natural language explanations for SAE features. Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z)
Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs [0.0]
We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms. We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity.
arXiv Detail & Related papers (2024-10-15T01:38:03Z)
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models [18.77400885091398]
We propose to measure progress in interpretable dictionary learning by working in the setting of LMs trained on chess and Othello transcripts. We introduce a new SAE training technique, $textitp-annealing$, which improves performance on prior unsupervised metrics as well as our new metrics.
arXiv Detail & Related papers (2024-07-31T18:45:13Z)
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning [0.9374652839580183]
Identifying the features learned by neural networks is a core challenge in mechanistic interpretability. We propose end-to-end sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important. We explore geometric and qualitative differences between e2e SAE features and standard SAE features.
arXiv Detail & Related papers (2024-05-17T17:03:46Z)
Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features. Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process. We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
RaSa: Relation and Sensitivity Aware Representation Learning for Text-based Person Search [51.09723403468361]
We propose a Relation and Sensitivity aware representation learning method (RaSa) RaSa includes two novel tasks: Relation-Aware learning (RA) and Sensitivity-Aware learning (SA) Experiments demonstrate that RaSa outperforms existing state-of-the-art methods by 6.94%, 4.45% and 15.35% in terms of Rank@1 on datasets.
arXiv Detail & Related papers (2023-05-23T03:53:57Z)
Toward Understanding the Feature Learning Process of Self-supervised Contrastive Learning [43.504548777955854]
We study how contrastive learning learns the feature representations for neural networks by analyzing its feature learning process. We prove that contrastive learning using textbfReLU networks provably learns the desired sparse features if proper augmentations are adopted.
arXiv Detail & Related papers (2021-05-31T16:42:09Z)
Lightweight Single-Image Super-Resolution Network with Attentive Auxiliary Feature Learning [73.75457731689858]
We develop a computation efficient yet accurate network based on the proposed attentive auxiliary features (A$2$F) for SISR. Experimental results on large-scale dataset demonstrate the effectiveness of the proposed model against the state-of-the-art (SOTA) SR methods.
arXiv Detail & Related papers (2020-11-13T06:01:46Z)
Probing Linguistic Features of Sentence-Level Representations in Neural Relation Extraction [80.38130122127882]
We introduce 14 probing tasks targeting linguistic properties relevant to neural relation extraction (RE) We use them to study representations learned by more than 40 different encoder architecture and linguistic feature combinations trained on two datasets. We find that the bias induced by the architecture and the inclusion of linguistic features are clearly expressed in the probing task performance.
arXiv Detail & Related papers (2020-04-17T09:17:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.