Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
- URL: http://arxiv.org/abs/2409.04478v1
- Date: Thu, 5 Sep 2024 18:00:37 GMT
- Title: Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
- Authors: Maheep Chaudhary, Atticus Geiger
- Abstract summary: We evaluate whether SAEs trained on hidden representations of GPT-2 small have sets of features that mediate knowledge of which country a city is in and which continent it is in.
Our results show that SAEs struggle to reach the neuron baseline, and none come close to the DAS skyline.
- Score: 6.306964287762374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A popular new method in mechanistic interpretability is to train high-dimensional sparse autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of analysis. However, the body of evidence on whether SAE feature spaces are useful for causal analysis is underdeveloped. In this work, we use the RAVEL benchmark to evaluate whether SAEs trained on hidden representations of GPT-2 small have sets of features that separately mediate knowledge of which country a city is in and which continent it is in. We evaluate four open-source SAEs for GPT-2 small against each other, with neurons serving as a baseline, and linear features learned via distributed alignment search (DAS) serving as a skyline. For each, we learn a binary mask to select features that will be patched to change the country of a city without changing the continent, or vice versa. Our results show that SAEs struggle to reach the neuron baseline, and none come close to the DAS skyline. We release code here: https://github.com/MaheepChaudhary/SAE-Ravel
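As a rough illustration of the procedure the abstract describes, the sketch below shows a binary-mask interchange patch over SAE features: features selected by a learned mask are copied from a counterfactual source prompt (e.g. a different city) while the rest are kept from the base prompt. The toy SAE, random activations, and helper names here are stand-ins of ours, not the authors' released code; see the linked repository for the real implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: maps activations to a sparse feature space and back."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)

def interchange_patch(sae, base_act, source_act, mask_logits):
    """Splice masked SAE features from a counterfactual source activation.

    A relaxed binary mask selects the features hypothesized to mediate the
    target attribute (e.g. country); those are taken from the source
    prompt's activation, everything else from the base prompt's.
    """
    base_f, source_f = sae.encode(base_act), sae.encode(source_act)
    mask = torch.sigmoid(mask_logits)        # learned, pushed toward 0/1
    patched_f = mask * source_f + (1 - mask) * base_f
    return sae.decode(patched_f)

# Hypothetical usage with random stand-ins for GPT-2 small activations
# of a base prompt ("Paris") and a source prompt ("Tokyo").
d_model, d_features = 768, 24576
sae = SparseAutoencoder(d_model, d_features)
mask_logits = nn.Parameter(torch.zeros(d_features))
base_act, source_act = torch.randn(1, d_model), torch.randn(1, d_model)
patched_act = interchange_patch(sae, base_act, source_act, mask_logits)
# In the evaluation, the patched activation is run through the rest of the
# model, and mask_logits are trained so that the country prediction flips
# while the continent prediction is unchanged (or vice versa).
```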
Related papers
- Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders [8.003244901104111]
We propose a regularization technique for improving feature learning by encouraging SAEs trained in parallel to learn similar features.
MFR can improve the reconstruction loss of SAEs by up to 21.21% on GPT-2 Small, and 6.67% on EEG data.
arXiv Detail & Related papers (2024-11-02T11:42:23Z)
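The MFR entry above does not spell out the regularizer's exact form, so the following is only one plausible reading, assuming a penalty on mismatched decoder directions between two SAEs trained in parallel; the names and the cosine-similarity form are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def mutual_feature_penalty(dec_a: torch.Tensor, dec_b: torch.Tensor) -> torch.Tensor:
    """Encourage two parallel SAEs to learn similar features (illustrative).

    dec_a, dec_b: decoder weights of shape (d_model, d_features), whose
    columns are feature directions. Each feature of SAE A is matched to
    its most similar feature in SAE B; low similarity is penalized.
    """
    a = F.normalize(dec_a, dim=0)
    b = F.normalize(dec_b, dim=0)
    sims = a.T @ b                    # (feat_a, feat_b) cosine similarities
    best = sims.max(dim=1).values     # best counterpart for each A-feature
    return (1.0 - best).mean()        # zero when every feature has a twin

# Sketched combined objective for one of the parallel SAEs:
#   loss = reconstruction_mse + l1_coeff * sparsity
#        + reg_coeff * mutual_feature_penalty(W_dec_a, W_dec_b)
```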
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
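For context on what training one SAE per layer involves, here is a minimal generic training step: reconstruct cached activations under an L1 sparsity penalty. This plain L1-penalized SAE and the toy dimensions are ours for illustration; Llama Scope's actual architecture, objective, and scale (32K or 128K features per layer) differ.

```python
import torch
import torch.nn as nn

class SAE(nn.Module):
    """Generic sparse autoencoder over a model's hidden activations."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))
        return self.dec(feats), feats

def training_step(sae, acts, optimizer, l1_coeff=1e-3):
    """One step: reconstruct activations while keeping feature use sparse."""
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy dimensions; one SAE of this kind would be trained per layer/sublayer.
sae = SAE(d_model=512, d_features=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)   # stand-in for cached layer activations
training_step(sae, acts, opt)
```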
- Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
Sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space.
We build an open-source pipeline to generate and evaluate natural language explanations for SAE features.
Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z)
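A hedged sketch of the first half of the explanation pipeline described above: collect the contexts that most strongly activate a feature and turn them into a prompt for an explainer LLM. The function names and prompt wording are invented for illustration and are not the pipeline's actual API.

```python
import torch

def top_activating_contexts(feature_acts, tokens, k=10, window=8):
    """Return the k contexts whose center token most activates one feature.

    feature_acts: (n_tokens,) activations of a single SAE feature.
    tokens: decoded token strings aligned with feature_acts.
    """
    top = torch.topk(feature_acts, k).indices.tolist()
    contexts = []
    for i in top:
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        contexts.append("".join(tokens[lo:hi]))
    return contexts

def build_explanation_prompt(contexts):
    """Ask an explainer LLM to describe what the feature responds to."""
    examples = "\n".join(f"- {c}" for c in contexts)
    return (
        "The following snippets all strongly activate one latent feature "
        "of a language model. In one sentence, describe what the feature "
        f"responds to:\n{examples}"
    )

tokens = ["The", " cat", " sat", " on", " the", " mat", "."]
acts = torch.tensor([0.1, 2.5, 0.0, 0.3, 0.1, 1.9, 0.0])
prompt = build_explanation_prompt(top_activating_contexts(acts, tokens, k=2, window=2))
# The explanation returned by the LLM is then scored, e.g. by how well it
# predicts the feature's activations on held-out text.
```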
- Efficient Dictionary Learning with Switch Sparse Autoencoders [8.577217344304072]
We introduce Switch Sparse Autoencoders, a novel SAE architecture aimed at reducing the compute cost of training SAEs.
Inspired by sparse mixture of experts models, Switch SAEs route activation vectors between smaller "expert" SAEs.
We find that Switch SAEs deliver a substantial improvement in the reconstruction vs. sparsity frontier for a given fixed training compute budget.
arXiv Detail & Related papers (2024-10-10T17:59:11Z)
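To make the routing idea in the Switch SAE entry concrete, here is a minimal top-1 switch over small expert SAEs, in the spirit of switch-style mixture-of-experts. It is a structural sketch only; the paper's actual architecture and training details (such as weighting expert outputs by the router probability so the router receives gradients) are not reproduced.

```python
import torch
import torch.nn as nn

class SwitchSAE(nn.Module):
    """Illustrative switch-style SAE: each activation vector is routed to
    exactly one smaller 'expert' SAE instead of one monolithic SAE."""
    def __init__(self, d_model: int, d_features: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.enc = nn.ModuleList(nn.Linear(d_model, d_features) for _ in range(n_experts))
        self.dec = nn.ModuleList(nn.Linear(d_features, d_model) for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        expert = self.router(x).argmax(dim=-1)   # top-1 expert per vector
        out = torch.empty_like(x)
        for e in range(len(self.enc)):
            sel = expert == e
            if sel.any():
                feats = torch.relu(self.enc[e](x[sel]))
                out[sel] = self.dec[e](feats)
        return out

# Eight experts of 4K features stand in for one dense 32K-feature SAE;
# each input only pays the compute cost of a single expert.
recon = SwitchSAE(d_model=768, d_features=4096, n_experts=8)(torch.randn(32, 768))
```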
- Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning [0.9374652839580183]
Identifying the features learned by neural networks is a core challenge in mechanistic interpretability.
We propose end-to-end sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important.
We explore geometric and qualitative differences between e2e SAE features and standard SAE features.
arXiv Detail & Related papers (2024-05-17T17:03:46Z)
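The "functionally important" idea above can be sketched as a loss choice: rather than only matching activations, train the SAE so that splicing its reconstruction back into the network preserves the model's output distribution. The KL form below and the helper forward_from_layer are hypothetical illustrations; the paper's exact objectives differ in detail.

```python
import torch
import torch.nn.functional as F

def end_to_end_loss(clean_logits: torch.Tensor, patched_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence between the model's original output distribution and
    its distribution when SAE reconstructions replace the activations."""
    return F.kl_div(
        F.log_softmax(patched_logits, dim=-1),
        F.softmax(clean_logits, dim=-1),
        reduction="batchmean",
    )

# Sketch of one training step (forward_from_layer is a hypothetical helper
# that runs the remainder of the network from a given layer's activations):
#   clean_logits   = forward_from_layer(acts)
#   recon, feats   = sae(acts)
#   patched_logits = forward_from_layer(recon)
#   loss = end_to_end_loss(clean_logits, patched_logits)
#          + sparsity_coeff * feats.abs().mean()
```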
- RaSa: Relation and Sensitivity Aware Representation Learning for Text-based Person Search [51.09723403468361]
We propose a Relation and Sensitivity aware representation learning method (RaSa)
RaSa includes two novel tasks: Relation-Aware learning (RA) and Sensitivity-Aware learning (SA)
Experiments demonstrate that RaSa outperforms existing state-of-the-art methods by 6.94%, 4.45%, and 15.35% in terms of Rank@1 on three benchmark datasets, respectively.
arXiv Detail & Related papers (2023-05-23T03:53:57Z)
- Adaptive Reordering Sampler with Neurally Guided MAGSAC [63.139445467355934]
We propose a new sampler for robust estimators that always selects the sample with the highest probability of consisting only of inliers.
After every unsuccessful iteration, the inlier probabilities are updated in a principled way via a Bayesian approach.
We introduce a new loss that exploits, in a geometrically justifiable manner, the orientation and scale that can be estimated for any type of feature.
arXiv Detail & Related papers (2021-11-28T10:16:38Z)
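As a flavor of the sampler described above: keep a per-point belief of being an inlier, always draw the minimal sample with the highest all-inlier probability, and update the beliefs after each failed hypothesis. The Beta-Bernoulli update below is a simplification chosen purely for illustration; the paper's actual Bayesian update and scoring are more involved.

```python
import numpy as np

def update_inlier_probs(alpha, beta, sampled_idx, model_failed):
    """Illustrative Beta-Bernoulli update of per-point inlier beliefs.

    Each point keeps a Beta(alpha, beta) posterior over being an inlier;
    a failed hypothesis counts as evidence against the sampled points.
    """
    if model_failed:
        beta[sampled_idx] += 1.0
    else:
        alpha[sampled_idx] += 1.0
    return alpha / (alpha + beta)   # posterior means

def next_sample(probs, sample_size):
    """Greedy version of 'most likely all-inlier sample': simply take the
    points with the highest current inlier probabilities."""
    return np.argsort(-probs)[:sample_size]

n_points = 200
alpha, beta = np.ones(n_points), np.ones(n_points)   # uniform prior
probs = alpha / (alpha + beta)
idx = next_sample(probs, sample_size=7)   # e.g. 7 correspondences for F-matrix
probs = update_inlier_probs(alpha, beta, idx, model_failed=True)
```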
- Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
- Goal-Oriented Gaze Estimation for Zero-Shot Learning [62.52340838817908]
We introduce a novel goal-oriented gaze estimation module (GEM) to improve the discriminative attribute localization.
We aim to predict the actual human gaze location to get the visual attention regions for recognizing a novel object guided by attribute description.
This work implies the promising benefits of collecting human gaze dataset and automatic gaze estimation algorithms on high-level computer vision tasks.
arXiv Detail & Related papers (2021-03-05T02:14:57Z)
- OpenStreetMap: Challenges and Opportunities in Machine Learning and Remote Sensing [66.23463054467653]
We present a review of recent methods based on machine learning to improve and use OpenStreetMap data.
We believe that OSM can change the way we interpret remote sensing data and that the synergy with machine learning can scale participatory map making.
arXiv Detail & Related papers (2020-07-13T09:58:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.