SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders
- URL: http://arxiv.org/abs/2410.07456v1
- Date: Wed, 9 Oct 2024 21:42:39 GMT
- Title: SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders
- Authors: Constantin Venhoff, Anisoara Calinescu, Philip Torr, Christian Schroeder de Witt
- Abstract summary: We introduce SAGE: Scalable Autoencoder Ground-truth Evaluation, a ground truth evaluation framework for SAEs.
We demonstrate that our method can automatically identify task-specific activations and compute ground truth features at these points.
Our framework paves the way for generalizable, large-scale evaluations of SAEs in interpretability research.
- Score: 7.065809768803578
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key challenge in interpretability is to decompose model activations into meaningful features. Sparse autoencoders (SAEs) have emerged as a promising tool for this task. However, a central problem in evaluating the quality of SAEs is the absence of ground truth features to serve as an evaluation gold standard. Current evaluation methods for SAEs therefore face a significant trade-off: they can either leverage toy models or other proxies with predefined ground truth features, or they can rely on extensive prior knowledge of realistic task circuits. The former limits the generalizability of the evaluation results, while the latter limits the range of models and tasks that can be used for evaluations. We introduce SAGE: Scalable Autoencoder Ground-truth Evaluation, a ground truth evaluation framework for SAEs that scales to large state-of-the-art SAEs and models. We demonstrate that our method can automatically identify task-specific activations and compute ground truth features at these points. Compared to previous methods, we reduce the training overhead by introducing a novel reconstruction method that allows residual stream SAEs to be applied to sublayer activations. This eliminates the need for SAEs trained on every task-specific activation location. We then validate the scalability of our framework by evaluating SAEs on novel tasks on Pythia-70M, GPT-2 Small, and Gemma-2-2B. Our framework therefore paves the way for generalizable, large-scale evaluations of SAEs in interpretability research.
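The abstract outlines the generic setup (an SAE decomposing model activations into sparse features, evaluated against computed ground truth features) but does not detail SAGE's architecture or its sublayer-reconstruction method. The sketch below is therefore only a minimal, hypothetical PyTorch illustration of that generic setup under assumed names and shapes: a ReLU sparse autoencoder over activations, plus a toy cosine-similarity comparison of learned decoder directions against stand-in ground-truth feature directions. It is not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Generic ReLU sparse autoencoder over transformer activations.

    A minimal sketch of the kind of SAE discussed in the abstract; SAGE's
    actual architecture and its sublayer-reconstruction method are not
    specified here and may differ.
    """

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, activations: torch.Tensor) -> torch.Tensor:
        # Non-negative (and ideally sparse) feature activations.
        return torch.relu(self.encoder(activations))

    def forward(self, activations: torch.Tensor):
        features = self.encode(activations)
        reconstruction = self.decoder(features)
        return reconstruction, features


def ground_truth_agreement(sae: SparseAutoencoder,
                           ground_truth_dirs: torch.Tensor) -> torch.Tensor:
    """Toy evaluation: for each hypothetical ground-truth feature direction,
    report the cosine similarity of its best-matching SAE decoder direction."""
    # Rows of decoder.weight.T are the learned feature directions in d_model space.
    decoder_dirs = F.normalize(sae.decoder.weight.T, dim=-1)   # (d_features, d_model)
    gt_dirs = F.normalize(ground_truth_dirs, dim=-1)           # (n_gt, d_model)
    sims = gt_dirs @ decoder_dirs.T                            # (n_gt, d_features)
    return sims.max(dim=-1).values                             # best match per ground-truth feature


if __name__ == "__main__":
    sae = SparseAutoencoder(d_model=512, d_features=4096)
    acts = torch.randn(8, 512)        # placeholder for residual-stream activations
    recon, feats = sae(acts)
    gt = torch.randn(16, 512)         # placeholder for computed ground-truth features
    print(ground_truth_agreement(sae, gt).shape)  # torch.Size([16])
```

In practice, activations would be extracted from the model at the automatically identified task-specific locations and the SAE features compared against the ground truth features computed there; the random tensors above are placeholders for those quantities.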
Related papers
- Sparse Autoencoder Features for Classifications and Transferability [11.2185030332009]
We analyze Sparse Autoencoders (SAEs) for interpretable feature extraction from Large Language Models (LLMs).
Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations.
arXiv Detail & Related papers (2025-02-17T02:30:45Z) - Are Your LLMs Capable of Stable Reasoning? [38.03049704515947]
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks.
However, a significant discrepancy persists between benchmark performances and real-world applications.
We introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance.
We present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems.
arXiv Detail & Related papers (2024-12-17T18:12:47Z) - Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks [1.4565166775409717]
Sparse Autoencoders (SAEs) are an interpretability technique aimed at decomposing neural network activations into interpretable units.
We introduce a family of evaluations based on SHIFT, a downstream task from Marks et al.
We adapt SHIFT into an automated metric of SAE quality; this involves replacing the human annotator with an LLM.
We also introduce the Targeted Probe Perturbation (TPP) metric that quantifies an SAE's ability to disentangle similar concepts.
arXiv Detail & Related papers (2024-11-28T03:58:48Z) - Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models [26.748765050034876]
Specialized Sparse Autoencoders (SSAEs) illuminate elusive dark matter features by focusing on specific subdomains.
We show that SSAEs effectively capture subdomain tail concepts, exceeding the capabilities of general-purpose SAEs.
We showcase the practical utility of SSAEs in a case study on the Bias in Bios dataset, where SSAEs achieve a 12.5% increase in worst-group classification accuracy when applied to remove spurious gender information.
arXiv Detail & Related papers (2024-11-01T17:09:34Z) - Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z) - MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains: Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Tracking Object Positions in Reinforcement Learning: A Metric for Keypoint Detection (extended version) [5.467140383171385]
Reinforcement learning (RL) for robot control typically requires a detailed representation of the environment state.
Keypoint detectors, such as spatial autoencoders (SAEs), are a common approach to extracting a low-dimensional representation from high-dimensional image data.
arXiv Detail & Related papers (2023-12-01T13:56:28Z) - RLSAC: Reinforcement Learning enhanced Sample Consensus for End-to-End Robust Estimation [74.47709320443998]
We propose RLSAC, a novel Reinforcement Learning enhanced SAmple Consensus framework for end-to-end robust estimation.
RLSAC employs a graph neural network that utilizes both data and memory features to guide the exploration of directions for sampling the next minimum set.
Our experimental results demonstrate that RLSAC can learn from features to gradually explore a better hypothesis.
arXiv Detail & Related papers (2023-08-10T03:14:19Z) - Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z) - SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.