SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders
- URL: http://arxiv.org/abs/2410.07456v1
- Date: Wed, 9 Oct 2024 21:42:39 GMT
- Title: SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders
- Authors: Constantin Venhoff, Anisoara Calinescu, Philip Torr, Christian Schroeder de Witt
- Abstract summary: We introduce SAGE: Scalable Autoencoder Ground-truth Evaluation, a ground truth evaluation framework for SAEs.
We demonstrate that our method can automatically identify task-specific activations and compute ground truth features at these points.
Our framework paves the way for generalizable, large-scale evaluations of SAEs in interpretability research.
- Score: 7.065809768803578
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key challenge in interpretability is to decompose model activations into meaningful features. Sparse autoencoders (SAEs) have emerged as a promising tool for this task. However, a central problem in evaluating the quality of SAEs is the absence of ground truth features to serve as an evaluation gold standard. Current evaluation methods for SAEs therefore face a significant trade-off: they can either leverage toy models or other proxies with predefined ground truth features, or they can rely on extensive prior knowledge of realistic task circuits. The former limits the generalizability of the evaluation results, while the latter limits the range of models and tasks that can be used for evaluations. We introduce SAGE: Scalable Autoencoder Ground-truth Evaluation, a ground truth evaluation framework for SAEs that scales to large state-of-the-art SAEs and models. We demonstrate that our method can automatically identify task-specific activations and compute ground truth features at these points. Compared to previous methods, we reduce the training overhead by introducing a novel reconstruction method that allows residual stream SAEs to be applied to sublayer activations. This eliminates the need for SAEs trained on every task-specific activation location. We then validate the scalability of our framework by evaluating SAEs on novel tasks on Pythia-70M, GPT-2 Small, and Gemma-2-2B. Our framework therefore paves the way for generalizable, large-scale evaluations of SAEs in interpretability research.
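The abstract outlines the generic setup (an SAE decomposing model activations into sparse features, evaluated against computed ground truth features) but does not detail SAGE's architecture or its sublayer-reconstruction method. The sketch below is therefore only a minimal, hypothetical PyTorch illustration of that generic setup under assumed names and shapes: a ReLU sparse autoencoder over activations, plus a toy cosine-similarity comparison of learned decoder directions against stand-in ground-truth feature directions. It is not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Generic ReLU sparse autoencoder over transformer activations.

    A minimal sketch of the kind of SAE discussed in the abstract; SAGE's
    actual architecture and its sublayer-reconstruction method are not
    specified here and may differ.
    """

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, activations: torch.Tensor) -> torch.Tensor:
        # Non-negative (and ideally sparse) feature activations.
        return torch.relu(self.encoder(activations))

    def forward(self, activations: torch.Tensor):
        features = self.encode(activations)
        reconstruction = self.decoder(features)
        return reconstruction, features


def ground_truth_agreement(sae: SparseAutoencoder,
                           ground_truth_dirs: torch.Tensor) -> torch.Tensor:
    """Toy evaluation: for each hypothetical ground-truth feature direction,
    report the cosine similarity of its best-matching SAE decoder direction."""
    # Rows of decoder.weight.T are the learned feature directions in d_model space.
    decoder_dirs = F.normalize(sae.decoder.weight.T, dim=-1)   # (d_features, d_model)
    gt_dirs = F.normalize(ground_truth_dirs, dim=-1)           # (n_gt, d_model)
    sims = gt_dirs @ decoder_dirs.T                            # (n_gt, d_features)
    return sims.max(dim=-1).values                             # best match per ground-truth feature


if __name__ == "__main__":
    sae = SparseAutoencoder(d_model=512, d_features=4096)
    acts = torch.randn(8, 512)        # placeholder for residual-stream activations
    recon, feats = sae(acts)
    gt = torch.randn(16, 512)         # placeholder for computed ground-truth features
    print(ground_truth_agreement(sae, gt).shape)  # torch.Size([16])
```

In practice, activations would be extracted from the model at the automatically identified task-specific locations and the SAE features compared against the ground truth features computed there; the random tensors above are placeholders for those quantities.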
Related papers
- Sparse Autoencoder Features for Classifications and Transferability [11.2185030332009]
We analyze Sparse Autoencoders (SAEs) for interpretable feature extraction from Large Language Models (LLMs).
Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations.
arXiv Detail & Related papers (2025-02-17T02:30:45Z) - Are Your LLMs Capable of Stable Reasoning? [38.03049704515947]
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks.
However, a significant discrepancy persists between benchmark performances and real-world applications.
We introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance.
We present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems.
arXiv Detail & Related papers (2024-12-17T18:12:47Z) - Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks [1.4565166775409717]
Sparse Autoencoders (SAEs) are an interpretability technique aimed at decomposing neural network activations into interpretable units.
We introduce a family of evaluations based on SHIFT, a downstream task from Marks et al.
We adapt SHIFT into an automated metric of SAE quality; this involves replacing the human annotator with an LLM.
We also introduce the Targeted Probe Perturbation (TPP) metric that quantifies an SAE's ability to disentangle similar concepts.
arXiv Detail & Related papers (2024-11-28T03:58:48Z) - Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models [26.748765050034876]
Specialized Sparse Autoencoders (SSAEs) illuminate elusive dark matter features by focusing on specific subdomains.
We show that SSAEs effectively capture subdomain tail concepts, exceeding the capabilities of general-purpose SAEs.
We showcase the practical utility of SSAEs in a case study on the Bias in Bios dataset, where SSAEs achieve a 12.5% increase in worst-group classification accuracy when applied to remove spurious gender information.
arXiv Detail & Related papers (2024-11-01T17:09:34Z) - Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
arXiv Detail & Related papers (2024-10-27T17:33:49Z) - MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains: Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Tracking Object Positions in Reinforcement Learning: A Metric for Keypoint Detection (extended version) [5.467140383171385]
Reinforcement learning (RL) for robot control typically requires a detailed representation of the environment state.
Keypoint detectors, such as spatial autoencoders (SAEs), are a common approach to extracting a low-dimensional representation from high-dimensional image data.
arXiv Detail & Related papers (2023-12-01T13:56:28Z) - RLSAC: Reinforcement Learning enhanced Sample Consensus for End-to-End Robust Estimation [74.47709320443998]
We propose RLSAC, a novel Reinforcement Learning enhanced SAmple Consensus framework for end-to-end robust estimation.
RLSAC employs a graph neural network that utilizes both data and memory features to guide the exploration of directions for sampling the next minimum set.
Our experimental results demonstrate that RLSAC can learn from features to gradually explore a better hypothesis.
arXiv Detail & Related papers (2023-08-10T03:14:19Z) - Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z) - SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.