Related papers: Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

URL: http://arxiv.org/abs/2405.08366v3
Date: Mon, 20 May 2024 17:46:14 GMT
Title: Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Authors: Aleksandar Makelov, George Lange, Neel Nanda,
Abstract summary: We propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against emphmagnitude feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or OpenWebText datasets.
Score: 43.860799289234755
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against \emph{supervised} feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. Second, we use the supervised dictionaries to develop and contextualize evaluations of unsupervised dictionaries along the same three axes. We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or OpenWebText datasets. We find that these SAEs capture interpretable features for the IOI task, but they are less successful than supervised features in controlling the model. Finally, we observe two qualitative phenomena in SAE training: feature occlusion (where a causally relevant concept is robustly overshadowed by even slightly higher-magnitude ones in the learned features), and feature over-splitting (where binary features split into many smaller, less interpretable features). We hope that our framework will provide a useful step towards more objective and grounded evaluations of sparse dictionary learning methods.

Related papers

Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models [16.894375498353092]
Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability. Existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries. We present Archetypal SAEs, wherein dictionary atoms are constrained to the convex hull of data.
arXiv Detail & Related papers (2025-02-18T14:29:11Z)
Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks. We present one of the first applications of SAEs to dense text embeddings from large language models. We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
arXiv Detail & Related papers (2024-08-01T15:46:22Z)
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning [0.9374652839580183]
Identifying the features learned by neural networks is a core challenge in mechanistic interpretability. We propose end-to-end sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important. We explore geometric and qualitative differences between e2e SAE features and standard SAE features.
arXiv Detail & Related papers (2024-05-17T17:03:46Z)
SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community. There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
Sentence Representation Learning with Generative Objective rather than Contrastive Objective [86.01683892956144]
We propose a novel generative self-supervised learning objective based on phrase reconstruction. Our generative learning achieves powerful enough performance improvement and outperforms the current state-of-the-art contrastive methods.
arXiv Detail & Related papers (2022-10-16T07:47:46Z)
AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models [7.386862225828819]
This work presents a novel dataset of naturally occurring sentences containing MWEs manually classified into a fine-grained set of meanings. We use this dataset in two tasks designed to test i) a language model's ability to detect idiom usage, and ii) the effectiveness of a language model in generating representations of sentences containing idioms.
arXiv Detail & Related papers (2021-09-09T16:53:17Z)
Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models. We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to be able to learn solely from bilingual text (bitext) Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)
Analysis and Evaluation of Language Models for Word Sense Disambiguation [18.001457030065712]
Transformer-based language models have taken many fields in NLP by storm. BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense. BERT and its derivatives dominate most of the existing evaluation benchmarks.
arXiv Detail & Related papers (2020-08-26T15:07:07Z)
Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOg, a standard benchmark for visual referring expression recognition. We show that 83.7% of test instances do not require reasoning on linguistic structure. We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.