A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
- URL: http://arxiv.org/abs/2409.14507v4
- Date: Mon, 30 Sep 2024 20:42:22 GMT
- Title: A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
- Authors: David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, Joseph Bloom
- Abstract summary: Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose the activations of Large Language Models (LLMs) into human-interpretable latents.
In this paper, we pose two questions. First, to what extent do SAEs extract monosemantic and interpretable latents?
Second, to what extent does varying the sparsity or the size of the SAE affect monosemanticity / interpretability?
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose the activations of Large Language Models (LLMs) into human-interpretable latents. In this paper, we pose two questions. First, to what extent do SAEs extract monosemantic and interpretable latents? Second, to what extent does varying the sparsity or the size of the SAE affect monosemanticity / interpretability? By investigating these questions in the context of a simple first-letter identification task where we have complete access to ground truth labels for all tokens in the vocabulary, we are able to provide more detail than prior investigations. Critically, we identify a problematic form of feature-splitting we call feature absorption where seemingly monosemantic latents fail to fire in cases where they clearly should. Our investigation suggests that varying SAE size or sparsity is insufficient to solve this issue, and that there are deeper conceptual issues in need of resolution.
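For readers less familiar with the setup, the following is a minimal sketch of the kind of sparse autoencoder studied here: a linear encoder with a ReLU nonlinearity, a linear decoder, and an L1 sparsity penalty on the latents. The architecture follows the common SAE recipe; the dimensions, penalty weight, and names are illustrative assumptions, not the authors' exact training configuration.

```python
# Minimal SAE sketch (PyTorch). Hyperparameters and names are illustrative
# assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_latent: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps latents non-negative; the L1 penalty below keeps them sparse.
        latents = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(latents)
        return reconstruction, latents


def sae_loss(activations, reconstruction, latents, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the latents.
    mse = torch.mean((activations - reconstruction) ** 2)
    sparsity = l1_coeff * latents.abs().sum(dim=-1).mean()
    return mse + sparsity
```

Feature absorption, as described in the abstract, is a failure mode of the resulting latents: a latent that appears to track a property such as "token starts with a given letter" stops firing on particular tokens because that part of the signal is absorbed into a more token-specific latent.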
Related papers
- Sparse Autoencoders Do Not Find Canonical Units of Analysis
A common goal of mechanistic interpretability is to decompose the activations of neural networks into features.
Sparse autoencoders (SAEs) are a popular method for finding these features.
We use two techniques: SAE stitching to show that SAE latents are incomplete, and meta-SAEs to show that they are not atomic.
arXiv Detail & Related papers (2025-02-07T12:33:08Z)
- Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Sparse autoencoders (SAEs) have gained a lot of attention as a promising tool to improve the interpretability of large language models (LLMs).
In this paper, we propose a suite of evaluations for SAEs to analyze the quality of monosemantic features by focusing on polysemous words.
arXiv Detail & Related papers (2025-01-09T02:54:19Z)
- Content-Style Learning from Unaligned Domains: Identifiability under Unknown Latent Dimensions
We introduce a new analytical framework via cross-domain latent distribution matching (LDM).
We show that restrictive assumptions such as component-wise independence of the latent variables can be removed.
We recast the LDM formulation into a regularized multi-domain GAN loss with coupled latent variables.
arXiv Detail & Related papers (2024-11-06T08:30:23Z)
- Interpreting Attention Layer Outputs with Sparse Autoencoders
Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability.
In this work we train SAEs on attention layer outputs and show that here, too, SAEs find a sparse, interpretable decomposition.
We show that Sparse Autoencoders are a useful tool that enables researchers to explain model behavior in greater detail than prior work.
arXiv Detail & Related papers (2024-06-25T17:43:13Z)
- Crafting Interpretable Embeddings by Asking LLMs Questions
Large language models (LLMs) have rapidly improved text embeddings for a growing array of natural-language processing tasks.
We introduce question-answering embeddings (QA-Emb), embeddings where each feature represents an answer to a yes/no question asked of an LLM.
We use QA-Emb to flexibly generate interpretable models for predicting fMRI voxel responses to language stimuli.
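As a rough illustration of the QA-Emb idea above, the sketch below builds an embedding with one dimension per yes/no question; the `ask_yes_no` callable and the example questions are hypothetical placeholders for an actual LLM query, not the paper's prompts or API.

```python
from typing import Callable, List


def qa_embedding(text: str,
                 questions: List[str],
                 ask_yes_no: Callable[[str, str], bool]) -> List[float]:
    # One interpretable dimension per question: 1.0 if the model answers "yes".
    # ask_yes_no(question, text) is assumed to wrap an LLM call supplied by the caller.
    return [1.0 if ask_yes_no(q, text) else 0.0 for q in questions]


# Hypothetical usage: each dimension is directly readable as the answer to a question.
questions = ["Does the text mention a person?", "Is the text about a place?"]
```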
arXiv Detail & Related papers (2024-05-26T22:30:29Z)
- Aligning Language Models to Explicitly Handle Ambiguity
We propose Alignment with Perceived Ambiguity (APA), a novel pipeline that aligns language models to deal with ambiguous queries.
Experimental results on question-answering datasets demonstrate that APA empowers LLMs to explicitly detect and manage ambiguous queries.
Our findings show that APA outperforms training with gold-standard labels, especially in out-of-distribution scenarios.
arXiv Detail & Related papers (2024-04-18T07:59:53Z)
- SEMQA: Semi-Extractive Multi-Source Question Answering
We introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion.
We create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions.
arXiv Detail & Related papers (2023-11-08T18:46:32Z)
- We're Afraid Language Models Aren't Modeling Ambiguity
Managing ambiguity is a key part of human language understanding.
We characterize ambiguity in a sentence by its effect on entailment relations with another sentence.
We show that a multilabel NLI model can flag political claims in the wild that are misleading due to ambiguity.
arXiv Detail & Related papers (2023-04-27T17:57:58Z)
- A Latent-Variable Model for Intrinsic Probing
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- Where and What? Examining Interpretable Disentangled Representations
Capturing interpretable variations has long been one of the goals in disentanglement learning.
Unlike the independence assumption, interpretability has rarely been exploited to encourage disentanglement in the unsupervised setting.
In this paper, we examine the interpretability of disentangled representations by investigating two questions: where to be interpreted and what to be interpreted.
arXiv Detail & Related papers (2021-04-07T11:22:02Z)
- Dive into Ambiguity: Latent Distribution Mining and Pairwise Uncertainty Estimation for Facial Expression Recognition
We propose a solution, named DMUE, to address the problem of annotation ambiguity from two perspectives.
For the former (latent distribution mining), an auxiliary multi-branch learning framework is introduced to better mine and describe the latent distribution in the label space.
For the latter (uncertainty estimation), the pairwise relationships of semantic features between instances are fully exploited to estimate the extent of ambiguity in the instance space.
arXiv Detail & Related papers (2021-04-01T03:21:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.