Sparse Autoencoders Find Highly Interpretable Features in Language
Models
- URL: http://arxiv.org/abs/2309.08600v3
- Date: Wed, 4 Oct 2023 13:17:38 GMT
- Title: Sparse Autoencoders Find Highly Interpretable Features in Language
Models
- Authors: Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
- Abstract summary: Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally.
We use sparse autoencoders to reconstruct the internal activations of a language model.
Our method may serve as a foundation for future mechanistic interpretability work.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the roadblocks to a better understanding of neural networks' internals
is \textit{polysemanticity}, where neurons appear to activate in multiple,
semantically distinct contexts. Polysemanticity prevents us from identifying
concise, human-understandable explanations for what neural networks are doing
internally. One hypothesised cause of polysemanticity is
\textit{superposition}, where neural networks represent more features than they
have neurons by assigning features to an overcomplete set of directions in
activation space, rather than to individual neurons. Here, we attempt to
identify those directions, using sparse autoencoders to reconstruct the
internal activations of a language model. These autoencoders learn sets of
sparsely activating features that are more interpretable and monosemantic than
directions identified by alternative approaches, where interpretability is
measured by automated methods. Moreover, we show that with our learned set of
features, we can pinpoint the features that are causally responsible for
counterfactual behaviour on the indirect object identification task
\citep{wang2022interpretability} to a finer degree than previous
decompositions. This work indicates that it is possible to resolve
superposition in language models using a scalable, unsupervised method. Our
method may serve as a foundation for future mechanistic interpretability work,
which we hope will enable greater model transparency and steerability.
Related papers
- Automatically Interpreting Millions of Features in Large Language Models [1.8035046415192353]
sparse autoencoders (SAEs) can be used to transform activations into a higher-dimensional latent space.
We build an open-source pipeline to generate and evaluate natural language explanations for SAE features.
Our large-scale analysis confirms that SAE latents are indeed much more interpretable than neurons.
arXiv Detail & Related papers (2024-10-17T17:56:01Z) - Using Degeneracy in the Loss Landscape for Mechanistic Interpretability [0.0]
Mechanistic Interpretability aims to reverse engineer the algorithms implemented by neural networks by studying their weights and activations.
An obstacle to reverse engineering neural networks is that many of the parameters inside a network are not involved in the computation being implemented by the network.
arXiv Detail & Related papers (2024-05-17T17:26:33Z) - Semantic Loss Functions for Neuro-Symbolic Structured Prediction [74.18322585177832]
We discuss the semantic loss, which injects knowledge about such structure, defined symbolically, into training.
It is agnostic to the arrangement of the symbols, and depends only on the semantics expressed thereby.
It can be combined with both discriminative and generative neural models.
arXiv Detail & Related papers (2024-05-12T22:18:25Z) - Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [55.19497659895122]
We introduce methods for discovering and applying sparse feature circuits.
These are causally implicatedworks of human-interpretable features for explaining language model behaviors.
arXiv Detail & Related papers (2024-03-28T17:56:07Z) - Identifying Interpretable Visual Features in Artificial and Biological
Neural Systems [3.604033202771937]
Single neurons in neural networks are often interpretable in that they represent individual, intuitively meaningful features.
Many neurons exhibit $textitmixed selectivity$, i.e., they represent multiple unrelated features.
We propose an automated method for quantifying visual interpretability and an approach for finding meaningful directions in network activation space.
arXiv Detail & Related papers (2023-10-17T17:41:28Z) - DISCOVER: Making Vision Networks Interpretable via Competition and
Dissection [11.028520416752325]
This work contributes to post-hoc interpretability, and specifically Network Dissection.
Our goal is to present a framework that makes it easier to discover the individual functionality of each neuron in a network trained on a vision task.
arXiv Detail & Related papers (2023-10-07T21:57:23Z) - Emergence of Machine Language: Towards Symbolic Intelligence with Neural
Networks [73.94290462239061]
We propose to combine symbolism and connectionism principles by using neural networks to derive a discrete representation.
By designing an interactive environment and task, we demonstrated that machines could generate a spontaneous, flexible, and semantic language.
arXiv Detail & Related papers (2022-01-14T14:54:58Z) - Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules.
inputs to the model are routed through a sequence of functions in a way that is end-to-end learned.
We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferrable to a new task in a sample efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z) - Low-Dimensional Structure in the Space of Language Representations is
Reflected in Brain Responses [62.197912623223964]
We show a low-dimensional structure where language models and translation models smoothly interpolate between word embeddings, syntactic and semantic tasks, and future word embeddings.
We find that this representation embedding can predict how well each individual feature space maps to human brain responses to natural language stimuli recorded using fMRI.
This suggests that the embedding captures some part of the brain's natural language representation structure.
arXiv Detail & Related papers (2021-06-09T22:59:12Z) - Compositional Explanations of Neurons [52.71742655312625]
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts.
We use this procedure to answer several questions on interpretability in models for vision and natural language processing.
arXiv Detail & Related papers (2020-06-24T20:37:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.