Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic
Interpretability: A Case Study on Othello-GPT
- URL: http://arxiv.org/abs/2402.12201v1
- Date: Mon, 19 Feb 2024 15:04:53 GMT
- Title: Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic
Interpretability: A Case Study on Othello-GPT
- Authors: Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng,
Xipeng Qiu
- Abstract summary: We propose a circuit discovery framework that is an alternative to activation patching.
Our framework suffers less from out-of-distribution issues and is more efficient in terms of asymptotic complexity.
We dig into a small transformer trained on the synthetic task of Othello and find a number of human-understandable, fine-grained circuits inside it.
- Score: 59.245414547751636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse dictionary learning has become a rapidly growing technique in
mechanistic interpretability for attacking superposition and extracting more
human-understandable features from model activations. Building on these more
monosemantic extracted features, we ask a further question: how do we recognize
circuits connecting the enormous number of dictionary features? We propose a
circuit discovery framework that is an alternative to activation patching. Our
framework suffers less from out-of-distribution issues and proves more
efficient in terms of asymptotic complexity. The basic units of our framework
are dictionary features decomposed from all modules writing to the residual
stream, including the embedding, attention outputs, and MLP outputs. Starting
from any logit, dictionary feature, or attention score, we can trace down to
lower-level dictionary features of all tokens and compute their contributions
to these more interpretable and local model behaviors. We dig into a small
transformer trained on the synthetic task of Othello and find a number of
human-understandable, fine-grained circuits inside it.
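Because every module writing to the residual stream is decomposed into sparse dictionary features, one simple way to compute a feature's contribution to a chosen logit is a linear read-off of the feature's decoder direction against the logit direction, with no counterfactual forward passes. The snippet below is a minimal sketch of that idea only; the names (decoder, feature_acts, W_U), shapes, and random weights are illustrative assumptions, not the authors' code or the paper's exact attribution rule.

```python
# Minimal sketch of patch-free attribution of one logit to residual-stream
# dictionary features. All shapes, weights, and names (decoder, feature_acts,
# W_U) are illustrative assumptions, not the authors' code.
import torch

d_model, n_features, vocab_size = 512, 4096, 61  # illustrative sizes

# Hypothetical decoder of a sparse dictionary trained on one module's write
# to the residual stream (e.g. an attention or MLP output).
decoder = torch.randn(n_features, d_model)          # feature directions
feature_acts = torch.relu(torch.randn(n_features))  # sparse activations at one position

# Unembedding matrix; we explain the logit of one target token.
W_U = torch.randn(d_model, vocab_size)
logit_dir = W_U[:, 7]  # direction producing the logit for token 7

# Because the decomposition into dictionary features is linear, each feature's
# contribution to the chosen logit is just a dot product -- no patching and no
# extra counterfactual forward passes are needed.
contributions = feature_acts * (decoder @ logit_dir)

top = torch.topk(contributions, k=5)
for idx, val in zip(top.indices.tolist(), top.values.tolist()):
    print(f"feature {idx}: contribution {val:+.3f}")
```

In the paper's framework this kind of trace is repeated recursively: each contributing dictionary feature is in turn attributed to lower-level dictionary features of all tokens.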
Related papers
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a post-processing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and can even incur heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z) - Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [55.19497659895122]
We introduce methods for discovering and applying sparse feature circuits.
These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors.
arXiv Detail & Related papers (2024-03-28T17:56:07Z) - Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that with this approach, we obtain increasing performance on the new words when they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z) - Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z) - Efficient CNN with uncorrelated Bag of Features pooling [98.78384185493624]
Bag of Features (BoF) has recently been proposed to reduce the complexity of convolutional layers.
We propose an approach that builds on top of BoF pooling to boost its efficiency by ensuring that the items of the learned dictionary are non-redundant.
The proposed strategy yields an efficient variant of BoF and further boosts its performance, without any additional parameters.
arXiv Detail & Related papers (2022-09-22T09:00:30Z) - Between words and characters: A Brief History of Open-Vocabulary
Modeling and Tokenization in NLP [22.772546707304766]
We show how hybrid approaches combining words and characters, as well as subword approaches based on learned segmentation, have been proposed and evaluated.
We conclude that there is no silver-bullet solution that works for all applications, and there likely never will be.
arXiv Detail & Related papers (2021-12-20T13:04:18Z) - PUDLE: Implicit Acceleration of Dictionary Learning by Backpropagation [4.081440927534577]
This paper offers the first theoretical proof for empirical results through PUDLE, a Provable Unfolded Dictionary LEarning method.
We highlight the minimization impact of loss, unfolding, and backpropagation on convergence.
We complement our findings through synthetic and image denoising experiments.
arXiv Detail & Related papers (2021-05-31T18:49:58Z) - Learning Deep Analysis Dictionaries -- Part II: Convolutional
Dictionaries [38.7315182732103]
We introduce a Deep Convolutional Analysis Dictionary Model (DeepCAM) by learning convolutional dictionaries instead of unstructured dictionaries.
An L-layer DeepCAM consists of L layers, each pairing a convolutional analysis dictionary with element-wise soft-thresholding (a minimal sketch of this structure follows the list below).
We demonstrate that DeepCAM is an effective multilayer convolutional model and, on single image super-resolution, achieves performance comparable with other methods.
arXiv Detail & Related papers (2020-01-31T19:02:10Z)
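The DeepCAM entry above describes a stack of convolutional analysis dictionaries, each followed by element-wise soft-thresholding. Below is a minimal sketch of that structure only, not the authors' implementation; the layer widths, kernel size, and threshold value are illustrative assumptions.

```python
# Sketch of an L-layer stack of (convolutional analysis dictionary,
# element-wise soft-thresholding) pairs. Widths and threshold are assumptions.
import torch
import torch.nn as nn


def soft_threshold(x: torch.Tensor, lam: float) -> torch.Tensor:
    """Element-wise shrinkage: sign(x) * max(|x| - lam, 0)."""
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)


class ConvAnalysisStack(nn.Module):
    def __init__(self, channels=(1, 32, 32, 64), lam=0.1):
        super().__init__()
        # Each Conv2d plays the role of a convolutional analysis dictionary.
        self.convs = nn.ModuleList(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        self.lam = lam

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for conv in self.convs:
            x = soft_threshold(conv(x), self.lam)  # analysis + shrinkage pair
        return x


# Usage on a dummy grayscale image batch.
model = ConvAnalysisStack()
out = model(torch.randn(2, 1, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```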
This list is automatically generated from the titles and abstracts of the papers in this site.