Partially Rewriting a Transformer in Natural Language
- URL: http://arxiv.org/abs/2501.18838v1
- Date: Fri, 31 Jan 2025 01:12:50 GMT
- Title: Partially Rewriting a Transformer in Natural Language
- Authors: Gonçalo Paulo, Nora Belrose,
- Abstract summary: We attempt to partially rewrite a large language model using simple natural language explanations.
We replace the first layer of this sparse with an LLM-based simulator, which predicts the activation of each neuron.
We measure the degree to which these modifications distort the model's final output.
- Score: 0.7234862895932991
- License:
- Abstract: The greatest ambition of mechanistic interpretability is to completely rewrite deep neural networks in a format that is more amenable to human understanding, while preserving their behavior and performance. In this paper, we attempt to partially rewrite a large language model using simple natural language explanations. We first approximate one of the feedforward networks in the LLM with a wider MLP with sparsely activating neurons - a transcoder - and use an automated interpretability pipeline to generate explanations for these neurons. We then replace the first layer of this sparse MLP with an LLM-based simulator, which predicts the activation of each neuron given its explanation and the surrounding context. Finally, we measure the degree to which these modifications distort the model's final output. With our pipeline, the model's increase in loss is statistically similar to entirely replacing the sparse MLP output with the zero vector. We employ the same protocol, this time using a sparse autoencoder, on the residual stream of the same layer and obtain similar results. These results suggest that more detailed explanations are needed to improve performance substantially above the zero ablation baseline.
Related papers
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Transcoders Find Interpretable LLM Feature Circuits [1.4254279830438588]
We introduce a novel method for using transcoders to perform circuit analysis through sublayers.
We train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability.
arXiv Detail & Related papers (2024-06-17T17:49:00Z) - Transformers need glasses! Information over-squashing in language tasks [18.81066657470662]
We study how information propagates in decoder-only Transformers.
We show that certain sequences of inputs to the Transformer can yield arbitrarily close representations in the final token.
We also show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input.
arXiv Detail & Related papers (2024-06-06T17:14:44Z) - Return of the RNN: Residual Recurrent Networks for Invertible Sentence
Embeddings [0.0]
This study presents a novel model for invertible sentence embeddings using a residual recurrent network trained on an unsupervised encoding task.
Rather than the probabilistic outputs common to neural machine translation models, our approach employs a regression-based output layer to reconstruct the input sequence's word vectors.
The model achieves high accuracy and fast training with the ADAM, a significant finding given that RNNs typically require memory units, such as LSTMs, or second-order optimization methods.
arXiv Detail & Related papers (2023-03-23T15:59:06Z) - Why do Nearest Neighbor Language Models Work? [93.71050438413121]
Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context.
Retrieval-augmented LMs have shown to improve over standard neural LMs, by accessing information retrieved from a large datastore.
arXiv Detail & Related papers (2023-01-07T11:12:36Z) - Transformer Feed-Forward Layers Build Predictions by Promoting Concepts
in the Vocabulary Space [49.029910567673824]
Transformer-based language models (LMs) are at the core of modern NLP, but their internal prediction construction process is opaque and largely not understood.
We make a substantial step towards unveiling this underlying prediction process, by reverse-engineering the operation of the feed-forward network (FFN) layers.
arXiv Detail & Related papers (2022-03-28T12:26:00Z) - Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixture-of-experts (MoEs) in both feature and input (tokens)
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
arXiv Detail & Related papers (2022-03-14T04:32:19Z) - Revisiting Simple Neural Probabilistic Language Models [27.957834093475686]
This paper revisits the neural probabilistic language model (NPLM) ofcitetBengio2003ANP.
When scaled up to modern hardware, this model performs much better than expected on word-level language model benchmarks.
Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM's local concatenation layer.
arXiv Detail & Related papers (2021-04-08T02:18:47Z) - Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z) - Glancing Transformer for Non-Autoregressive Neural Machine Translation [58.87258329683682]
We propose a method to learn word interdependency for single-pass parallel generation models.
With only single-pass parallel decoding, GLAT is able to generate high-quality translation with 8-15 times speedup.
arXiv Detail & Related papers (2020-08-18T13:04:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.