Flash Interpretability: Decoding Specialised Feature Neurons in Large Language Models with the LM-Head
- URL: http://arxiv.org/abs/2501.02688v2
- Date: Thu, 27 Feb 2025 21:31:36 GMT
- Title: Flash Interpretability: Decoding Specialised Feature Neurons in Large Language Models with the LM-Head
- Authors: Harry J Davies
- Abstract summary: We show that it is possible to decode neuron weights directly into token probabilities through the final projection layer of a large language model. This is illustrated in Llama 3.1 8B where we use the LM-head to find examples of specialised feature neurons such as a "dog" neuron and a "California" neuron.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) typically have billions of parameters and are thus often difficult to interpret in their operation. In this work, we demonstrate that it is possible to decode neuron weights directly into token probabilities through the final projection layer of the model (the LM-head). This is illustrated in Llama 3.1 8B, where we use the LM-head to find examples of specialised feature neurons such as a "dog" neuron and a "California" neuron, and we validate this by clamping these neurons to affect the probability of the concept in the output. We evaluate this method on both the pre-trained and Instruct models, finding that over 75% of neurons in the up-projection layers of the Instruct model have the same top associated token as in the pre-trained model. Finally, we demonstrate that clamping the "dog" neuron leads the Instruct model to always discuss dogs when asked about its favourite animal. Through our method, it is possible to map the top features of all of Llama 3.1 8B's up-projection neurons in less than 10 seconds, with minimal compute.
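A minimal sketch of the two steps described in the abstract (decoding a neuron's weight vector through the LM-head, then clamping that neuron) is given below, assuming the Hugging Face transformers implementation of Llama 3.1 8B. The checkpoint name, layer index, neuron index, clamp value, and the exact point at which the activation is clamped are illustrative assumptions, not details taken from the paper.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # assumed checkpoint name (gated on the Hub)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

layer_idx, neuron, top_k = 20, 1234, 5  # illustrative indices, not from the paper

with torch.no_grad():
    # Each row of up_proj is one neuron's weight vector and lives in the
    # residual-stream (hidden) space, so it can be pushed through the LM-head.
    w = model.model.layers[layer_idx].mlp.up_proj.weight[neuron]  # [hidden_size]
    logits = model.lm_head(w.unsqueeze(0))                        # [1, vocab_size]
    probs = torch.softmax(logits.float(), dim=-1)[0]
    top = probs.topk(top_k)

# Top tokens associated with this neuron's weight vector.
print([(tok.decode(int(i)), round(p.item(), 4)) for i, p in zip(top.indices, top.values)])

# Clamping sketch: force the chosen neuron's up-projection output to a fixed
# value during generation. Where exactly the paper clamps (before or after the
# SwiGLU gating) is not stated in the abstract, so this hook is an assumption.
def clamp_hook(module, args, output, neuron=neuron, value=10.0):
    output[..., neuron] = value
    return output

handle = model.model.layers[layer_idx].mlp.up_proj.register_forward_hook(clamp_hook)
inputs = tok("My favourite animal is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```
If the decoded neuron really does correspond to a concept such as "dog", the clamped generation should steer towards that concept, mirroring the favourite-animal experiment described in the abstract.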
Related papers
- Neuron Empirical Gradient: Connecting Neurons' Linear Controllability and Representational Capacity [14.693407823048478]
We first reveal the linear relationship between neuron activations and output token probabilities.
We investigate whether neuron empirical gradients in PLMs encode general task knowledge by probing skill neurons.
arXiv Detail & Related papers (2024-12-24T00:01:24Z)
- Interpreting the Second-Order Effects of Neurons in CLIP [73.54377859089801]
We interpret the function of individual neurons in CLIP by automatically describing them using text.
We present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output.
Our results indicate that an automated interpretation of neurons can be used for model deception and for introducing new model capabilities.
arXiv Detail & Related papers (2024-06-06T17:59:52Z)
- What Are Large Language Models Mapping to in the Brain? A Case Against Over-Reliance on Brain Scores [1.8175282137722093]
Internal representations from large language models (LLMs) achieve state-of-the-art brain scores, leading to speculation that they share computational principles with human language processing.
Here, we analyze three neural datasets used in an impactful study on LLM-to-brain mappings, with a particular focus on an fMRI dataset where participants read short passages.
We find that brain scores of trained LLMs on this dataset can largely be explained by sentence length, position, and pronoun-dereferenced static word embeddings.
arXiv Detail & Related papers (2024-06-03T17:13:27Z)
- Neuroformer: Multimodal and Multitask Generative Pretraining for Brain Data [3.46029409929709]
State-of-the-art systems neuroscience experiments yield large-scale multimodal data, and these data sets require new tools for analysis.
Inspired by the success of large pretrained models in vision and language domains, we reframe the analysis of large-scale, cellular-resolution neuronal spiking data into an autoregressive generation problem.
We first trained Neuroformer on simulated datasets, and found that it both accurately predicted simulated neuronal circuit activity, and also intrinsically inferred the underlying neural circuit connectivity, including direction.
arXiv Detail & Related papers (2023-10-31T20:17:32Z)
- WaLiN-GUI: a graphical and auditory tool for neuron-based encoding [73.88751967207419]
Neuromorphic computing relies on spike-based, energy-efficient communication.
We develop a tool to identify suitable configurations for neuron-based encoding of sample-based data into spike trains.
The WaLiN-GUI is provided open source and with documentation.
arXiv Detail & Related papers (2023-10-25T20:34:08Z)
- The Expressive Leaky Memory Neuron: an Efficient and Expressive Phenomenological Neuron Model Can Solve Long-Horizon Tasks [64.08042492426992]
We introduce the Expressive Leaky Memory (ELM) neuron model, a biologically inspired model of a cortical neuron.
Our ELM neuron can accurately match the aforementioned input-output relationship with under ten thousand trainable parameters.
We evaluate it on various tasks with demanding temporal structures, including the Long Range Arena (LRA) datasets.
arXiv Detail & Related papers (2023-06-14T13:34:13Z)
- Neuron to Graph: Interpreting Language Model Neurons at Scale [8.32093320910416]
This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within Large Language Models.
We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron's behaviour from the dataset it was trained on and translates it into an interpretable graph.
arXiv Detail & Related papers (2023-05-31T14:44:33Z)
- Finding Neurons in a Haystack: Case Studies with Sparse Probing [2.278231643598956]
Internal computations of large language models (LLMs) remain opaque and poorly understood.
We train $k$-sparse linear classifiers to predict the presence of features in the input.
By varying the value of $k$ we study the sparsity of learned representations and how this varies with model scale.
arXiv Detail & Related papers (2023-05-02T17:13:55Z)
- N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models [0.0]
N2G is a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples to an interpretable graph.
We use truncation and saliency methods to only present the important tokens, and augment the dataset examples with more diverse samples to better capture the extent of neuron behaviour.
These graphs can be visualised to aid manual interpretation by researchers, but can also output token activations on text to compare to the neuron's ground truth activations for automatic validation.
arXiv Detail & Related papers (2023-04-22T19:06:13Z)
- Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding [82.46024259137823]
We propose a cross-model comparative loss for a broad range of tasks.
We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks.
arXiv Detail & Related papers (2023-01-10T03:04:27Z)
- Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z)
- The Neural Coding Framework for Learning Generative Models [91.0357317238509]
We propose a novel neural generative model inspired by the theory of predictive processing in the brain.
In a similar way, artificial neurons in our generative model predict what neighboring neurons will do, and adjust their parameters based on how well the predictions matched reality.
arXiv Detail & Related papers (2020-12-07T01:20:38Z)
- Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy.
We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z)