Finding Neurons in a Haystack: Case Studies with Sparse Probing
- URL: http://arxiv.org/abs/2305.01610v2
- Date: Fri, 2 Jun 2023 21:52:17 GMT
- Title: Finding Neurons in a Haystack: Case Studies with Sparse Probing
- Authors: Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii
Troitskii, Dimitris Bertsimas
- Abstract summary: Internal computations of large language models (LLMs) remain opaque and poorly understood.
We train $k$-sparse linear classifiers to predict the presence of features in the input.
By varying the value of $k$ we study the sparsity of learned representations and how this varies with model scale.
- Score: 2.278231643598956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite rapid adoption and deployment of large language models (LLMs), the
internal computations of these models remain opaque and poorly understood. In
this work, we seek to understand how high-level human-interpretable features
are represented within the internal neuron activations of LLMs. We train
$k$-sparse linear classifiers (probes) on these internal activations to predict
the presence of features in the input; by varying the value of $k$ we study the
sparsity of learned representations and how this varies with model scale. With
$k=1$, we localize individual neurons which are highly relevant for a
particular feature, and perform a number of case studies to illustrate general
properties of LLMs. In particular, we show that early layers make use of sparse
combinations of neurons to represent many features in superposition, that
middle layers have seemingly dedicated neurons to represent higher-level
contextual features, and that increasing scale causes representational sparsity
to increase on average, but there are multiple types of scaling dynamics. In
all, we probe for over 100 unique features comprising 10 different categories
in 7 different models spanning 70 million to 6.9 billion parameters.
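As a concrete illustration of the probing setup described in the abstract, the sketch below trains a $k$-sparse linear probe on stored neuron activations. It uses a simple heuristic pipeline (univariate ranking of neurons followed by logistic regression on the top $k$), which is an assumption for illustration and not necessarily the exact sparse-classifier optimization used by the authors; the loader `load_activations_and_labels` is hypothetical.

```python
# Minimal sketch of k-sparse probing on neuron activations X (n_examples x n_neurons)
# from one layer, with binary feature labels y. Heuristic variant: rank neurons by a
# univariate score, then fit a logistic-regression probe on the top-k neurons.
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def sparse_probe(X, y, k, seed=0):
    """Train a k-sparse linear probe: select k neurons, fit a linear classifier, return held-out F1."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    scores, _ = f_classif(X_tr, y_tr)           # univariate relevance of each neuron
    top_k = np.argsort(scores)[-k:]              # indices of the k most relevant neurons
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, top_k], y_tr)
    return top_k, f1_score(y_te, clf.predict(X_te[:, top_k]))

# Sweeping k probes representational sparsity; k=1 localizes a single neuron.
# X, y = load_activations_and_labels(...)       # hypothetical loader, not from the paper
# for k in (1, 2, 4, 8, 16, 64, 256):
#     neurons, f1 = sparse_probe(X, y, k)
#     print(k, f1, neurons if k <= 8 else "...")
```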
Related papers
- Neuron-based Personality Trait Induction in Large Language Models [115.08894603023712]
Large language models (LLMs) have become increasingly proficient at simulating various personality traits.
We present a neuron-based approach for personality trait induction in LLMs.
arXiv Detail & Related papers (2024-10-16T07:47:45Z) - Exploring Behavior-Relevant and Disentangled Neural Dynamics with Generative Diffusion Models [2.600709013150986]
Understanding the neural basis of behavior is a fundamental goal in neuroscience.
Our approach, named BeNeDiff, first identifies a fine-grained and disentangled neural subspace.
It then employs state-of-the-art generative diffusion models to synthesize behavior videos that interpret the neural dynamics of each latent factor.
arXiv Detail & Related papers (2024-10-12T18:28:56Z) - Modularity in Transformers: Investigating Neuron Separability & Specialization [0.0]
Transformer models are increasingly prevalent in various applications, yet our understanding of their internal workings remains limited.
This paper investigates the modularity and task specialization of neurons within transformer architectures, focusing on both vision (ViT) and language (Mistral 7B) models.
Using a combination of selective pruning and MoEfication clustering techniques, we analyze the overlap and specialization of neurons across different tasks and data subsets.
arXiv Detail & Related papers (2024-08-30T14:35:01Z) - SPIN: Sparsifying and Integrating Internal Neurons in Large Language Models for Text Classification [6.227343685358882]
We present SPIN, a model-agnostic framework that sparsifies and integrates internal neurons from the intermediate layers of large language models for text classification.
SPIN significantly improves text classification accuracy, efficiency, and interpretability.
arXiv Detail & Related papers (2023-11-27T16:28:20Z) - Multilayer Multiset Neuronal Networks -- MMNNs [55.2480439325792]
The present work describes multilayer multiset neuronal networks incorporating two or more layers of coincidence similarity neurons.
The work also explores the utilization of counter-prototype points, which are assigned to the image regions to be avoided.
arXiv Detail & Related papers (2023-08-28T12:55:13Z) - The Expressive Leaky Memory Neuron: an Efficient and Expressive Phenomenological Neuron Model Can Solve Long-Horizon Tasks [64.08042492426992]
We introduce the Expressive Leaky Memory (ELM) neuron model, a biologically inspired model of a cortical neuron.
Our ELM neuron can accurately match the input-output relationship of a detailed biophysical cortical neuron model with under ten thousand trainable parameters.
We evaluate it on various tasks with demanding temporal structures, including the Long Range Arena (LRA) datasets.
arXiv Detail & Related papers (2023-06-14T13:34:13Z) - Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z) - Understanding Neural Coding on Latent Manifolds by Sharing Features and Dividing Ensembles [3.625425081454343]
Systems neuroscience relies on two complementary views of neural data, characterized by single neuron tuning curves and analysis of population activity.
These two perspectives combine elegantly in neural latent variable models that constrain the relationship between latent variables and neural activity.
We propose feature sharing across neural tuning curves, which significantly improves performance and leads to better-behaved optimization.
arXiv Detail & Related papers (2022-10-06T18:37:49Z) - Simple and complex spiking neurons: perspectives and analysis in a simple STDP scenario [0.7829352305480283]
Spiking neural networks (SNNs) are inspired by biology and neuroscience to create fast and efficient learning systems.
This work considers various neuron models from the literature and selects computational neuron models that are single-variable, efficient, and exhibit different types of complexity.
We conduct a comparative study of three simple integrate-and-fire (I&F) neuron models, namely the leaky I&F (LIF), the quadratic I&F (QIF), and the exponential I&F (EIF), to understand whether using more complex models increases system performance (standard forms of these models are sketched after this list).
arXiv Detail & Related papers (2022-06-28T10:01:51Z) - The Neural Coding Framework for Learning Generative Models [91.0357317238509]
We propose a novel neural generative model inspired by the theory of predictive processing in the brain.
In a similar way, artificial neurons in our generative model predict what neighboring neurons will do, and adjust their parameters based on how well the predictions match reality.
arXiv Detail & Related papers (2020-12-07T01:20:38Z) - Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy.
We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z)
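The LIF, QIF, and EIF models compared in the spiking-neuron entry above have standard textbook membrane equations; the minimal Euler-integration sketch below shows those forms. All parameter values are illustrative only and are not taken from that paper.

```python
# Euler-integration sketch of the three integrate-and-fire models (LIF, QIF, EIF).
# Equations are the standard textbook forms; parameters are illustrative, not from the paper.
import numpy as np

def simulate(dvdt, I, dt=0.1, t_max=200.0, v_rest=-65.0, v_th=-50.0, v_reset=-70.0):
    """Integrate a single-variable neuron model and return its spike times (ms)."""
    v, spikes = v_rest, []
    for step in range(int(t_max / dt)):
        v += dt * dvdt(v, I)
        if v >= v_th:                # threshold crossing -> spike and reset
            spikes.append(step * dt)
            v = v_reset
    return spikes

tau, R = 10.0, 1.0                   # membrane time constant (ms), input resistance
v_rest, v_c, delta_T, v_T = -65.0, -55.0, 2.0, -55.0

lif = lambda v, I: (-(v - v_rest) + R * I) / tau                                  # linear leak
qif = lambda v, I: (0.2 * (v - v_rest) * (v - v_c) + R * I) / tau                 # quadratic spike initiation
eif = lambda v, I: (-(v - v_rest) + delta_T * np.exp((v - v_T) / delta_T) + R * I) / tau  # exponential spike initiation

for name, model in (("LIF", lif), ("QIF", qif), ("EIF", eif)):
    print(name, len(simulate(model, I=20.0)), "spikes")
```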
This list is automatically generated from the titles and abstracts of the papers on this site.