N2G: A Scalable Approach for Quantifying Interpretable Neuron
Representations in Large Language Models
- URL: http://arxiv.org/abs/2304.12918v1
- Date: Sat, 22 Apr 2023 19:06:13 GMT
- Title: N2G: A Scalable Approach for Quantifying Interpretable Neuron
Representations in Large Language Models
- Authors: Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez
- Abstract summary: N2G is a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples to an interpretable graph.
We use truncation and saliency methods to only present the important tokens, and augment the dataset examples with more diverse samples to better capture the extent of neuron behaviour.
These graphs can be visualised to aid manual interpretation by researchers, but can also output token activations on text to compare to the neuron's ground truth activations for automatic validation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the function of individual neurons within language models is
essential for mechanistic interpretability research. We propose $\textbf{Neuron
to Graph (N2G)}$, a tool which takes a neuron and its dataset examples, and
automatically distills the neuron's behaviour on those examples to an
interpretable graph. This presents a less labour-intensive approach to
interpreting neurons than current manual methods, one that will scale better to
Large Language Models (LLMs). We use truncation and saliency methods
to only present the important tokens, and augment the dataset examples with
more diverse samples to better capture the extent of neuron behaviour. These
graphs can be visualised to aid manual interpretation by researchers, but can
also output token activations on text to compare to the neuron's ground truth
activations for automatic validation. N2G represents a step towards scalable
interpretability methods by allowing us to convert neurons in an LLM to
interpretable representations of measurable quality.
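As a rough illustration of the validation step (not the authors' implementation), the sketch below stores a neuron's distilled behaviour as a trie over salient token contexts and runs it over new text to predict per-token activations, which can then be compared against the neuron's ground-truth activations. The class name, helper methods, and the toy "except"-detecting neuron are illustrative assumptions.

```python
# Minimal sketch of an N2G-style neuron graph: a trie keyed on reversed
# salient contexts, queried to predict per-token activations on new text.
# Hypothetical names; not the paper's actual implementation.

from typing import Dict, List, Tuple

class NeuronTrie:
    def __init__(self) -> None:
        self.children: Dict[str, "NeuronTrie"] = {}
        self.activation: float = 0.0  # predicted activation if a context ends here

    def add(self, context: Tuple[str, ...], activation: float) -> None:
        # Insert the salient context right-to-left, so lookup can walk
        # backwards from the current token through its preceding context.
        node = self
        for token in reversed(context):
            node = node.children.setdefault(token, NeuronTrie())
        node.activation = max(node.activation, activation)

    def predict(self, tokens: List[str]) -> List[float]:
        # For each position, walk backwards through the context and report
        # the best-scoring stored suffix ending there (0.0 if none matches).
        preds = []
        for i in range(len(tokens)):
            node, best = self, 0.0
            for token in reversed(tokens[: i + 1]):
                if token not in node.children:
                    break
                node = node.children[token]
                best = max(best, node.activation)
            preds.append(best)
        return preds

# Usage: distilled examples for a hypothetical "except"-detecting neuron.
trie = NeuronTrie()
trie.add(("except",), 3.2)
trie.add(("accept", "that"), 1.1)

tokens = "I accept that nothing works except this".split()
predicted = trie.predict(tokens)
print(list(zip(tokens, predicted)))
```

Comparing `predicted` against the neuron's true activations on the same tokens (e.g. via correlation, or recall of firing positions) gives the measurable-quality validation the abstract describes.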
Related papers
- Tackling Polysemanticity with Neuron Embeddings [0.0]
We present neuron embeddings, a representation that can be used to tackle polysemanticity.
We apply our method to GPT2-small, and provide a UI for exploring the results.
arXiv Detail & Related papers (2024-11-12T20:19:39Z)
- Interpretable Language Modeling via Induction-head Ngram Models [74.26720927767398]
We propose Induction-head ngram models (Induction-Gram) to bolster modern ngram models with a hand-engineered "induction head".
This induction head uses a custom neural similarity metric to efficiently search the model's input context for potential next-word completions.
Experiments show that this simple method significantly improves next-word prediction over baseline interpretable models (a simplified sketch of the induction-head lookup appears after this list).
arXiv Detail & Related papers (2024-10-31T12:33:26Z)
- Interpreting the Second-Order Effects of Neurons in CLIP [73.54377859089801]
We interpret the function of individual neurons in CLIP by automatically describing them using text.
We present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output.
Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities.
arXiv Detail & Related papers (2024-06-06T17:59:52Z)
- Towards Generating Informative Textual Description for Neurons in Language Models [6.884227665279812]
We propose a framework that ties textual descriptions to neurons.
In particular, our experiments show that the proposed approach achieves 75% precision@2 and 50% recall@2.
arXiv Detail & Related papers (2024-01-30T04:06:25Z)
- Investigating the Encoding of Words in BERT's Neurons using Feature Textualization [11.943486282441143]
We propose a technique to produce representations of neurons in embedding word space.
We find that the produced representations can provide insights about the encoded knowledge in individual neurons.
arXiv Detail & Related papers (2023-11-14T15:21:49Z)
- WaLiN-GUI: a graphical and auditory tool for neuron-based encoding [73.88751967207419]
Neuromorphic computing relies on spike-based, energy-efficient communication.
We develop a tool to identify suitable configurations for neuron-based encoding of sample-based data into spike trains.
The WaLiN-GUI is provided open source and with documentation.
arXiv Detail & Related papers (2023-10-25T20:34:08Z)
- Identifying Interpretable Visual Features in Artificial and Biological Neural Systems [3.604033202771937]
Single neurons in neural networks are often interpretable in that they represent individual, intuitively meaningful features.
Many neurons exhibit $\textit{mixed selectivity}$, i.e., they represent multiple unrelated features.
We propose an automated method for quantifying visual interpretability and an approach for finding meaningful directions in network activation space.
arXiv Detail & Related papers (2023-10-17T17:41:28Z)
- Neuron to Graph: Interpreting Language Model Neurons at Scale [8.32093320910416]
This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within Large Language Models.
We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron's behaviour from the dataset it was trained on and translates it into an interpretable graph.
arXiv Detail & Related papers (2023-05-31T14:44:33Z)
- Self-Supervised Graph Representation Learning for Neuronal Morphologies [75.38832711445421]
We present GraphDINO, a data-driven approach to learn low-dimensional representations of 3D neuronal morphologies from unlabeled datasets.
We show, in two different species and across multiple brain areas, that this method yields morphological cell type clusterings on par with manual feature-based classification by experts.
Our method could potentially enable data-driven discovery of novel morphological features and cell types in large-scale datasets.
arXiv Detail & Related papers (2021-12-23T12:17:47Z)
- The Neural Coding Framework for Learning Generative Models [91.0357317238509]
We propose a novel neural generative model inspired by the theory of predictive processing in the brain.
In a similar way, artificial neurons in our generative model predict what neighboring neurons will do, and adjust their parameters based on how well the predictions matched reality.
arXiv Detail & Related papers (2020-12-07T01:20:38Z)
- Compositional Explanations of Neurons [52.71742655312625]
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts.
We use this procedure to answer several questions on interpretability in models for vision and natural language processing.
arXiv Detail & Related papers (2020-06-24T20:37:05Z)
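For the Induction-Gram entry above, the sketch below shows the basic induction-head lookup in its simplest form: exact matching of the recent context against earlier occurrences, counting which token followed each match. The paper's contribution replaces this exact match with a learned neural similarity metric; `match_len` and the toy sentence here are illustrative assumptions, not the paper's settings.

```python
# Simplified induction-head lookup: find earlier occurrences of the
# current context suffix and tally the tokens that followed them.
# Exact matching stands in for Induction-Gram's neural similarity metric.

from collections import Counter
from typing import List

def induction_head_completions(context: List[str], match_len: int = 2) -> Counter:
    """Search the context for earlier occurrences of its last `match_len`
    tokens and count which token followed each occurrence."""
    suffix = context[-match_len:]
    completions: Counter = Counter()
    # Scan every earlier window of the same length.
    for i in range(len(context) - match_len):
        if context[i : i + match_len] == suffix:
            completions[context[i + match_len]] += 1
    return completions

tokens = "the cat sat on the mat and the cat sat".split()
print(induction_head_completions(tokens))  # Counter({'on': 1})
```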