Related papers: Neuron to Graph: Interpreting Language Model Neurons at Scale

Neuron to Graph: Interpreting Language Model Neurons at Scale

URL: http://arxiv.org/abs/2305.19911v1
Date: Wed, 31 May 2023 14:44:33 GMT
Title: Neuron to Graph: Interpreting Language Model Neurons at Scale
Authors: Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez
Abstract summary: This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within Large Language Models. We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron's behaviour from the dataset it was trained on and translates it into an interpretable graph.
Score: 8.32093320910416
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make them more interpretable and ultimately safe. Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to. We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron's behaviour from the dataset it was trained on and translates it into an interpretable graph. N2G uses truncation and saliency methods to emphasise only the most pertinent tokens to a neuron while enriching dataset examples with diverse samples to better encompass the full spectrum of neuron behaviour. These graphs can be visualised to aid researchers' manual interpretation, and can generate token activations on text for automatic validation by comparison with the neuron's ground truth activations, which we use to show that the model is better at predicting neuron activation than two baseline methods. We also demonstrate how the generated graph representations can be flexibly used to facilitate further automation of interpretability research, by searching for neurons with particular properties, or programmatically comparing neurons to each other to identify similar neurons. Our method easily scales to build graph representations for all neurons in a 6-layer Transformer model using a single Tesla T4 GPU, allowing for wide usability. We release the code and instructions for use at https://github.com/alexjfoote/Neuron2Graph.

Related papers

Interpreting the Second-Order Effects of Neurons in CLIP [73.54377859089801]
We interpret the function of individual neurons in CLIP by automatically describing them using text. We present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output. Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities.
arXiv Detail & Related papers (2024-06-06T17:59:52Z)
No One-Size-Fits-All Neurons: Task-based Neurons for Artificial Neural Networks [25.30801109401654]
Since the human brain is a task-based neuron user, can the artificial network design go from the task-based architecture design to the task-based neuron design? We propose a two-step framework for prototyping task-based neurons. Experiments show that the proposed task-based neuron design is not only feasible but also delivers competitive performance over other state-of-the-art models.
arXiv Detail & Related papers (2024-05-03T09:12:46Z)
Investigating the Encoding of Words in BERT's Neurons using Feature Textualization [11.943486282441143]
We propose a technique to produce representations of neurons in embedding word space. We find that the produced representations can provide insights about the encoded knowledge in individual neurons.
arXiv Detail & Related papers (2023-11-14T15:21:49Z)
WaLiN-GUI: a graphical and auditory tool for neuron-based encoding [73.88751967207419]
Neuromorphic computing relies on spike-based, energy-efficient communication. We develop a tool to identify suitable configurations for neuron-based encoding of sample-based data into spike trains. The WaLiN-GUI is provided open source and with documentation.
arXiv Detail & Related papers (2023-10-25T20:34:08Z)
Identifying Interpretable Visual Features in Artificial and Biological Neural Systems [3.604033202771937]
Single neurons in neural networks are often interpretable in that they represent individual, intuitively meaningful features. Many neurons exhibit $textitmixed selectivity$, i.e., they represent multiple unrelated features. We propose an automated method for quantifying visual interpretability and an approach for finding meaningful directions in network activation space.
arXiv Detail & Related papers (2023-10-17T17:41:28Z)
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models [0.0]
N2G is a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples to an interpretable graph. We use truncation and saliency methods to only present the important tokens, and augment the dataset examples with more diverse samples to better capture the extent of neuron behaviour. These graphs can be visualised to aid manual interpretation by researchers, but can also output token activations on text to compare to the neuron's ground truth activations for automatic validation.
arXiv Detail & Related papers (2023-04-22T19:06:13Z)
Self-Supervised Graph Representation Learning for Neuronal Morphologies [75.38832711445421]
We present GraphDINO, a data-driven approach to learn low-dimensional representations of 3D neuronal morphologies from unlabeled datasets. We show, in two different species and across multiple brain areas, that this method yields morphological cell type clusterings on par with manual feature-based classification by experts. Our method could potentially enable data-driven discovery of novel morphological features and cell types in large-scale datasets.
arXiv Detail & Related papers (2021-12-23T12:17:47Z)
The Neural Coding Framework for Learning Generative Models [91.0357317238509]
We propose a novel neural generative model inspired by the theory of predictive processing in the brain. In a similar way, artificial neurons in our generative model predict what neighboring neurons will do, and adjust their parameters based on how well the predictions matched reality.
arXiv Detail & Related papers (2020-12-07T01:20:38Z)
Compositional Explanations of Neurons [52.71742655312625]
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts. We use this procedure to answer several questions on interpretability in models for vision and natural language processing.
arXiv Detail & Related papers (2020-06-24T20:37:05Z)
Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy. We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.