Related papers: Tackling Polysemanticity with Neuron Embeddings

Tackling Polysemanticity with Neuron Embeddings

URL: http://arxiv.org/abs/2411.08166v1
Date: Tue, 12 Nov 2024 20:19:39 GMT
Title: Tackling Polysemanticity with Neuron Embeddings
Authors: Alex Foote,
Abstract summary: We present neuron embeddings, a representation that can be used to tackle polysemanticity. We apply our method to GPT2-small, and provide a UI for exploring the results.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron's characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small, and provide a UI for exploring the results. Neuron embeddings are computed using a model's internal representations and weights, making them domain and architecture agnostic and removing the risk of introducing external structure which may not reflect a model's actual computation. We describe how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs).

Related papers

Deep learning with missing data [3.829599191332801]
We propose Pattern Embedded Neural Networks (PENNs), which can be applied in conjunction with any existing imputation technique. In addition to a neural network trained on the imputed data, PENNs pass the vectors of observation indicators through a second neural network to provide a compact representation. The outputs are then combined in a third neural network to produce final predictions.
arXiv Detail & Related papers (2025-04-21T18:57:36Z)
Interpreting the Second-Order Effects of Neurons in CLIP [73.54377859089801]
We interpret the function of individual neurons in CLIP by automatically describing them using text.<n>We present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output.<n>Our results indicate that an automated interpretation of neurons can be used for model deception and for introducing new model capabilities.
arXiv Detail & Related papers (2024-06-06T17:59:52Z)
PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits [12.17671779091913]
We present a method for disentangling polysemanticity of any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic "virtual" neurons. We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet.
arXiv Detail & Related papers (2024-04-09T16:54:19Z)
Manipulating Feature Visualizations with Gradient Slingshots [54.31109240020007]
We introduce a novel method for manipulating Feature Visualization (FV) without significantly impacting the model's decision-making process. We evaluate the effectiveness of our method on several neural network models and demonstrate its capabilities to hide the functionality of arbitrarily chosen neurons.
arXiv Detail & Related papers (2024-01-11T18:57:17Z)
WaLiN-GUI: a graphical and auditory tool for neuron-based encoding [73.88751967207419]
Neuromorphic computing relies on spike-based, energy-efficient communication. We develop a tool to identify suitable configurations for neuron-based encoding of sample-based data into spike trains. The WaLiN-GUI is provided open source and with documentation.
arXiv Detail & Related papers (2023-10-25T20:34:08Z)
Neuron to Graph: Interpreting Language Model Neurons at Scale [8.32093320910416]
This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within Large Language Models. We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron's behaviour from the dataset it was trained on and translates it into an interpretable graph.
arXiv Detail & Related papers (2023-05-31T14:44:33Z)
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models [0.0]
N2G is a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples to an interpretable graph. We use truncation and saliency methods to only present the important tokens, and augment the dataset examples with more diverse samples to better capture the extent of neuron behaviour. These graphs can be visualised to aid manual interpretation by researchers, but can also output token activations on text to compare to the neuron's ground truth activations for automatic validation.
arXiv Detail & Related papers (2023-04-22T19:06:13Z)
Neuroevolutionary algorithms driven by neuron coverage metrics for semi-supervised classification [60.60571130467197]
In some machine learning applications the availability of labeled instances for supervised classification is limited while unlabeled instances are abundant. We introduce neuroevolutionary approaches that exploit unlabeled instances by using neuron coverage metrics computed on the neural network architecture encoded by each candidate solution.
arXiv Detail & Related papers (2023-03-05T23:38:44Z)
The Neural Coding Framework for Learning Generative Models [91.0357317238509]
We propose a novel neural generative model inspired by the theory of predictive processing in the brain. In a similar way, artificial neurons in our generative model predict what neighboring neurons will do, and adjust their parameters based on how well the predictions matched reality.
arXiv Detail & Related papers (2020-12-07T01:20:38Z)
Learning identifiable and interpretable latent models of high-dimensional neural activity using pi-VAE [10.529943544385585]
We propose a method that integrates key ingredients from latent models and traditional neural encoding models. Our method, pi-VAE, is inspired by recent progress on identifiable variational auto-encoder. We validate pi-VAE using synthetic data, and apply it to analyze neurophysiological datasets from rat hippocampus and macaque motor cortex.
arXiv Detail & Related papers (2020-11-09T22:00:38Z)
Provably Efficient Neural Estimation of Structural Equation Model: An Adversarial Approach [144.21892195917758]
We study estimation in a class of generalized Structural equation models (SEMs) We formulate the linear operator equation as a min-max game, where both players are parameterized by neural networks (NNs), and learn the parameters of these neural networks using a gradient descent. For the first time we provide a tractable estimation procedure for SEMs based on NNs with provable convergence and without the need for sample splitting.
arXiv Detail & Related papers (2020-07-02T17:55:47Z)
Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy. We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.