PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits
- URL: http://arxiv.org/abs/2404.06453v1
- Date: Tue, 9 Apr 2024 16:54:19 GMT
- Title: PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits
- Authors: Maximilian Dreyer, Erblina Purelku, Johanna Vielhaben, Wojciech Samek, Sebastian Lapuschkin
- Abstract summary: We present a method for disentangling polysemanticity of any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic "virtual" neurons.
We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet.
- Score: 12.17671779091913
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The field of mechanistic interpretability aims to study the role of individual neurons in Deep Neural Networks. Single neurons, however, have the capability to act polysemantically and encode for multiple (unrelated) features, which renders their interpretation difficult. We present a method for disentangling polysemanticity of any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic "virtual" neurons. This is achieved by identifying the relevant sub-graph ("circuit") for each "pure" feature. We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet. While evaluating feature visualizations using CLIP, our method effectively disentangles representations, improving upon methods based on neuron activations. Our code is available at https://github.com/maxdreyer/PURE.
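The abstract describes the idea only at a high level, so the following is a minimal sketch of one way to realize it, not the authors' implementation: assume that for each top-activating sample of a target neuron we already have attribution scores (e.g., LRP-style relevance) over its upstream channels, i.e., its "circuit"; clustering these relevance vectors splits the samples into groups, each of which can be read as one monosemantic "virtual" neuron. The function and variable names below are hypothetical.

```python
# Minimal sketch (not the PURE reference code): split a polysemantic neuron
# into "virtual" neurons by clustering the relevance that its top-activating
# samples assign to upstream channels (the neuron's "circuit").
import numpy as np
from sklearn.cluster import KMeans


def split_into_virtual_neurons(upstream_relevance: np.ndarray, n_virtual: int = 2) -> np.ndarray:
    """upstream_relevance: (n_samples, n_upstream_channels) attribution scores
    for one target neuron; returns a cluster label per sample."""
    # Normalize each sample's relevance vector so clustering compares *which*
    # upstream channels are used, not how strongly the neuron fired.
    norms = np.linalg.norm(upstream_relevance, axis=1, keepdims=True) + 1e-9
    directions = upstream_relevance / norms

    # Samples whose relevance flows through different sub-circuits land in
    # different clusters; each cluster is treated as one "virtual" neuron.
    return KMeans(n_clusters=n_virtual, n_init=10, random_state=0).fit_predict(directions)
```

Grouping the neuron's top-activating images by these labels and visualizing each group separately is one way to inspect whether the resulting features are indeed pure.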
Related papers
- Interpreting the Second-Order Effects of Neurons in CLIP [73.54377859089801]
We interpret the function of individual neurons in CLIP by automatically describing them using text.
We present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output.
Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities.
arXiv Detail & Related papers (2024-06-06T17:59:52Z) - Graph Neural Networks for Learning Equivariant Representations of Neural Networks [55.04145324152541]
We propose to represent neural networks as computational graphs of parameters.
Our approach enables a single model to encode neural computational graphs with diverse architectures.
We showcase the effectiveness of our method on a wide range of tasks, including classification and editing of implicit neural representations.
arXiv Detail & Related papers (2024-03-18T18:01:01Z) - Manipulating Feature Visualizations with Gradient Slingshots [54.31109240020007]
We introduce a novel method for manipulating Feature Visualization (FV) without significantly impacting the model's decision-making process.
We evaluate the effectiveness of our method on several neural network models and demonstrate its capabilities to hide the functionality of arbitrarily chosen neurons.
arXiv Detail & Related papers (2024-01-11T18:57:17Z) - Identifying Interpretable Visual Features in Artificial and Biological Neural Systems [3.604033202771937]
Single neurons in neural networks are often interpretable in that they represent individual, intuitively meaningful features.
Many neurons exhibit "mixed selectivity", i.e., they represent multiple unrelated features.
We propose an automated method for quantifying visual interpretability and an approach for finding meaningful directions in network activation space.
arXiv Detail & Related papers (2023-10-17T17:41:28Z) - Sparse Autoencoders Find Highly Interpretable Features in Language Models [0.0]
Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally.
We use sparse autoencoders to reconstruct the internal activations of a language model (a minimal code sketch of this setup follows after this list).
Our method may serve as a foundation for future mechanistic interpretability work.
arXiv Detail & Related papers (2023-09-15T17:56:55Z) - Permutation Equivariant Neural Functionals [92.0667671999604]
This work studies the design of neural networks that can process the weights or gradients of other neural networks.
We focus on the permutation symmetries that arise in the weights of deep feedforward networks because hidden layer neurons have no inherent order.
In our experiments, we find that permutation equivariant neural functionals are effective on a diverse set of tasks.
arXiv Detail & Related papers (2023-02-27T18:52:38Z) - Interpreting Neural Networks through the Polytope Lens [0.2359380460160535]
Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level.
We study the way that piecewise linear activation functions partition the activation space into numerous discrete polytopes.
The polytope lens makes concrete predictions about the behavior of neural networks.
arXiv Detail & Related papers (2022-11-22T15:03:48Z) - Open- and Closed-Loop Neural Network Verification using Polynomial Zonotopes [6.591194329459251]
We present a novel approach to efficiently compute tight non-convex enclosures of the image of neural networks with ReLU, sigmoid, or hyperbolic tangent activation functions.
In particular, we abstract the input-output relation of each neuron by a polynomial approximation.
This results in superior performance compared to other methods.
arXiv Detail & Related papers (2022-07-06T14:39:19Z) - Deep Polynomial Neural Networks [77.70761658507507]
$\Pi$-Nets are a new class of function approximators based on polynomial expansions.
$\Pi$-Nets produce state-of-the-art results in three challenging tasks, i.e., image generation, face verification, and 3D mesh representation learning.
arXiv Detail & Related papers (2020-06-20T16:23:32Z) - Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy.
We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z)
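As referenced in the sparse autoencoder entry above, here is a minimal sketch of a sparse autoencoder over model activations; the dictionary size, ReLU encoder, and L1 penalty are illustrative assumptions, not the cited paper's exact setup.

```python
# Minimal sketch (illustrative hyperparameters, not the cited paper's setup):
# a sparse autoencoder that reconstructs a model's internal activations so that
# individual dictionary features tend to be more monosemantic than raw neurons.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # overcomplete dictionary
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        codes = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(codes)             # reconstructed activations
        return recon, codes


def sae_loss(recon: torch.Tensor, acts: torch.Tensor, codes: torch.Tensor,
             l1_coeff: float = 1e-3) -> torch.Tensor:
    # Reconstruction error plus an L1 penalty that pushes the codes to be sparse.
    return ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().mean()
```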
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.