Interpreting Neural Networks through the Polytope Lens
- URL: http://arxiv.org/abs/2211.12312v1
- Date: Tue, 22 Nov 2022 15:03:48 GMT
- Title: Interpreting Neural Networks through the Polytope Lens
- Authors: Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob
Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour,
Connor Leahy
- Abstract summary: Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level.
We study the way that piecewise linear activation functions partition the activation space into numerous discrete polytopes.
The polytope lens makes concrete predictions about the behavior of neural networks.
- Score: 0.2359380460160535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mechanistic interpretability aims to explain what a neural network has
learned at a nuts-and-bolts level. What are the fundamental primitives of
neural network representations? Previous mechanistic descriptions have used
individual neurons or their linear combinations to understand the
representations a network has learned. But there are clues that neurons and
their linear combinations are not the correct fundamental units of description:
directions cannot describe how neural networks use nonlinearities to structure
their representations. Moreover, many instances of individual neurons and their
combinations are polysemantic (i.e. they have multiple unrelated meanings).
Polysemanticity makes interpreting the network in terms of neurons or
directions challenging since we can no longer assign a specific feature to a
neural unit. In order to find a basic unit of description that does not suffer
from these problems, we zoom in beyond just directions to study the way that
piecewise linear activation functions (such as ReLU) partition the activation
space into numerous discrete polytopes. We call this perspective the polytope
lens. The polytope lens makes concrete predictions about the behavior of neural
networks, which we evaluate through experiments on both convolutional image
classifiers and language models. Specifically, we show that polytopes can be
used to identify monosemantic regions of activation space (while directions are
not in general monosemantic) and that the density of polytope boundaries
reflects semantic boundaries. We also outline a vision for what mechanistic
interpretability might look like through the polytope lens.
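To make the partition concrete: for a piecewise linear network, the on/off pattern of its ReLUs on a given input (its "polytope code") identifies which polytope the input lies in, and every input with the same code is transformed by the same affine map. The snippet below is not from the paper; it is a minimal NumPy sketch using a randomly weighted two-layer ReLU MLP as a stand-in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer ReLU MLP with random weights (a stand-in for a trained network).
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)

def polytope_code(x):
    """Binary ReLU activation pattern of x across both layers.
    Inputs with identical codes lie in the same polytope and are therefore
    mapped to the output by the same affine function."""
    h1 = W1 @ x + b1
    h2 = W2 @ np.maximum(h1, 0.0) + b2
    return np.concatenate([h1 > 0, h2 > 0]).astype(int)

def boundary_crossings(x, y):
    """Hamming distance between polytope codes: a crude count of how many
    ReLU boundaries separate two points in activation space."""
    return int(np.sum(polytope_code(x) != polytope_code(y)))

x = rng.normal(size=8)
print(boundary_crossings(x, x + 1e-3 * rng.normal(size=8)))  # nearby point: usually few or no crossings
print(boundary_crossings(x, rng.normal(size=8)))             # distant point: typically many crossings
```

Counting such crossings between inputs gives a rough proxy for the polytope-boundary density that the abstract argues tracks semantic boundaries.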
Related papers
- Interpreting the Second-Order Effects of Neurons in CLIP [73.54377859089801]
We interpret the function of individual neurons in CLIP by automatically describing them using text.
We present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output.
Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities.
arXiv Detail & Related papers (2024-06-06T17:59:52Z) - PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits [12.17671779091913]
We present a method for disentangling polysemanticity of any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic "virtual" neurons.
We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet.
arXiv Detail & Related papers (2024-04-09T16:54:19Z) - What Causes Polysemanticity? An Alternative Origin Story of Mixed
Selectivity from Incidental Causes [14.623741848860037]
Polysemantic neurons -- neurons that activate for a set of unrelated features -- have been seen as a significant obstacle towards interpretability of task-optimized deep networks.
We show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data.
arXiv Detail & Related papers (2023-12-05T19:29:54Z) - Identifying Interpretable Visual Features in Artificial and Biological
Neural Systems [3.604033202771937]
Single neurons in neural networks are often interpretable in that they represent individual, intuitively meaningful features.
Many neurons exhibit mixed selectivity, i.e., they represent multiple unrelated features.
We propose an automated method for quantifying visual interpretability and an approach for finding meaningful directions in network activation space.
arXiv Detail & Related papers (2023-10-17T17:41:28Z) - Sparse Autoencoders Find Highly Interpretable Features in Language
Models [0.0]
Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally.
We use sparse autoencoders to reconstruct the internal activations of a language model; a minimal sketch of this idea appears after this list.
Our method may serve as a foundation for future mechanistic interpretability work.
arXiv Detail & Related papers (2023-09-15T17:56:55Z) - Disentangling Neuron Representations with Concept Vectors [0.0]
The main contribution of this paper is a method to disentangle polysemantic neurons into concept vectors encapsulating distinct features.
Our evaluations show that the concept vectors found encode coherent, human-understandable features.
arXiv Detail & Related papers (2023-04-19T14:55:31Z) - Natural Language Descriptions of Deep Visual Features [50.270035018478666]
We introduce a procedure that automatically labels neurons with open-ended, compositional, natural language descriptions.
We use MILAN for analysis, characterizing the distribution and importance of neurons selective for attribute, category, and relational information in vision models.
We also use MILAN for auditing, surfacing neurons sensitive to protected categories like race and gender in models trained on datasets intended to obscure these features.
arXiv Detail & Related papers (2022-01-26T18:48:02Z) - The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU network with standard Gaussian weights and uniformly distributed biases can separate two classes of data with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
arXiv Detail & Related papers (2021-07-31T10:25:26Z) - Compositional Explanations of Neurons [52.71742655312625]
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts.
We use this procedure to answer several questions on interpretability in models for vision and natural language processing.
arXiv Detail & Related papers (2020-06-24T20:37:05Z) - Deep Polynomial Neural Networks [77.70761658507507]
$\Pi$-Nets are a new class of function approximators based on polynomial expansions.
$\Pi$-Nets produce state-of-the-art results in three challenging tasks, i.e., image generation, face verification, and 3D mesh representation learning.
arXiv Detail & Related papers (2020-06-20T16:23:32Z) - Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy.
We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z)
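As a concrete companion to the "Sparse Autoencoders Find Highly Interpretable Features in Language Models" entry above, the sketch below shows the basic shape of that approach: train an overcomplete autoencoder with an L1 sparsity penalty to reconstruct cached model activations, so that each activation is explained by a few dictionary directions. This is a minimal PyTorch illustration, not the authors' implementation; the layer width, dictionary size, L1 coefficient, and random stand-in activations are all assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder trained to reconstruct hidden activations;
    the ReLU + L1 penalty encourages sparse, potentially interpretable codes."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        codes = torch.relu(self.encoder(acts))  # sparse feature coefficients
        recon = self.decoder(codes)             # reconstructed activations
        return recon, codes

def sae_loss(recon, acts, codes, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty on the codes.
    return torch.mean((recon - acts) ** 2) + l1_coeff * codes.abs().mean()

# One optimization step on a batch of (here random) cached activations;
# in practice these would come from a fixed layer of the language model.
acts = torch.randn(256, 512)
sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
opt.zero_grad()
recon, codes = sae(acts)
loss = sae_loss(recon, acts, codes)
loss.backward()
opt.step()
```

After training, the columns of the decoder weight matrix serve as candidate feature directions whose top-activating inputs can be inspected by hand.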
This list is automatically generated from the titles and abstracts of the papers on this site.