Interpreting Neural Networks through the Polytope Lens
- URL: http://arxiv.org/abs/2211.12312v1
- Date: Tue, 22 Nov 2022 15:03:48 GMT
- Title: Interpreting Neural Networks through the Polytope Lens
- Authors: Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob
Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour,
Connor Leahy
- Abstract summary: Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level.
We study the way that piecewise linear activation functions partition the activation space into numerous discrete polytopes.
The polytope lens makes concrete predictions about the behavior of neural networks.
- Score: 0.2359380460160535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mechanistic interpretability aims to explain what a neural network has
learned at a nuts-and-bolts level. What are the fundamental primitives of
neural network representations? Previous mechanistic descriptions have used
individual neurons or their linear combinations to understand the
representations a network has learned. But there are clues that neurons and
their linear combinations are not the correct fundamental units of description:
directions cannot describe how neural networks use nonlinearities to structure
their representations. Moreover, many instances of individual neurons and their
combinations are polysemantic (i.e. they have multiple unrelated meanings).
Polysemanticity makes interpreting the network in terms of neurons or
directions challenging since we can no longer assign a specific feature to a
neural unit. In order to find a basic unit of description that does not suffer
from these problems, we zoom in beyond just directions to study the way that
piecewise linear activation functions (such as ReLU) partition the activation
space into numerous discrete polytopes. We call this perspective the polytope
lens. The polytope lens makes concrete predictions about the behavior of neural
networks, which we evaluate through experiments on both convolutional image
classifiers and language models. Specifically, we show that polytopes can be
used to identify monosemantic regions of activation space (while directions are
not in general monosemantic) and that the density of polytope boundaries
reflects semantic boundaries. We also outline a vision for what mechanistic
interpretability might look like through the polytope lens.
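To make the partition concrete: for a piecewise linear network, the on/off pattern of its ReLUs on a given input (its "polytope code") identifies which polytope the input lies in, and every input with the same code is transformed by the same affine map. The snippet below is not from the paper; it is a minimal NumPy sketch using a randomly weighted two-layer ReLU MLP as a stand-in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer ReLU MLP with random weights (a stand-in for a trained network).
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)

def polytope_code(x):
    """Binary ReLU activation pattern of x across both layers.
    Inputs with identical codes lie in the same polytope and are therefore
    mapped to the output by the same affine function."""
    h1 = W1 @ x + b1
    h2 = W2 @ np.maximum(h1, 0.0) + b2
    return np.concatenate([h1 > 0, h2 > 0]).astype(int)

def boundary_crossings(x, y):
    """Hamming distance between polytope codes: a crude count of how many
    ReLU boundaries separate two points in activation space."""
    return int(np.sum(polytope_code(x) != polytope_code(y)))

x = rng.normal(size=8)
print(boundary_crossings(x, x + 1e-3 * rng.normal(size=8)))  # nearby point: usually few or no crossings
print(boundary_crossings(x, rng.normal(size=8)))             # distant point: typically many crossings
```

Counting such crossings between inputs gives a rough proxy for the polytope-boundary density that the abstract argues tracks semantic boundaries.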
Related papers
- Interpreting the Second-Order Effects of Neurons in CLIP [73.54377859089801]
We interpret the function of individual neurons in CLIP by automatically describing them using text.
We present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output.
Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities.
arXiv Detail & Related papers (2024-06-06T17:59:52Z) - PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits [12.17671779091913]
We present a method for disentangling polysemanticity of any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic "virtual" neurons.
We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet.
arXiv Detail & Related papers (2024-04-09T16:54:19Z) - What Causes Polysemanticity? An Alternative Origin Story of Mixed
Selectivity from Incidental Causes [14.623741848860037]
Polysemantic neurons -- neurons that activate for a set of unrelated features -- have been seen as a significant obstacle towards interpretability of task-optimized deep networks.
We show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data.
arXiv Detail & Related papers (2023-12-05T19:29:54Z) - Identifying Interpretable Visual Features in Artificial and Biological
Neural Systems [3.604033202771937]
Single neurons in neural networks are often interpretable in that they represent individual, intuitively meaningful features.
Many neurons exhibit mixed selectivity, i.e., they represent multiple unrelated features.
We propose an automated method for quantifying visual interpretability and an approach for finding meaningful directions in network activation space.
arXiv Detail & Related papers (2023-10-17T17:41:28Z) - Sparse Autoencoders Find Highly Interpretable Features in Language
Models [0.0]
Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally.
We use sparse autoencoders to reconstruct the internal activations of a language model; a minimal sketch of this idea appears after this list.
Our method may serve as a foundation for future mechanistic interpretability work.
arXiv Detail & Related papers (2023-09-15T17:56:55Z) - Disentangling Neuron Representations with Concept Vectors [0.0]
The main contribution of this paper is a method to disentangle polysemantic neurons into concept vectors encapsulating distinct features.
Our evaluations show that the concept vectors found encode coherent, human-understandable features.
arXiv Detail & Related papers (2023-04-19T14:55:31Z) - Natural Language Descriptions of Deep Visual Features [50.270035018478666]
We introduce a procedure that automatically labels neurons with open-ended, compositional, natural language descriptions.
We use MILAN for analysis, characterizing the distribution and importance of neurons selective for attribute, category, and relational information in vision models.
We also use MILAN for auditing, surfacing neurons sensitive to protected categories like race and gender in models trained on datasets intended to obscure these features.
arXiv Detail & Related papers (2022-01-26T18:48:02Z) - The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU network with standard Gaussian weights and uniformly distributed biases can separate two classes of data with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
arXiv Detail & Related papers (2021-07-31T10:25:26Z) - Compositional Explanations of Neurons [52.71742655312625]
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts.
We use this procedure to answer several questions on interpretability in models for vision and natural language processing.
arXiv Detail & Related papers (2020-06-24T20:37:05Z) - Deep Polynomial Neural Networks [77.70761658507507]
$\Pi$-Nets are a new class of function approximators based on polynomial expansions.
$\Pi$-Nets produce state-of-the-art results in three challenging tasks, i.e., image generation, face verification, and 3D mesh representation learning.
arXiv Detail & Related papers (2020-06-20T16:23:32Z) - Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy.
We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z)
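As a concrete companion to the "Sparse Autoencoders Find Highly Interpretable Features in Language Models" entry above, the sketch below shows the basic shape of that approach: train an overcomplete autoencoder with an L1 sparsity penalty to reconstruct cached model activations, so that each activation is explained by a few dictionary directions. This is a minimal PyTorch illustration, not the authors' implementation; the layer width, dictionary size, L1 coefficient, and random stand-in activations are all assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder trained to reconstruct hidden activations;
    the ReLU + L1 penalty encourages sparse, potentially interpretable codes."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        codes = torch.relu(self.encoder(acts))  # sparse feature coefficients
        recon = self.decoder(codes)             # reconstructed activations
        return recon, codes

def sae_loss(recon, acts, codes, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty on the codes.
    return torch.mean((recon - acts) ** 2) + l1_coeff * codes.abs().mean()

# One optimization step on a batch of (here random) cached activations;
# in practice these would come from a fixed layer of the language model.
acts = torch.randn(256, 512)
sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
opt.zero_grad()
recon, codes = sae(acts)
loss = sae_loss(recon, acts, codes)
loss.backward()
opt.step()
```

After training, the columns of the decoder weight matrix serve as candidate feature directions whose top-activating inputs can be inspected by hand.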
This list is automatically generated from the titles and abstracts of the papers on this site.