Compositional Explanations of Neurons
- URL: http://arxiv.org/abs/2006.14032v2
- Date: Tue, 2 Feb 2021 23:46:51 GMT
- Title: Compositional Explanations of Neurons
- Authors: Jesse Mu, Jacob Andreas
- Abstract summary: We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts.
We use this procedure to answer several questions on interpretability in models for vision and natural language processing.
- Score: 52.71742655312625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe a procedure for explaining neurons in deep representations by
identifying compositional logical concepts that closely approximate neuron
behavior. Compared to prior work that uses atomic labels as explanations,
analyzing neurons compositionally allows us to more precisely and expressively
characterize their behavior. We use this procedure to answer several questions
on interpretability in models for vision and natural language processing.
First, we examine the kinds of abstractions learned by neurons. In image
classification, we find that many neurons learn highly abstract but
semantically coherent visual concepts, while other polysemantic neurons detect
multiple unrelated features; in natural language inference (NLI), neurons learn
shallow lexical heuristics from dataset biases. Second, we see whether
compositional explanations give us insight into model performance: vision
neurons that detect human-interpretable concepts are positively correlated with
task performance, while NLI neurons that fire for shallow heuristics are
negatively correlated with task performance. Finally, we show how compositional
explanations provide an accessible way for end users to produce simple
"copy-paste" adversarial examples that change model behavior in predictable
ways.
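The procedure described above scores candidate logical formulas of atomic concepts against a neuron's binarized activations and keeps the best-scoring formula. The sketch below illustrates one way such a search can be implemented, under assumptions common to this line of work (binary neuron and concept masks, intersection-over-union as the match score, beam search over AND / OR / AND NOT compositions); the function names, beam width, and maximum formula length here are illustrative choices, not taken from the paper's released code.

```python
import numpy as np

def iou(neuron_mask, formula_mask):
    """Intersection-over-union between a binary neuron mask and a formula mask."""
    inter = np.logical_and(neuron_mask, formula_mask).sum()
    union = np.logical_or(neuron_mask, formula_mask).sum()
    return inter / union if union > 0 else 0.0

def beam_search_explanation(neuron_mask, concept_masks, max_len=3, beam_size=10):
    """Beam search over logical formulas (AND / OR / AND NOT) of atomic concepts.

    neuron_mask:   bool array of shape (num_inputs,), True where the neuron
                   fires above a chosen activation threshold.
    concept_masks: dict mapping concept name -> bool array of the same shape.
    Returns the best (formula string, IoU) found up to length max_len.
    """
    # Length-1 formulas: single atomic concepts.
    beam = [((name,), mask, iou(neuron_mask, mask))
            for name, mask in concept_masks.items()]
    beam.sort(key=lambda t: t[2], reverse=True)
    beam = beam[:beam_size]
    best = beam[0]

    for _ in range(max_len - 1):
        candidates = []
        for formula, mask, _ in beam:
            for name, cmask in concept_masks.items():
                for op in ("AND", "OR", "AND NOT"):
                    if op == "AND":
                        new_mask = mask & cmask
                    elif op == "OR":
                        new_mask = mask | cmask
                    else:  # AND NOT
                        new_mask = mask & ~cmask
                    candidates.append((formula + (op, name), new_mask,
                                       iou(neuron_mask, new_mask)))
        candidates.sort(key=lambda t: t[2], reverse=True)
        beam = candidates[:beam_size]
        if beam and beam[0][2] > best[2]:
            best = beam[0]
    return " ".join(best[0]), best[2]
```

For a vision model, `neuron_mask` would typically come from thresholding a unit's upsampled activation map at a high quantile of its activations, in the style of network dissection; for NLI, masks are defined over input sentences.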
Related papers
- Interpreting the Second-Order Effects of Neurons in CLIP [73.54377859089801]
We interpret the function of individual neurons in CLIP by automatically describing them using text.
We present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output.
Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities.
arXiv Detail & Related papers (2024-06-06T17:59:52Z)
- Investigating the Encoding of Words in BERT's Neurons using Feature Textualization [11.943486282441143]
We propose a technique to produce representations of neurons in embedding word space.
We find that the produced representations can provide insights about the encoded knowledge in individual neurons.
arXiv Detail & Related papers (2023-11-14T15:21:49Z)
- Identifying Interpretable Visual Features in Artificial and Biological Neural Systems [3.604033202771937]
Single neurons in neural networks are often interpretable in that they represent individual, intuitively meaningful features.
Many neurons exhibit "mixed selectivity", i.e., they represent multiple unrelated features.
We propose an automated method for quantifying visual interpretability and an approach for finding meaningful directions in network activation space.
arXiv Detail & Related papers (2023-10-17T17:41:28Z)
- Neuron to Graph: Interpreting Language Model Neurons at Scale [8.32093320910416]
This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within Large Language Models.
We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron's behaviour from the dataset it was trained on and translates it into an interpretable graph.
arXiv Detail & Related papers (2023-05-31T14:44:33Z)
- Disentangling Neuron Representations with Concept Vectors [0.0]
The main contribution of this paper is a method to disentangle polysemantic neurons into concept vectors encapsulating distinct features.
Our evaluations show that the concept vectors found encode coherent, human-understandable features.
arXiv Detail & Related papers (2023-04-19T14:55:31Z)
- Natural Language Descriptions of Deep Visual Features [50.270035018478666]
We introduce a procedure that automatically labels neurons with open-ended, compositional, natural language descriptions.
We use MILAN for analysis, characterizing the distribution and importance of neurons selective for attribute, category, and relational information in vision models.
We also use MILAN for auditing, surfacing neurons sensitive to protected categories like race and gender in models trained on datasets intended to obscure these features.
arXiv Detail & Related papers (2022-01-26T18:48:02Z)
- Emergence of Machine Language: Towards Symbolic Intelligence with Neural Networks [73.94290462239061]
We propose to combine symbolism and connectionism principles by using neural networks to derive a discrete representation.
By designing an interactive environment and task, we demonstrated that machines could generate a spontaneous, flexible, and semantic language.
arXiv Detail & Related papers (2022-01-14T14:54:58Z)
- Drop, Swap, and Generate: A Self-Supervised Approach for Generating Neural Activity [33.06823702945747]
We introduce a novel unsupervised approach for learning disentangled representations of neural activity called Swap-VAE.
Our approach combines a generative modeling framework with an instance-specific alignment loss.
We show that it is possible to build representations that disentangle neural datasets along relevant latent dimensions linked to behavior.
arXiv Detail & Related papers (2021-11-03T16:39:43Z)
- Compositional Processing Emerges in Neural Networks Solving Math Problems [100.80518350845668]
Recent progress in artificial neural networks has shown that when large models are trained on enough linguistic data, grammatical structure emerges in their representations.
We extend this work to the domain of mathematical reasoning, where it is possible to formulate precise hypotheses about how meanings should be composed.
Our work shows that neural networks are not only able to infer something about the structured relationships implicit in their training data, but can also deploy this knowledge to guide the composition of individual meanings into composite wholes.
arXiv Detail & Related papers (2021-05-19T07:24:42Z)
- Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy; a toy illustration of this appears after this list.
We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z)
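The XOR result mentioned in the last entry is easy to see with a worked example: on the four Boolean inputs, a single unit with weights (1, 1) has pre-activations 0, 1, 1, 2, so any activation that is high only near 1 separates XOR. The snippet below uses a generic Gaussian bump as a stand-in non-monotonic activation; it is not the paper's ADA formula, only an illustration of the underlying idea.

```python
import numpy as np

def bump(z, mu=1.0, sigma=0.3):
    """Stand-in non-monotonic activation peaking at z = mu.
    Not the paper's ADA formula; just a generic bump used to show why a
    non-monotonic activation lets a single unit compute XOR."""
    return np.exp(-((z - mu) ** 2) / (2 * sigma ** 2))

# Single unit: z = w1*x1 + w2*x2 + b with w = [1, 1], b = 0.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
z = X @ np.array([1.0, 1.0])          # pre-activations: 0, 1, 1, 2
y = (bump(z) > 0.5).astype(int)       # fires only when exactly one input is 1
print(y)                              # [0 1 1 0]  == XOR
```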