Engineering Monosemanticity in Toy Models
- URL: http://arxiv.org/abs/2211.09169v1
- Date: Wed, 16 Nov 2022 19:32:43 GMT
- Title: Engineering Monosemanticity in Toy Models
- Authors: Adam S. Jermyn, Nicholas Schiefer, and Evan Hubinger
- Abstract summary: In some neural networks, individual neurons correspond to natural ``features'' in the input.
We find that models can be made more monosemantic without increasing the loss by just changing which local minimum the training process finds.
We are able to mechanistically interpret these models, including the residual polysemantic neurons, and uncover a simple yet surprising algorithm.
- Score: 0.1474723404975345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In some neural networks, individual neurons correspond to natural
``features'' in the input. Such \emph{monosemantic} neurons are of great help
in interpretability studies, as they can be cleanly understood. In this work we
report preliminary attempts to engineer monosemanticity in toy models. We find
that models can be made more monosemantic without increasing the loss by just
changing which local minimum the training process finds. More monosemantic loss
minima have moderate negative biases, and we are able to use this fact to
engineer highly monosemantic models. We are able to mechanistically interpret
these models, including the residual polysemantic neurons, and uncover a simple
yet surprising algorithm. Finally, we find that providing models with more
neurons per layer makes the models more monosemantic, albeit at increased
computational cost. These findings point to a number of new questions and
avenues for engineering monosemanticity, which we intend to study in future
work.
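To make the abstract's setup concrete, below is a minimal sketch (an assumption-laden illustration, not the paper's exact model or metric) of a one-layer ReLU toy model where features are presented one at a time, biases are initialized to a moderate negative value, and each neuron is scored by the fraction of its response attributable to its single strongest feature. All names, constants, and the scoring rule are illustrative choices.

```python
# Minimal sketch (assumptions, not the paper's exact setup): a one-layer ReLU
# toy model mapping sparse "features" to neurons. Biases start at a moderate
# negative value; monosemanticity is scored per neuron as the share of its
# total response coming from its single most-activating feature.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 32, 16

W = rng.normal(scale=0.5, size=(n_features, n_neurons))
b = np.full(n_neurons, -0.3)  # moderate negative bias (illustrative value)

def neuron_responses(W, b):
    """Response of each neuron to each feature presented in isolation."""
    # One-hot feature inputs through a ReLU layer: shape (n_features, n_neurons)
    return np.maximum(0.0, np.eye(n_features) @ W + b)

def monosemanticity(acts, eps=1e-9):
    """Per-neuron score in [0, 1]: share of response from the top feature."""
    return acts.max(axis=0) / (acts.sum(axis=0) + eps)

scores = monosemanticity(neuron_responses(W, b))
print("mean monosemanticity:", scores.mean().round(3))
print("fraction of near-monosemantic neurons:", (scores > 0.9).mean().round(3))
```

One intuition (an interpretation, not a claim quoted from the abstract) is that a negative bias pushes weakly aligned features below the ReLU threshold, so each neuron's response is increasingly dominated by its strongest feature.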
Related papers
- Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness [68.69369585600698]
Deep learning models often suffer from a lack of interpretability due to polysemanticity.
Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability.
We show that monosemantic features not only enhance interpretability but also bring concrete gains in model performance.
arXiv Detail & Related papers (2024-10-27T18:03:20Z)
- Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective [30.290777756014748]
A monosemantic neuron is dedicated to a single and specific concept, which forms a one-to-one correlation between neurons and concepts.
Despite extensive research in monosemanticity probing, it remains unclear whether monosemanticity is beneficial or harmful to model capacity.
arXiv Detail & Related papers (2024-06-25T22:51:08Z)
- Learning from Emergence: A Study on Proactively Inhibiting the Monosemantic Neurons of Artificial Neural Networks [10.390475063385756]
We propose a new metric to measure the monosemanticity of neurons with the guarantee of efficiency for online computation.
We validate our conjecture that monosemanticity brings about performance change at different model scales.
arXiv Detail & Related papers (2023-12-17T14:42:46Z)
- A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints [87.08677547257733]
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning.
We show how to maximize the likelihood of a symbolic constraint w.r.t. the neural network's output distribution.
We also evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation.
arXiv Detail & Related papers (2023-12-06T20:58:07Z)
- On Modifying a Neural Network's Perception [3.42658286826597]
We propose a method which allows one to modify what an artificial neural network is perceiving regarding specific human-defined concepts.
We test the proposed method on different models, assessing whether the performed manipulations are well interpreted by the models, and analyzing how they react to them.
arXiv Detail & Related papers (2023-03-05T12:09:37Z)
- Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding [82.46024259137823]
We propose a cross-model comparative loss for a broad range of tasks.
We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks.
arXiv Detail & Related papers (2023-01-10T03:04:27Z)
- Interpreting Neural Networks through the Polytope Lens [0.2359380460160535]
Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level.
We study the way that piecewise linear activation functions partition the activation space into numerous discrete polytopes.
The polytope lens makes concrete predictions about the behavior of neural networks.
arXiv Detail & Related papers (2022-11-22T15:03:48Z)
- EINNs: Epidemiologically-Informed Neural Networks [75.34199997857341]
We introduce EINNs, a new class of physics-informed neural networks crafted for epidemic forecasting.
We investigate how to leverage both the theoretical flexibility provided by mechanistic models and the data-driven expressibility afforded by AI models.
arXiv Detail & Related papers (2022-02-21T18:59:03Z)
- The Neural Coding Framework for Learning Generative Models [91.0357317238509]
We propose a novel neural generative model inspired by the theory of predictive processing in the brain.
In a similar way, artificial neurons in our generative model predict what neighboring neurons will do, and adjust their parameters based on how well the predictions matched reality.
arXiv Detail & Related papers (2020-12-07T01:20:38Z)
- What is Learned in Visually Grounded Neural Syntax Acquisition [118.6461386981381]
We consider the case study of the Visually Grounded Neural Syntax Learner.
By constructing simplified versions of the model, we isolate the core factors that yield the model's strong performance.
We find that a simple lexical signal of noun concreteness plays the main role in the model's predictions.
arXiv Detail & Related papers (2020-05-04T17:32:20Z)