Engineering Monosemanticity in Toy Models
- URL: http://arxiv.org/abs/2211.09169v1
- Date: Wed, 16 Nov 2022 19:32:43 GMT
- Title: Engineering Monosemanticity in Toy Models
- Authors: Adam S. Jermyn, Nicholas Schiefer, and Evan Hubinger
- Abstract summary: In some neural networks, individual neurons correspond to natural ``features'' in the input.
We find that models can be made more monosemantic without increasing the loss by just changing which local minimum the training process finds.
We are able to mechanistically interpret these models, including the residual polysemantic neurons, and uncover a simple yet surprising algorithm.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In some neural networks, individual neurons correspond to natural
``features'' in the input. Such \emph{monosemantic} neurons are of great help
in interpretability studies, as they can be cleanly understood. In this work we
report preliminary attempts to engineer monosemanticity in toy models. We find
that models can be made more monosemantic without increasing the loss by just
changing which local minimum the training process finds. More monosemantic loss
minima have moderate negative biases, and we are able to use this fact to
engineer highly monosemantic models. We are able to mechanistically interpret
these models, including the residual polysemantic neurons, and uncover a simple
yet surprising algorithm. Finally, we find that providing models with more
neurons per layer makes the models more monosemantic, albeit at increased
computational cost. These findings point to a number of new questions and
avenues for engineering monosemanticity, which we intend to study in future
work.
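The abstract's key lever, a moderately negative bias at an equally good loss minimum, is easy to try in code. Below is a minimal sketch, not the paper's exact architecture, data distribution, or hyperparameters: a one-hidden-layer ReLU model is trained to reconstruct sparse synthetic features with the hidden bias initialized to a moderately negative value, and a probe then measures how concentrated each neuron's response is on a single feature. The layer sizes, sparsity level, bias value, and the top-feature-share probe are all illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's exact setup).
import torch
import torch.nn as nn

torch.manual_seed(0)
n_features, n_hidden, batch = 16, 64, 1024

enc = nn.Linear(n_features, n_hidden)
dec = nn.Linear(n_hidden, n_features, bias=False)
nn.init.constant_(enc.bias, -1.0)  # moderate negative bias: the key knob

def sample_batch():
    # Sparse synthetic features: each active with prob. 0.1, value in [0, 1).
    x = torch.rand(batch, n_features)
    return x * (torch.rand(batch, n_features) < 0.1).float()

opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
for _ in range(2000):
    x = sample_batch()
    loss = ((dec(torch.relu(enc(x))) - x) ** 2).mean()  # reconstruction loss
    opt.zero_grad(); loss.backward(); opt.step()

# Probe: activate one feature at a time; a monosemantic neuron puts
# (almost) all of its activation mass on a single feature. Dead neurons
# score 0 under this probe.
with torch.no_grad():
    acts = torch.relu(enc(torch.eye(n_features)))          # (feature, neuron)
    share = acts.max(dim=0).values / (acts.sum(dim=0) + 1e-8)
print(f"mean top-feature activation share: {share.mean().item():.3f}")
```

If the abstract's findings carry over to this toy setting, raising n_hidden or making the bias moderately more negative should push the share toward 1; both knobs are worth sweeping rather than trusting the single values above.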
Related papers
- Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness [68.69369585600698]
Deep learning models often suffer from a lack of interpretability due to polysemanticity.
Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability.
We show that monosemantic features not only enhance interpretability but also bring concrete gains in model performance.
arXiv Detail & Related papers (2024-10-27T18:03:20Z) - Don't Cut Corners: Exact Conditions for Modularity in Biologically Inspired Representations [52.48094670415497]
We develop a theory of when biologically inspired representations modularise with respect to source variables (sources).
We derive necessary and sufficient conditions on a sample of sources that determine whether the neurons in an optimal biologically-inspired linear autoencoder modularise.
Our theory applies to any dataset, extending far beyond the case of statistical independence studied in previous work.
arXiv Detail & Related papers (2024-10-08T17:41:37Z) - Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective [30.290777756014748]
A monosemantic neuron is dedicated to a single and specific concept, which forms a one-to-one correspondence between neurons and concepts.
Despite extensive research in monosemanticity probing, it remains unclear whether monosemanticity is beneficial or harmful to model capacity.
arXiv Detail & Related papers (2024-06-25T22:51:08Z) - Learning from Emergence: A Study on Proactively Inhibiting the Monosemantic Neurons of Artificial Neural Networks [10.390475063385756]
We propose a new metric to measure the monosemanticity of neurons that is efficient enough for online computation (an illustrative sketch of the general idea appears after this list).
We validate our conjecture that monosemanticity brings about performance change at different model scales.
arXiv Detail & Related papers (2023-12-17T14:42:46Z) - A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints [87.08677547257733]
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning.
We show how to maximize the likelihood of a symbolic constraint w.r.t. the neural network's output distribution (a toy version of this constraint likelihood appears after this list).
We also evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation.
arXiv Detail & Related papers (2023-12-06T20:58:07Z) - On Modifying a Neural Network's Perception [3.42658286826597]
We propose a method which allows one to modify what an artificial neural network is perceiving regarding specific human-defined concepts.
We test the proposed method on different models, assessing whether the performed manipulations are well interpreted by the models, and analyzing how they react to them.
arXiv Detail & Related papers (2023-03-05T12:09:37Z) - Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding [82.46024259137823]
We propose a cross-model comparative loss for a broad range of tasks.
We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks.
arXiv Detail & Related papers (2023-01-10T03:04:27Z) - Interpreting Neural Networks through the Polytope Lens [0.2359380460160535]
Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level.
We study the way that piecewise linear activation functions partition the activation space into numerous discrete polytopes.
The polytope lens makes concrete predictions about the behavior of neural networks.
arXiv Detail & Related papers (2022-11-22T15:03:48Z) - EINNs: Epidemiologically-Informed Neural Networks [75.34199997857341]
We introduce EINNs, a new class of physics-informed neural networks crafted for epidemic forecasting.
We investigate how to leverage both the theoretical flexibility provided by mechanistic models and the data-driven expressibility afforded by AI models.
arXiv Detail & Related papers (2022-02-21T18:59:03Z) - The Neural Coding Framework for Learning Generative Models [91.0357317238509]
We propose a novel neural generative model inspired by the theory of predictive processing in the brain.
In a similar way, artificial neurons in our generative model predict what neighboring neurons will do, and adjust their parameters based on how well the predictions matched reality.
arXiv Detail & Related papers (2020-12-07T01:20:38Z)
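Two of the entries above are concrete enough to sketch. For the "Learning from Emergence" entry, the sketch below is an illustrative stand-in for an online monosemanticity metric, not the cited paper's actual one: it keeps a decayed running sum of activation mass per (feature, neuron) pair, so it can be updated batch-by-batch during training, and scores each neuron by the fraction of its mass on its top feature.

```python
import numpy as np

class OnlineMonosemanticity:
    """Running per-neuron monosemanticity estimate (illustrative stand-in,
    not the cited paper's metric). Assumes ground-truth feature indicators
    are available alongside neuron activations, as in toy models."""

    def __init__(self, n_features, n_neurons, decay=0.99):
        self.decay = decay
        self.mass = np.zeros((n_features, n_neurons))

    def update(self, feat_active, acts):
        # feat_active: (batch, n_features) in {0, 1}; acts: (batch, n_neurons) >= 0.
        self.mass = self.decay * self.mass + feat_active.T @ acts

    def score(self):
        # 1.0 = all of a neuron's activation mass sits on one feature.
        return self.mass.max(axis=0) / (self.mass.sum(axis=0) + 1e-8)

# Usage inside a training loop (hypothetical shapes and random stand-in data):
mono = OnlineMonosemanticity(n_features=16, n_neurons=64)
mono.update((np.random.rand(32, 16) < 0.1).astype(float), np.random.rand(32, 64))
print(mono.score().mean())
```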
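For the "Pseudo-Semantic Loss" entry, the quantity being maximized is the probability that a symbolic constraint holds under the model's output distribution. The toy below computes that probability exactly for a deliberately tiny constraint (exactly one of n bits on) under a factorized Bernoulli distribution; the constraint, sizes, and factorized distribution are illustrative assumptions, and the paper's autoregressive approximation is not reproduced here.

```python
import torch

def exactly_one_prob(p):
    # P(exactly one bit on) under independent Bernoulli(p_i):
    # sum_i p_i * prod_{j != i} (1 - p_j).
    q = (1 - p).clamp_min(1e-8)
    return (p * q.prod() / q).sum()

logits = torch.randn(5, requires_grad=True)
p = torch.sigmoid(logits)
loss = -torch.log(exactly_one_prob(p) + 1e-12)  # semantic-loss-style objective
loss.backward()  # gradients steer probability mass toward satisfying assignments
print(loss.item())
```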