Engineering Monosemanticity in Toy Models
- URL: http://arxiv.org/abs/2211.09169v1
- Date: Wed, 16 Nov 2022 19:32:43 GMT
- Title: Engineering Monosemanticity in Toy Models
- Authors: Adam S. Jermyn, Nicholas Schiefer, and Evan Hubinger
- Abstract summary: In some neural networks, individual neurons correspond to natural ``features'' in the input.
We find that models can be made more monosemantic without increasing the loss by just changing which local minimum the training process finds.
We are able to mechanistically interpret these models, including the residual polysemantic neurons, and uncover a simple yet surprising algorithm.
- Score: 0.1474723404975345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In some neural networks, individual neurons correspond to natural
``features'' in the input. Such \emph{monosemantic} neurons are of great help
in interpretability studies, as they can be cleanly understood. In this work we
report preliminary attempts to engineer monosemanticity in toy models. We find
that models can be made more monosemantic without increasing the loss by just
changing which local minimum the training process finds. More monosemantic loss
minima have moderate negative biases, and we are able to use this fact to
engineer highly monosemantic models. We are able to mechanistically interpret
these models, including the residual polysemantic neurons, and uncover a simple
yet surprising algorithm. Finally, we find that providing models with more
neurons per layer makes the models more monosemantic, albeit at increased
computational cost. These findings point to a number of new questions and
avenues for engineering monosemanticity, which we intend to study in future
work.
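To make the abstract's setup concrete, below is a minimal sketch (an assumption-laden illustration, not the paper's exact model or metric) of a one-layer ReLU toy model where features are presented one at a time, biases are initialized to a moderate negative value, and each neuron is scored by the fraction of its response attributable to its single strongest feature. All names, constants, and the scoring rule are illustrative choices.

```python
# Minimal sketch (assumptions, not the paper's exact setup): a one-layer ReLU
# toy model mapping sparse "features" to neurons. Biases start at a moderate
# negative value; monosemanticity is scored per neuron as the share of its
# total response coming from its single most-activating feature.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 32, 16

W = rng.normal(scale=0.5, size=(n_features, n_neurons))
b = np.full(n_neurons, -0.3)  # moderate negative bias (illustrative value)

def neuron_responses(W, b):
    """Response of each neuron to each feature presented in isolation."""
    # One-hot feature inputs through a ReLU layer: shape (n_features, n_neurons)
    return np.maximum(0.0, np.eye(n_features) @ W + b)

def monosemanticity(acts, eps=1e-9):
    """Per-neuron score in [0, 1]: share of response from the top feature."""
    return acts.max(axis=0) / (acts.sum(axis=0) + eps)

scores = monosemanticity(neuron_responses(W, b))
print("mean monosemanticity:", scores.mean().round(3))
print("fraction of near-monosemantic neurons:", (scores > 0.9).mean().round(3))
```

One intuition (an interpretation, not a claim quoted from the abstract) is that a negative bias pushes weakly aligned features below the ReLU threshold, so each neuron's response is increasingly dominated by its strongest feature.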
Related papers
- Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness [68.69369585600698]
Deep learning models often suffer from a lack of interpretability due to polysemanticity.
Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability.
We show that monosemantic features not only enhance interpretability but also bring concrete gains in model performance.
arXiv Detail & Related papers (2024-10-27T18:03:20Z)
- Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective [30.290777756014748]
A monosemantic neuron is dedicated to a single and specific concept, which forms a one-to-one correlation between neurons and concepts.
Despite extensive research in monosemanticity probing, it remains unclear whether monosemanticity is beneficial or harmful to model capacity.
arXiv Detail & Related papers (2024-06-25T22:51:08Z)
- Learning from Emergence: A Study on Proactively Inhibiting the Monosemantic Neurons of Artificial Neural Networks [10.390475063385756]
We propose a new metric to measure the monosemanticity of neurons with the guarantee of efficiency for online computation.
We validate our conjecture that monosemanticity brings about performance change at different model scales.
arXiv Detail & Related papers (2023-12-17T14:42:46Z)
- A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints [87.08677547257733]
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning.
We show how to maximize the likelihood of a symbolic constraint w.r.t. the neural network's output distribution.
We also evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation.
arXiv Detail & Related papers (2023-12-06T20:58:07Z)
- On Modifying a Neural Network's Perception [3.42658286826597]
We propose a method which allows one to modify what an artificial neural network is perceiving regarding specific human-defined concepts.
We test the proposed method on different models, assessing whether the performed manipulations are well interpreted by the models, and analyzing how they react to them.
arXiv Detail & Related papers (2023-03-05T12:09:37Z)
- Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding [82.46024259137823]
We propose a cross-model comparative loss for a broad range of tasks.
We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks.
arXiv Detail & Related papers (2023-01-10T03:04:27Z)
- Interpreting Neural Networks through the Polytope Lens [0.2359380460160535]
Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level.
We study the way that piecewise linear activation functions partition the activation space into numerous discrete polytopes.
The polytope lens makes concrete predictions about the behavior of neural networks.
arXiv Detail & Related papers (2022-11-22T15:03:48Z)
- EINNs: Epidemiologically-Informed Neural Networks [75.34199997857341]
We introduce EINNs, a new class of physics-informed neural networks crafted for epidemic forecasting.
We investigate how to leverage both the theoretical flexibility provided by mechanistic models and the data-driven expressibility afforded by AI models.
arXiv Detail & Related papers (2022-02-21T18:59:03Z)
- The Neural Coding Framework for Learning Generative Models [91.0357317238509]
We propose a novel neural generative model inspired by the theory of predictive processing in the brain.
In a similar way, artificial neurons in our generative model predict what neighboring neurons will do, and adjust their parameters based on how well the predictions matched reality.
arXiv Detail & Related papers (2020-12-07T01:20:38Z)
- What is Learned in Visually Grounded Neural Syntax Acquisition [118.6461386981381]
We consider the case study of the Visually Grounded Neural Syntax Learner.
By constructing simplified versions of the model, we isolate the core factors that yield the model's strong performance.
We find that a simple lexical signal of noun concreteness plays the main role in the model's predictions.
arXiv Detail & Related papers (2020-05-04T17:32:20Z)