Codebook Features: Sparse and Discrete Interpretability for Neural
Networks
- URL: http://arxiv.org/abs/2310.17230v1
- Date: Thu, 26 Oct 2023 08:28:48 GMT
- Title: Codebook Features: Sparse and Discrete Interpretability for Neural
Networks
- Authors: Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman
- Abstract summary: We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable.
Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer.
We find that neural networks can operate under this extreme bottleneck with only modest degradation in performance.
- Score: 43.06828312515959
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding neural networks is challenging in part because of the dense,
continuous nature of their hidden states. We explore whether we can train
neural networks to have hidden states that are sparse, discrete, and more
interpretable by quantizing their continuous features into what we call
codebook features. Codebook features are produced by finetuning neural networks
with vector quantization bottlenecks at each layer, producing a network whose
hidden features are the sum of a small number of discrete vector codes chosen
from a larger codebook. Surprisingly, we find that neural networks can operate
under this extreme bottleneck with only modest degradation in performance. This
sparse, discrete bottleneck also provides an intuitive way of controlling
neural network behavior: first, find codes that activate when the desired
behavior is present, then activate those same codes during generation to elicit
that behavior. We validate our approach by training codebook Transformers on
several different datasets. First, we explore a finite state machine dataset
with far more hidden states than neurons. In this setting, our approach
overcomes the superposition problem by assigning states to distinct codes, and
we find that we can make the neural network behave as if it is in a different
state by activating the code for that state. Second, we train Transformer
language models with up to 410M parameters on two natural language datasets. We
identify codes in these models representing diverse, disentangled concepts
(ranging from negative emotions to months of the year) and find that we can
guide the model to generate different topics by activating the appropriate
codes during inference. Overall, codebook features appear to be a promising
unit of analysis and control for neural networks and interpretability. Our
codebase and models are open-sourced at
https://github.com/taufeeque9/codebook-features.
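
To make the mechanism concrete, here is a minimal, hedged PyTorch sketch of a codebook bottleneck: the hidden state at a layer is replaced by the sum of its k most similar codebook vectors, with straight-through gradients. The dot-product similarity, the hyperparameters, and the training-loss handling are assumptions for illustration; the authors' actual implementation is in the repository linked above.

```python
import torch
import torch.nn as nn


class CodebookBottleneck(nn.Module):
    """Quantize a hidden state into the sum of its k most similar codes.

    Illustrative sketch only: code count, k, and the similarity measure
    are assumptions, not the paper's exact settings.
    """

    def __init__(self, d_model: int, num_codes: int = 10_000, k: int = 8):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, d_model))
        self.k = k

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model); score every code at every position.
        sims = h @ self.codebook.T                    # (batch, seq, num_codes)
        topk = sims.topk(self.k, dim=-1).indices      # k active codes per position
        quantized = self.codebook[topk].sum(dim=-2)   # sum of the selected codes
        # Straight-through estimator: quantized value on the forward pass,
        # identity gradient w.r.t. h on the backward pass. A separate codebook
        # loss (e.g., pulling selected codes toward h.detach()) would be needed
        # to train the codebook itself; omitted here for brevity.
        return h + (quantized - h).detach()
```

Steering as described in the abstract then amounts to overriding `topk` during generation so that a chosen code is always among the active ones.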
Related papers
- A Sparse Quantized Hopfield Network for Online-Continual Memory [0.0]
Nervous systems learn online, where a stream of noisy data points is presented in a non-independent and identically distributed (non-i.i.d.) way.
Deep networks, on the other hand, typically use non-local learning algorithms and are trained in an offline, non-noisy, i.i.d. setting.
We implement this kind of model in a novel neural network called the Sparse Quantized Hopfield Network (SQHN).
arXiv Detail & Related papers (2023-07-27T17:46:17Z)
- Permutation Equivariant Neural Functionals [92.0667671999604]
This work studies the design of neural networks that can process the weights or gradients of other neural networks.
We focus on the permutation symmetries that arise in the weights of deep feedforward networks because hidden layer neurons have no inherent order.
In our experiments, we find that permutation equivariant neural functionals are effective on a diverse set of tasks.
arXiv Detail & Related papers (2023-02-27T18:52:38Z)
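
For intuition, here is a hedged, DeepSets-style sketch of a permutation-equivariant linear layer over the rows of a matrix (e.g., one row per hidden neuron of the network being processed); the paper's neural functionals handle full weight-space symmetries and are more elaborate.

```python
import torch
import torch.nn as nn


class PermEquivariantLinear(nn.Module):
    """f(X) = A(X) + B(row-mean of X): permuting the rows of the input
    permutes the rows of the output identically."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.a = nn.Linear(d_in, d_out)
        self.b = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n, d_in); the pooled term is identical for every row, so the
        # layer commutes with any reordering of the n rows.
        return self.a(x) + self.b(x.mean(dim=0, keepdim=True))


layer = PermEquivariantLinear(16, 8)
x, perm = torch.randn(5, 16), torch.randperm(5)
assert torch.allclose(layer(x)[perm], layer(x[perm]), atol=1e-6)  # equivariance check
```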
- Variable Bitrate Neural Fields [75.24672452527795]
We present a dictionary method for compressing feature grids, reducing their memory consumption by up to 100x.
We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available.
arXiv Detail & Related papers (2022-06-15T17:58:34Z)
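
A hedged sketch of the vector-quantized auto-decoder idea: each grid cell stores only an index into a small learned codebook, so storage drops from a full feature vector to a few bits per cell. The soft-to-hard straight-through selection below is an assumed mechanism, not necessarily the paper's exact scheme.

```python
import torch
import torch.nn as nn


class VQFeatureGrid(nn.Module):
    def __init__(self, num_cells: int, codebook_size: int = 256, d_feat: int = 16):
        super().__init__()
        # Learned per-cell logits; after training, only the argmax index
        # (8 bits per cell here) needs to be stored.
        self.logits = nn.Parameter(torch.zeros(num_cells, codebook_size))
        self.codebook = nn.Parameter(torch.randn(codebook_size, d_feat))

    def forward(self, cell_ids: torch.Tensor) -> torch.Tensor:
        logits = self.logits[cell_ids]                        # (batch, codebook_size)
        soft = logits.softmax(dim=-1)
        hard = nn.functional.one_hot(logits.argmax(dim=-1),
                                     logits.shape[-1]).float()
        # Straight-through: discrete lookup on the forward pass, soft gradients
        # on the backward pass, so indices are learned end-to-end without
        # direct supervision.
        return (soft + (hard - soft).detach()) @ self.codebook  # (batch, d_feat)
```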
- How and what to learn: The modes of machine learning [7.085027463060304]
We propose a new approach, weight pathway analysis (WPA), to study the mechanisms of multilayer neural networks.
WPA shows that a neural network stores and utilizes information in a "holographic" way, that is, the network encodes all training samples in a coherent structure.
It is found that hidden-layer neurons self-organize into different classes in the later stages of the learning process.
arXiv Detail & Related papers (2022-02-28T14:39:06Z)
- Leveraging Sparse Linear Layers for Debuggable Deep Networks [86.94586860037049]
We show how fitting sparse linear models over learned deep feature representations can lead to more debuggable neural networks.
The resulting sparse explanations can help to identify spurious correlations, explain misclassifications, and diagnose model biases in vision and language tasks.
arXiv Detail & Related papers (2021-05-11T08:15:25Z)
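
The recipe is easy to reproduce in sketch form: freeze the deep features, fit an L1-regularized linear model on top, and inspect the few surviving weights. The synthetic features and the regularization strength below are placeholders, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 512))                  # stand-in for frozen deep features
labels = (feats[:, 3] - feats[:, 7] > 0).astype(int)  # toy binary target

# The L1 penalty drives most coefficients to exactly zero.
probe = LogisticRegression(penalty="l1", solver="saga", C=0.05, max_iter=5000)
probe.fit(feats, labels)

active = np.flatnonzero(probe.coef_[0])
print(f"{active.size} nonzero weights out of {feats.shape[1]}:", active)
# The handful of active features are the ones to inspect for spurious
# correlations or bias, per the paper's debugging workflow.
```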
- Reservoir Memory Machines as Neural Computers [70.5993855765376]
Differentiable neural computers extend artificial neural networks with an explicit memory without interference.
We achieve some of the computational capabilities of differentiable neural computers with a model that can be trained very efficiently.
arXiv Detail & Related papers (2020-09-14T12:01:30Z)
- Binary autoencoder with random binary weights [0.0]
It is shown that the sparse activation of the hidden layer arises naturally in order to preserve information between layers.
With a large enough hidden layer, it is possible to get zero reconstruction error for any input just by varying the thresholds of neurons.
The model is similar to an olfactory perception system of a fruit fly, and the presented theoretical results give useful insights toward understanding more complex neural networks.
arXiv Detail & Related papers (2020-04-30T12:13:19Z)
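
A hedged toy version of the setup, assuming the natural reading of the abstract: fixed random binary weights, with only the neuron thresholds as free knobs. The specific encode/decode form and threshold values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 64, 2048                        # large hidden layer, as in the abstract
W = rng.integers(0, 2, size=(n_hid, n_vis))    # fixed random binary weights


def encode(x, theta):
    # Hidden unit j fires iff its overlap with x clears its threshold.
    return (W @ x >= theta).astype(int)


def decode(h, theta_out):
    return (W.T @ h >= theta_out).astype(int)


x = rng.integers(0, 2, size=n_vis)
h = encode(x, theta=np.full(n_hid, 0.6 * x.sum()))   # higher thresholds -> sparser code
x_hat = decode(h, theta_out=0.55 * h.sum())          # illustrative output threshold
print("hidden sparsity:", h.mean(), "| reconstruction errors:", int((x != x_hat).sum()))
```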
- On Tractable Representations of Binary Neural Networks [23.50970665150779]
We consider the compilation of a binary neural network's decision function into tractable representations such as Ordered Binary Decision Diagrams (OBDDs) and Sentential Decision Diagrams (SDDs).
In experiments, we show that it is feasible to obtain compact representations of neural networks as SDDs.
arXiv Detail & Related papers (2020-04-05T03:21:26Z)
- Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy.
We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z)