Language Model Circuits Are Sparse in the Neuron Basis
- URL: http://arxiv.org/abs/2601.22594v1
- Date: Fri, 30 Jan 2026 05:41:19 GMT
- Title: Language Model Circuits Are Sparse in the Neuron Basis
- Authors: Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann,
- Abstract summary: We show that MLP neurons are as sparse a feature basis as SAEs. This work advances automated interpretability of language models without additional training costs.
- Score: 50.460651620833055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as sparse autoencoders (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as circuit tracing. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of roughly 10^2 MLP neurons is enough to control model behaviour. On the multi-hop city → state → capital task from Lindsey et al., 2025, we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. "map city to its state"), and can be steered to change the model's output. This work thus advances automated interpretability of language models without additional training costs.
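To make the gradient-based attribution step concrete, the sketch below scores each MLP neuron by its activation times the gradient of a scalar task metric, then keeps the top-k neurons as the candidate circuit. The toy two-layer block, the metric, and all names are illustrative assumptions, not the authors' pipeline.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a transformer MLP block (illustrative, not the paper's model).
class ToyMLP(nn.Module):
    def __init__(self, d_model=16, d_hidden=64):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        self.neuron_acts = torch.relu(self.w_in(x))  # cache per-neuron activations
        self.neuron_acts.retain_grad()               # keep gradients for attribution
        return self.w_out(self.neuron_acts)

model = ToyMLP()
x = torch.randn(1, 16)

# A scalar "task metric" (in a real LM, e.g. a logit difference); here a probe.
metric = model(x)[0, 0]
metric.backward()

# Attribution score per neuron: activation * d(metric)/d(activation).
scores = (model.neuron_acts * model.neuron_acts.grad).squeeze(0)

# The candidate circuit is the set of top-k neurons by |score|.
k = 5
top = scores.abs().topk(k)
print("top neuron indices:", top.indices.tolist())
print("attribution scores:", scores[top.indices].tolist())
```

In a real setting the metric would be a behaviour-specific quantity such as a logit difference between competing verb forms, and scores would be aggregated over many task prompts before thresholding.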
Related papers
- Catwalk: Unary Top-K for Efficient Ramp-No-Leak Neuron Design for Temporal Neural Networks [3.0670569650183928]
We propose a Catwalk neuron implementation by relocating spikes in a spike volley as a sorted subset cluster via unary top-k. Catwalk is 1.39x and 1.86x better in area and power, respectively, compared to existing RNL neurons.
arXiv Detail & Related papers (2025-08-28T23:50:36Z)
- Minimal Neuron Circuits -- Part I: Resonators [1.1624569521079424]
Spiking neurons act as computational units that determine the decision to fire an action potential. This work presents a methodology to implement biologically plausible yet scalable spiking neurons in hardware. We show that it is more efficient to design neurons that mimic the I_Na,p + I_K model rather than the more complicated Hodgkin-Huxley model.
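For readers unfamiliar with the I_Na,p + I_K model, the sketch below integrates its two-variable dynamics (instantaneous persistent sodium plus a slower potassium gate) with forward Euler. The parameter values are illustrative, Izhikevich-style constants, not the ones used in the paper's hardware design.

```python
import math

# Illustrative I_Na,p + I_K parameters (Izhikevich-style; assumed, not the paper's).
C, g_L, E_L = 1.0, 8.0, -80.0
g_Na, E_Na = 20.0, 60.0
g_K, E_K = 10.0, -90.0
tau_n = 1.0

def m_inf(V):  # instantaneous persistent-Na activation
    return 1.0 / (1.0 + math.exp((-20.0 - V) / 15.0))

def n_inf(V):  # steady-state K activation
    return 1.0 / (1.0 + math.exp((-25.0 - V) / 5.0))

def simulate(I_ext, T=100.0, dt=0.01):
    V, n = -65.0, n_inf(-65.0)
    spikes, above = 0, False
    for _ in range(int(T / dt)):
        dV = (I_ext - g_L*(V - E_L) - g_Na*m_inf(V)*(V - E_Na) - g_K*n*(V - E_K)) / C
        dn = (n_inf(V) - n) / tau_n
        V += dt * dV
        n += dt * dn
        if V > 0 and not above:   # crude spike detection at 0 mV upcrossing
            spikes += 1
        above = V > 0
    return spikes

for I in (0.0, 10.0, 40.0):
    print(f"I = {I:5.1f} -> {simulate(I)} spikes in 100 ms")
```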
arXiv Detail & Related papers (2025-06-03T00:32:37Z)
- Flash Interpretability: Decoding Specialised Feature Neurons in Large Language Models with the LM-Head [0.0]
We show that it is possible to decode neuron weights directly into token probabilities through the final projection layer of a large language model. This is illustrated in Llama 3.1 8B where we use the LM-head to find examples of specialised feature neurons such as a "dog" neuron and a "California" neuron.
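The decoding idea can be sketched in a few lines: take a neuron's output-weight vector, project it through the LM-head (unembedding) matrix, and softmax to see which tokens the neuron promotes. The random weights and placeholder vocabulary below are assumptions; with a real model you would use its actual MLP output weights and LM-head.

```python
import torch

torch.manual_seed(0)
d_model, d_hidden, vocab = 32, 128, 10
toy_vocab = ["dog", "cat", "california", "paris", "run",
             "blue", "seven", "tree", "river", "moon"]  # placeholder tokens

# Toy stand-ins: per-neuron output weights and the LM-head / unembedding matrix.
W_out = torch.randn(d_hidden, d_model)   # row i = write-vector of MLP neuron i
W_U = torch.randn(d_model, vocab)        # final projection to token logits

def decode_neuron(i, top_k=3):
    logits = W_out[i] @ W_U              # what neuron i writes, read in token space
    probs = torch.softmax(logits, dim=-1)
    top = probs.topk(top_k)
    return [(toy_vocab[j], round(p.item(), 3))
            for j, p in zip(top.indices.tolist(), top.values)]

# A neuron whose decoded distribution is peaked on one token would be a
# candidate "specialised feature neuron" in the paper's sense.
for i in range(3):
    print(f"neuron {i}: {decode_neuron(i)}")
```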
arXiv Detail & Related papers (2025-01-05T23:35:47Z)
- No One-Size-Fits-All Neurons: Task-based Neurons for Artificial Neural Networks [25.30801109401654]
Since the human brain employs task-specific neurons, can artificial network design move from task-based architecture design to task-based neuron design?
We propose a two-step framework for prototyping task-based neurons.
Experiments show that the proposed task-based neuron design is not only feasible but also delivers competitive performance compared with other state-of-the-art models.
arXiv Detail & Related papers (2024-05-03T09:12:46Z)
- WaLiN-GUI: a graphical and auditory tool for neuron-based encoding [73.88751967207419]
Neuromorphic computing relies on spike-based, energy-efficient communication.
We develop a tool to identify suitable configurations for neuron-based encoding of sample-based data into spike trains.
The WaLiN-GUI is provided open source and with documentation.
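As a rough illustration of neuron-based encoding (not the WaLiN-GUI itself), the sketch below drives a leaky integrate-and-fire neuron with a sampled signal and records the resulting spike train; all constants are assumed, illustrative values.

```python
import math

def lif_encode(samples, dt=1e-3, tau=0.02, v_thresh=1.0, gain=60.0):
    """Encode a sampled signal into spike times with a leaky integrate-and-fire neuron."""
    v, spikes = 0.0, []
    for i, s in enumerate(samples):
        # Leaky integration: dv/dt = (-v + gain * input) / tau
        v += dt * (-v + gain * s) / tau
        if v >= v_thresh:          # threshold crossing emits a spike
            spikes.append(i * dt)
            v = 0.0                # reset after the spike
    return spikes

# Sample-based input: a rectified 5 Hz sine wave, 1 kHz sampling.
signal = [max(0.0, math.sin(2 * math.pi * 5 * t * 1e-3)) for t in range(200)]
spike_times = lif_encode(signal)
print(f"{len(spike_times)} spikes; first few: {[round(t, 3) for t in spike_times[:5]]}")
```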
arXiv Detail & Related papers (2023-10-25T20:34:08Z)
- Sparse Autoencoders Find Highly Interpretable Features in Language Models [0.0]
Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally.
We use sparse autoencoders to reconstruct the internal activations of a language model.
Our method may serve as a foundation for future mechanistic interpretability work.
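For context on the SAE technique that the headline paper benchmarks against, a minimal sparse autoencoder reconstructs activations through an overcomplete ReLU bottleneck with an L1 sparsity penalty. The dimensions, penalty weight, and synthetic "activations" below are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_act, d_dict = 32, 256          # overcomplete dictionary of latent features

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_act, d_dict)
        self.dec = nn.Linear(d_dict, d_act)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

# Stand-in for cached LM activations; in practice these come from a model run.
acts = torch.randn(4096, d_act)

for step in range(200):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final loss:", loss.item())
print("mean active features per input:", (feats > 0).float().sum(-1).mean().item())
```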
arXiv Detail & Related papers (2023-09-15T17:56:55Z)
- Neuron to Graph: Interpreting Language Model Neurons at Scale [8.32093320910416]
This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within Large Language Models.
We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron's behaviour from the dataset it was trained on and translates it into an interpretable graph.
arXiv Detail & Related papers (2023-05-31T14:44:33Z)
- Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding [82.46024259137823]
We propose a cross-model comparative loss for a broad range of tasks.
We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks.
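One plausible reading of a comparative loss, sketched here as an assumption rather than the paper's exact formulation: evaluate the task loss of the full model and of a neuron-ablated variant on the same batch, then add a hinge penalty whenever ablation does not hurt, signalling low neuronal utility.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
x, y = torch.randn(64, 8), torch.randint(0, 2, (64,))
ce = nn.CrossEntropyLoss()

def ablated_forward(x, drop_frac=0.5):
    # Forward pass with a random subset of hidden neurons zeroed out.
    h = torch.relu(model[0](x))
    mask = (torch.rand(h.shape[-1]) > drop_frac).float()
    return model[2](h * mask)

loss_full = ce(model(x), y)
loss_ablated = ce(ablated_forward(x), y)

# Hinge term: penalize the full model when ablation does not hurt it.
comparative = torch.relu(loss_full - loss_ablated)
total = loss_full + 0.1 * comparative   # 0.1 is an assumed weighting
total.backward()
print(f"full={loss_full.item():.3f} ablated={loss_ablated.item():.3f} "
      f"comparative={comparative.item():.3f}")
```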
arXiv Detail & Related papers (2023-01-10T03:04:27Z)
- Training Feedback Spiking Neural Networks by Implicit Differentiation on the Equilibrium State [66.2457134675891]
Spiking neural networks (SNNs) are brain-inspired models that enable energy-efficient implementation on neuromorphic hardware.
Most existing methods imitate the backpropagation framework and feedforward architectures for artificial neural networks.
We propose a novel training method that does not rely on the exact reverse of the forward computation.
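The training idea rests on the implicit function theorem: if the feedback network settles to an equilibrium z* = f(z*, x), the loss can be differentiated through z* by solving one linear system instead of unrolling the forward iterations. The tiny tanh fixed-point system below is an assumed stand-in for the paper's spiking model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
W = 0.15 * rng.standard_normal((d, d))  # feedback weights, scaled so f is a contraction
x = rng.standard_normal(d)

def f(z):
    return np.tanh(W @ z + x)

# Forward: iterate to the equilibrium z* = f(z*).
z = np.zeros(d)
for _ in range(200):
    z = f(z)

# Loss L = 0.5 * ||z*||^2, so dL/dz* = z*.
dL_dz = z

# Implicit function theorem: solve (I - J^T) v = dL/dz*, where J = df/dz at z*,
# then chain through the direct dependence of f on W.
s = 1.0 - np.tanh(W @ z + x) ** 2        # tanh' at the fixed point
J = s[:, None] * W                        # Jacobian df/dz
v = np.linalg.solve(np.eye(d) - J.T, dL_dz)
dL_dW = np.outer(v * s, z)                # dL/dW_ij = v_i * s_i * z_j

print("equilibrium residual:", np.abs(z - f(z)).max())
print("gradient norm dL/dW:", np.linalg.norm(dL_dW))
```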
arXiv Detail & Related papers (2021-09-29T07:46:54Z)
- The Neural Coding Framework for Learning Generative Models [91.0357317238509]
We propose a novel neural generative model inspired by the theory of predictive processing in the brain.
In a similar way, artificial neurons in our generative model predict what neighboring neurons will do, and adjust their parameters based on how well the predictions matched reality.
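To ground the description, here is a tiny predictive-coding sketch under assumed conventions: a higher layer predicts the lower layer's activity, the mismatch becomes a local error signal, and both the latent state and the weights are updated from that error alone, with no global backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_low, d_high = 8, 4
W = 0.1 * rng.standard_normal((d_low, d_high))  # generative weights: high -> low

def step(x_low, z_high, lr_z=0.1, lr_w=0.01):
    global W
    pred = W @ z_high                      # higher layer predicts lower-layer activity
    err = x_low - pred                     # local prediction error
    z_high = z_high + lr_z * (W.T @ err)   # latent moves to reduce the error
    W = W + lr_w * np.outer(err, z_high)   # Hebbian-style local weight update
    return z_high, float(np.mean(err ** 2))

x = rng.standard_normal(d_low)             # an observed "sensory" sample
z = np.zeros(d_high)
for t in range(50):
    z, mse = step(x, z)
print(f"prediction error after settling: {mse:.4f}")
```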
arXiv Detail & Related papers (2020-12-07T01:20:38Z)
- Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy.
We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
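The XOR claim is easy to sanity-check: a single linear unit followed by a non-monotonic, bump-shaped activation separates XOR, which no monotonic activation can do. The Gaussian bump below is a stand-in for the paper's ADA function, whose exact form is not reproduced here, and the weights are set by hand.

```python
import math

def bump(z):
    """Non-monotonic, apical-dendrite-style activation (a Gaussian bump stand-in)."""
    return math.exp(-z * z)

# One neuron: y = bump(w1*x1 + w2*x2 + b), weights chosen by hand.
w1, w2, b = 1.0, 1.0, -1.0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = bump(w1 * x1 + w2 * x2 + b)
    pred = int(y > 0.5)
    print(f"XOR({x1},{x2}) -> activation {y:.3f}, prediction {pred}")
```

The bump peaks exactly on the mixed inputs (0,1) and (1,0), where the pre-activation is zero, and falls off on (0,0) and (1,1), which is what makes the single unit sufficient.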
arXiv Detail & Related papers (2020-02-02T21:09:39Z)