NeuronScope: A Multi-Agent Framework for Explaining Polysemantic Neurons in Language Models
- URL: http://arxiv.org/abs/2601.03671v1
- Date: Wed, 07 Jan 2026 07:50:47 GMT
- Title: NeuronScope: A Multi-Agent Framework for Explaining Polysemantic Neurons in Language Models
- Authors: Weiqi Liu, Yongliang Miao, Haiyan Zhao, Yanguang Liu, Mengnan Du
- Abstract summary: NeuronScope is a multi-agent framework that reformulates neuron interpretation as an iterative, activation-guided process. We show that NeuronScope uncovers hidden polysemanticity and produces explanations with significantly higher activation correlation compared to single-pass baselines.
- Score: 24.550940304055562
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neuron-level interpretation in large language models (LLMs) is fundamentally challenged by widespread polysemanticity, where individual neurons respond to multiple distinct semantic concepts. Existing single-pass interpretation methods struggle to faithfully capture such multi-concept behavior. In this work, we propose NeuronScope, a multi-agent framework that reformulates neuron interpretation as an iterative, activation-guided process. NeuronScope explicitly deconstructs neuron activations into atomic semantic components, clusters them into distinct semantic modes, and iteratively refines each explanation using neuron activation feedback. Experiments demonstrate that NeuronScope uncovers hidden polysemanticity and produces explanations with significantly higher activation correlation compared to single-pass baselines.
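The loop the abstract describes (deconstruct activations, cluster into modes, refine against activation feedback) can be pictured with a minimal sketch. Everything below is illustrative: `embed`, `propose`, and `predict` are hypothetical stand-ins for the paper's LLM agents, and scoring is plain Pearson correlation; only the control flow is meant to mirror the description.
```python
import numpy as np
from sklearn.cluster import KMeans

def activation_correlation(pred: np.ndarray, true: np.ndarray) -> float:
    """Pearson correlation between predicted and measured activations."""
    return float(np.corrcoef(pred, true)[0, 1])

def explain_neuron(examples, activations, embed, propose, predict,
                   n_modes=3, iters=5):
    X = np.stack([embed(e) for e in examples])                    # deconstruct
    modes = KMeans(n_clusters=n_modes, n_init=10).fit_predict(X)  # cluster
    explanations = {}
    for m in range(n_modes):
        idx = np.where(modes == m)[0]
        expl = propose([examples[i] for i in idx])        # initial guess
        for _ in range(iters):                            # refine with feedback
            pred = np.array([predict(expl, examples[i]) for i in idx])
            score = activation_correlation(pred, activations[idx])
            expl = propose([examples[i] for i in idx], feedback=(expl, score))
        explanations[m] = expl
    return explanations
```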
Related papers
- Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering [60.23509717784518]
Existing mitigation methods predominantly focus on output-level adjustments, leaving internal mechanisms that give rise to hallucinations largely unexplored. We propose Contrastive Neuron Steering (CNS), which identifies image-specific neurons via contrastive analysis between clean and noisy inputs. CNS selectively amplifies informative neurons while suppressing perturbation-induced activations, producing more robust and semantically grounded visual representations.
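A hedged sketch of the contrastive selection step: neurons whose activation jumps under noisy input are treated as perturbation-induced and damped, while neurons stable across the two views are amplified. The top-k rule and scaling constants are assumptions for illustration, not values from the paper.
```python
import numpy as np

def contrastive_steering(act_clean, act_noisy, top_k=50, amp=1.5, damp=0.0):
    """act_clean, act_noisy: (num_neurons,) mean activations per neuron."""
    contrast = act_noisy - act_clean                       # perturbation response
    noise_driven = np.argsort(contrast)[-top_k:]           # rises under noise
    image_specific = np.argsort(np.abs(contrast))[:top_k]  # stable across views
    scale = np.ones_like(act_clean)
    scale[noise_driven] = damp                             # suppress
    scale[image_specific] = amp                            # amplify
    return scale   # elementwise multiplier for the layer's activations
```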
arXiv Detail & Related papers (2026-01-31T09:21:04Z)
- In Search of Grandmother Cells: Tracing Interpretable Neurons in Tabular Representations [1.503974529275767]
We show that some neurons exhibit moderate, statistically significant saliency and selectivity for high-level concepts. These findings suggest that interpretable neurons can emerge naturally and that they can, in some cases, be identified without resorting to more complex interpretability techniques.
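One concrete way to test a selectivity claim of this kind is a two-sample test on a neuron's activations with and without the concept present; the specific test below (Welch's t) is an illustrative assumption, not necessarily the paper's measure.
```python
import numpy as np
from scipy.stats import ttest_ind

def concept_selectivity(acts: np.ndarray, has_concept: np.ndarray):
    """acts: (n_samples,) neuron activations; has_concept: boolean mask."""
    pos, neg = acts[has_concept], acts[~has_concept]
    _, p = ttest_ind(pos, neg, equal_var=False)           # Welch's t-test
    effect = (pos.mean() - neg.mean()) / (acts.std() + 1e-8)
    return effect, p   # moderate effect with small p => selective neuron
```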
arXiv Detail & Related papers (2026-01-07T07:13:01Z)
- Neuronal Group Communication for Efficient Neural representation [85.36421257648294]
This paper addresses the question of how to build large neural systems that learn efficient, modular, and interpretable representations. We propose Neuronal Group Communication (NGC), a theory-driven framework that reimagines a neural network as a dynamical system of interacting neuronal groups. NGC treats weights as transient interactions between embedding-like neuronal states, with neural computation unfolding through iterative communication among groups of neurons.
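A toy numeric reading of "iterative communication among groups": each group holds an embedding-like state, and one computation step is a round of pairwise messages. The attention-style interaction below is an illustrative assumption, not the NGC formulation itself.
```python
import numpy as np

def ngc_step(states: np.ndarray, rounds: int = 4) -> np.ndarray:
    """states: (num_groups, dim), one embedding-like state per group."""
    for _ in range(rounds):
        sim = states @ states.T                       # transient interactions
        w = np.exp(sim - sim.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)             # normalize messages
        states = 0.5 * states + 0.5 * (w @ states)    # communicate, then update
    return states
```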
arXiv Detail & Related papers (2025-10-19T14:23:35Z)
- NOBLE -- Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models [63.592664795493725]
NOBLE is a neural operator framework that learns a mapping from a continuous frequency-modulated embedding of interpretable neuron features to the somatic voltage response induced by current injection. It predicts distributions of neural dynamics accounting for the intrinsic experimental variability. NOBLE is the first scaled-up deep learning framework that validates its generalization with real experimental data.
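The operator interface being described, sketched schematically: a map from an interpretable feature embedding plus an injected-current trace to a voltage trace. The tiny random two-layer network is a stand-in, not NOBLE's architecture.
```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 9))   # 8 feature dims + 1 current value
W2 = rng.normal(size=(1, 64))

def voltage_operator(features: np.ndarray, current: np.ndarray) -> np.ndarray:
    """features: (8,) neuron-feature embedding; current: (T,) -> (T,) voltage."""
    out = []
    for i_t in current:                       # evaluate pointwise along the trace
        x = np.concatenate([features, [i_t]])
        out.append((W2 @ np.tanh(W1 @ x)).item())
    return np.array(out)
```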
arXiv Detail & Related papers (2025-06-05T01:01:18Z)
- Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution [16.460751105639623]
We show that even highly salient neurons consistently exhibit polysemantic behavior. This observation motivates a shift from neuron attribution to range-based interpretation. We introduce NeuronLens, a novel range-based interpretation and manipulation framework.
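The shift from discrete attribution to range-based reading can be sketched by bucketing a neuron's activations and checking which concept dominates each bucket; the quantile bins and majority-vote rule here are illustrative assumptions, not NeuronLens itself.
```python
import numpy as np
from collections import Counter

def ranges_to_concepts(acts, concept_labels, n_bins=4):
    """acts: (n,) activations; concept_labels: (n,) concept tag per example."""
    edges = np.quantile(acts, np.linspace(0, 1, n_bins + 1))
    labels = np.asarray(concept_labels)
    mapping = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_range = (acts >= lo) & (acts <= hi)
        if in_range.any():                       # dominant concept per range
            top = Counter(labels[in_range]).most_common(1)[0][0]
            mapping[(float(lo), float(hi))] = top
    return mapping
```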
arXiv Detail & Related papers (2025-02-04T03:33:55Z)
- Artificial Kuramoto Oscillatory Neurons [65.16453738828672]
It has long been known in both neuroscience and AI that "binding" between neurons leads to a form of competitive learning where representations are compressed in order to represent more abstract concepts in deeper layers of the network. We introduce Artificial Kuramoto Oscillatory Neurons (AKOrN), which can be combined with arbitrary connectivity designs such as fully connected, convolutional, or attentive mechanisms. We show that this idea provides performance improvements across a wide spectrum of tasks such as unsupervised object discovery, adversarial robustness, uncertainty quantification, and reasoning.
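The dynamical ingredient is the classical Kuramoto update, where each unit carries a phase that is pulled toward the phases of coupled units; the coupling matrix and step size below are illustrative choices, not the paper's configuration.
```python
import numpy as np

def kuramoto_step(theta, omega, K, dt=0.1):
    """theta: (n,) phases; omega: (n,) natural frequencies; K: (n, n) coupling."""
    coupling = (K * np.sin(theta[None, :] - theta[:, None])).sum(axis=1)
    return theta + dt * (omega + coupling)   # synchronizing units "bind"
```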
arXiv Detail & Related papers (2024-10-17T17:47:54Z)
- Interpreting the Second-Order Effects of Neurons in CLIP [73.54377859089801]
We interpret the function of individual neurons in CLIP by automatically describing them using text. We present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output. Our results indicate that an automated interpretation of neurons can be used for model deception and for introducing new model capabilities.
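A rough sketch of a second-order trace: push one neuron's write direction through a later head's value/output matrices into the output space, then describe the result by its nearest text embeddings. The weight names are stand-ins for the corresponding CLIP matrices, not the paper's code.
```python
import numpy as np

def second_order_effect(neuron_dir, W_V, W_O, W_proj):
    """neuron_dir: (d,); W_V, W_O: (d, d) head maps; W_proj: (d_out, d)."""
    through_head = W_O @ (W_V @ neuron_dir)   # effect routed via the head
    return W_proj @ through_head              # direction in output space

def nearest_texts(direction, text_embs, texts, k=5):
    scores = text_embs @ direction / (np.linalg.norm(direction) + 1e-8)
    return [texts[i] for i in np.argsort(scores)[-k:][::-1]]
```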
arXiv Detail & Related papers (2024-06-06T17:59:52Z)
- Learning from Emergence: A Study on Proactively Inhibiting the Monosemantic Neurons of Artificial Neural Networks [10.390475063385756]
We propose a new metric that measures the monosemanticity of neurons and is efficient enough to compute online.
We validate our conjecture that monosemanticity brings about performance change at different model scales.
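One way such a metric can stay cheap online is to keep running per-concept activation means and measure how concentrated the mass is on the top concept; the concentration ratio below is an illustrative choice, not the paper's exact definition.
```python
import numpy as np

class OnlineMonosemanticity:
    """Running monosemanticity score with O(1) updates per sample."""
    def __init__(self, n_concepts: int):
        self.sums = np.zeros(n_concepts)
        self.counts = np.zeros(n_concepts)

    def update(self, activation: float, concept: int):
        self.sums[concept] += abs(activation)
        self.counts[concept] += 1

    def score(self) -> float:
        means = self.sums / np.maximum(self.counts, 1)
        return float(means.max() / (means.sum() + 1e-8))  # 1.0 => monosemantic
```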
arXiv Detail & Related papers (2023-12-17T14:42:46Z)
- Disentangling Neuron Representations with Concept Vectors [0.0]
The main contribution of this paper is a method to disentangle polysemantic neurons into concept vectors encapsulating distinct features.
Our evaluations show that the concept vectors found encode coherent, human-understandable features.
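A minimal sketch of one disentangling recipe: cluster the layer features of the inputs that most excite a neuron and take normalized cluster centroids as candidate concept vectors. KMeans is an assumption here; the paper's construction may differ.
```python
import numpy as np
from sklearn.cluster import KMeans

def concept_vectors(layer_acts, neuron_acts, n_concepts=3, top_frac=0.05):
    """layer_acts: (n, d) features; neuron_acts: (n,) one neuron's outputs."""
    k = max(n_concepts, int(top_frac * len(neuron_acts)))
    top = np.argsort(neuron_acts)[-k:]               # strongest activations
    centers = (KMeans(n_clusters=n_concepts, n_init=10)
               .fit(layer_acts[top]).cluster_centers_)
    return centers / np.linalg.norm(centers, axis=1, keepdims=True)
```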
arXiv Detail & Related papers (2023-04-19T14:55:31Z)
- Cones: Concept Neurons in Diffusion Models for Customized Generation [41.212255848052514]
This paper finds a small cluster of neurons in a diffusion model corresponding to a particular subject.
The concept neurons demonstrate magnetic properties in interpreting and manipulating generation results.
For large-scale applications, the concept neurons are environmentally friendly, as we only need to store a sparse cluster of integer indices instead of dense float32 values.
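The storage claim reads naturally in code: a customized subject is just a short list of integer neuron indices, and applying or removing it scales those positions. The indices and the zeroing rule below are hypothetical, for illustration only.
```python
import numpy as np

concept_idx = np.array([1031, 2077, 40960], dtype=np.int32)  # hypothetical indices

def toggle_concept(activations: np.ndarray, strength: float = 0.0) -> np.ndarray:
    """Scale only the concept neurons; strength=0.0 switches the subject off."""
    out = activations.copy()
    out[concept_idx] *= strength
    return out

# Persisting three int32 indices costs 12 bytes, versus a dense float32
# vector over every neuron, which is where the footprint savings come from.
```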
arXiv Detail & Related papers (2023-03-09T09:16:04Z)
- Neuron-based explanations of neural networks sacrifice completeness and interpretability [67.53271920386851]
We show that for AlexNet pretrained on ImageNet, neuron-based explanation methods sacrifice both completeness and interpretability. We show the most important principal components provide more complete and interpretable explanations than the most important neurons. Our findings suggest that explanation methods for networks like AlexNet should avoid using individual neurons as the basis for explanations.
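The comparison the paper draws can be sketched with a shared "coverage" proxy: explained variance under the top-k principal components versus under the k most-variant neurons. Using variance as a completeness proxy is an illustrative assumption.
```python
import numpy as np
from sklearn.decomposition import PCA

def pc_coverage(features: np.ndarray, k: int = 10) -> float:
    """features: (n_samples, n_neurons) activations from one layer."""
    return float(PCA(n_components=k).fit(features)
                 .explained_variance_ratio_.sum())

def neuron_coverage(features: np.ndarray, k: int = 10) -> float:
    var = features.var(axis=0)
    top = np.argsort(var)[-k:]                # k most-variant neurons
    return float(var[top].sum() / var.sum())  # same proxy, neuron basis
```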
arXiv Detail & Related papers (2020-11-05T21:26:03Z)
- Compositional Explanations of Neurons [52.71742655312625]
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts.
We use this procedure to answer several questions on interpretability in models for vision and natural language processing.
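A compact version of the procedure: binarize a neuron's activation map, then search small logical formulas over concept masks for the best IoU match. Only depth-2 AND/OR/AND-NOT compositions are searched here for brevity; the paper's search is more general.
```python
import numpy as np
from itertools import combinations

def iou(a: np.ndarray, b: np.ndarray) -> float:
    return float((a & b).sum() / max((a | b).sum(), 1))

def best_composition(neuron_mask, concepts):
    """neuron_mask: (n,) bool; concepts: dict name -> (n,) bool mask."""
    best = max((iou(neuron_mask, m), name) for name, m in concepts.items())
    for (n1, m1), (n2, m2) in combinations(concepts.items(), 2):
        for name, m in ((f"{n1} AND {n2}", m1 & m2),
                        (f"{n1} OR {n2}", m1 | m2),
                        (f"{n1} AND NOT {n2}", m1 & ~m2)):
            best = max(best, (iou(neuron_mask, m), name))
    return best   # (IoU score, formula)
```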
arXiv Detail & Related papers (2020-06-24T20:37:05Z)