Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
- URL: http://arxiv.org/abs/2502.06809v2
- Date: Wed, 21 May 2025 03:16:45 GMT
- Title: Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
- Authors: Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, Peizhong Ju, A. B. Siddique
- Abstract summary: We show that even highly salient neurons consistently exhibit polysemantic behavior. This observation motivates a shift from neuron attribution to range-based interpretation. We introduce NeuronLens, a novel range-based interpretation and manipulation framework.
- Score: 16.460751105639623
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interpreting the internal mechanisms of large language models (LLMs) is crucial for improving their trustworthiness and utility. Prior work has primarily focused on mapping individual neurons to discrete semantic concepts. However, such mappings struggle to handle the inherent polysemanticity in LLMs, where individual neurons encode multiple, distinct concepts. Through a comprehensive analysis of both encoder and decoder-based LLMs across diverse datasets, we observe that even highly salient neurons, identified via various attribution techniques for specific semantic concepts, consistently exhibit polysemantic behavior. Importantly, activation magnitudes for fine-grained concepts follow distinct, often Gaussian-like distributions with minimal overlap. This observation motivates a shift from neuron attribution to range-based interpretation. We hypothesize that interpreting and manipulating neuron activation ranges would enable more precise interpretability and targeted interventions in LLMs. To validate our hypothesis, we introduce NeuronLens, a novel range-based interpretation and manipulation framework that provides a finer view of neuron activation distributions to localize concept attribution within a neuron. Extensive empirical evaluations demonstrate that NeuronLens significantly reduces unintended interference, while maintaining precise manipulation of targeted concepts, outperforming neuron attribution.
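The range-based view described in the abstract lends itself to a compact illustration. The Python sketch below is not the authors' NeuronLens implementation; it uses synthetic activation samples and hypothetical helper names, and simply shows how one might fit a Gaussian range per fine-grained concept for a single polysemantic neuron, attribute a new activation to the concept whose range best explains it, and suppress only activations falling in a targeted concept's range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations of ONE neuron, recorded separately for three
# fine-grained concepts. In practice these would come from forwarding
# concept-labeled examples through an LLM; the nearly non-overlapping
# Gaussians below mimic the distributions described in the abstract.
concept_activations = {
    "concept_A": rng.normal(loc=0.5, scale=0.15, size=500),
    "concept_B": rng.normal(loc=1.8, scale=0.20, size=500),
    "concept_C": rng.normal(loc=3.1, scale=0.25, size=500),
}

# Range-based interpretation: fit a Gaussian (mean, std) per concept.
ranges = {
    concept: (acts.mean(), acts.std(ddof=1))
    for concept, acts in concept_activations.items()
}

def log_gauss(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def attribute(activation):
    """Attribute an activation value to the concept whose fitted range
    (Gaussian) explains it best."""
    return max(ranges, key=lambda c: log_gauss(activation, *ranges[c]))

def suppress(activation, concept, n_std=2.0):
    """Toy targeted intervention: zero the activation only when it falls
    inside the chosen concept's range, leaving other ranges untouched."""
    mu, sigma = ranges[concept]
    return 0.0 if abs(activation - mu) <= n_std * sigma else activation

print(attribute(0.6))               # -> concept_A
print(attribute(3.0))               # -> concept_C
print(suppress(1.75, "concept_B"))  # -> 0.0 (inside concept_B's range)
print(suppress(0.6, "concept_B"))   # -> 0.6 (outside, left untouched)
```

The toy intervention above only clamps activations that fall within the targeted concept's range, which is the intuition behind reducing unintended interference relative to zeroing the neuron for all inputs.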
Related papers
- NOBLE -- Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models [68.89389652724378]
NOBLE is a neural operator framework that learns a mapping from a continuous frequency-modulated embedding of interpretable neuron features to the somatic voltage response induced by current injection. It predicts distributions of neural dynamics accounting for the intrinsic experimental variability. NOBLE is the first scaled-up deep learning framework validated on real experimental data.
arXiv Detail & Related papers (2025-06-05T01:01:18Z)
- Concept-Guided Interpretability via Neural Chunking [54.73787666584143]
We show that neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We propose three methods to extract these emerging entities, complementing each other based on label availability and dimensionality. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data.
arXiv Detail & Related papers (2025-05-16T13:49:43Z)
- Discovering Chunks in Neural Embeddings for Interpretability [53.80157905839065]
We propose leveraging the principle of chunking to interpret artificial neural population activities. We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities. We identify similar recurring embedding states corresponding to concepts in the input, with perturbations to these states activating or inhibiting the associated concepts.
arXiv Detail & Related papers (2025-02-03T20:30:46Z)
- Compositional Concept-Based Neuron-Level Interpretability for Deep Reinforcement Learning [2.9539724161670167]
Deep reinforcement learning (DRL) has successfully addressed many complex control problems. Current DRL interpretability methods largely treat neural networks as black boxes. We propose a novel concept-based interpretability method that provides fine-grained explanations of DRL models at the neuron level.
arXiv Detail & Related papers (2025-02-02T06:05:49Z)
- QuantFormer: Learning to Quantize for Neural Activity Forecasting in Mouse Visual Cortex [26.499583552980248]
QuantFormer is a transformer-based model specifically designed for forecasting neural activity from two-photon calcium imaging data. QuantFormer sets a new benchmark in forecasting mouse visual cortex activity. It demonstrates robust performance and generalization across various stimuli and individuals.
arXiv Detail & Related papers (2024-12-10T07:44:35Z)
- Artificial Kuramoto Oscillatory Neurons [65.16453738828672]
We introduce Artificial Kuramoto Oscillatory Neurons (AKOrN) as a dynamical alternative to threshold units.
We show that this idea provides performance improvements across a wide spectrum of tasks.
We believe that these empirical results show the importance of our assumptions at the most basic neuronal level of neural representation.
arXiv Detail & Related papers (2024-10-17T17:47:54Z)
- Range, not Independence, Drives Modularity in Biologically Inspired Representations [52.48094670415497]
We develop a theory of when biologically inspired networks modularise their representation of source variables (sources). We derive necessary and sufficient conditions on a sample of sources that determine whether the neurons in an optimal linear autoencoder modularise. Our theory applies to any dataset, extending far beyond the case of statistical independence studied in previous work.
arXiv Detail & Related papers (2024-10-08T17:41:37Z)
- ConceptLens: from Pixels to Understanding [1.3466710708566176]
ConceptLens is an innovative tool designed to illuminate the intricate workings of deep neural networks (DNNs) by visualizing hidden neuron activations.
By integrating deep learning with symbolic methods, ConceptLens offers users a unique way to understand what triggers neuron activations.
arXiv Detail & Related papers (2024-10-04T20:49:12Z)
- Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis [19.472889262384818]
We find arithmetic ability resides within a limited number of attention heads, with each head specializing in distinct operations.
We introduce the Comparative Neuron Analysis (CNA) method, which identifies an internal logic chain consisting of four distinct stages from input to prediction.
arXiv Detail & Related papers (2024-09-21T13:46:54Z)
- Interpreting the Second-Order Effects of Neurons in CLIP [73.54377859089801]
We interpret the function of individual neurons in CLIP by automatically describing them using text. We present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output. Our results indicate that an automated interpretation of neurons can be used for model deception and for introducing new model capabilities.
arXiv Detail & Related papers (2024-06-06T17:59:52Z)
- Manipulating Feature Visualizations with Gradient Slingshots [54.31109240020007]
We introduce a novel method for manipulating Feature Visualization (FV) without significantly impacting the model's decision-making process.
We evaluate the effectiveness of our method on several neural network models and demonstrate its capabilities to hide the functionality of arbitrarily chosen neurons.
arXiv Detail & Related papers (2024-01-11T18:57:17Z)
- Learning from Emergence: A Study on Proactively Inhibiting the Monosemantic Neurons of Artificial Neural Networks [10.390475063385756]
We propose a new metric to measure the monosemanticity of neurons with the guarantee of efficiency for online computation.
We validate our conjecture that monosemanticity brings about performance change at different model scales.
arXiv Detail & Related papers (2023-12-17T14:42:46Z)
- Automated Natural Language Explanation of Deep Visual Neurons with Large Models [43.178568768100305]
This paper proposes a novel post-hoc framework for generating semantic explanations of neurons with large foundation models.
Our framework is designed to be compatible with various model architectures and datasets, enabling automated and scalable neuron interpretation.
arXiv Detail & Related papers (2023-10-16T17:04:51Z)
- Cones: Concept Neurons in Diffusion Models for Customized Generation [41.212255848052514]
This paper finds a small cluster of neurons in a diffusion model corresponding to a particular subject.
The concept neurons demonstrate magnetic properties in interpreting and manipulating generation results.
For large-scale applications, the concept neurons are environmentally friendly, as we only need to store a sparse cluster of integer indices instead of dense float32 values.
arXiv Detail & Related papers (2023-03-09T09:16:04Z)
- Overcoming the Domain Gap in Contrastive Learning of Neural Action Representations [60.47807856873544]
A fundamental goal in neuroscience is to understand the relationship between neural activity and behavior.
We generated a new multimodal dataset consisting of the spontaneous behaviors generated by fruit flies.
This dataset and our new set of augmentations promise to accelerate the application of self-supervised learning methods in neuroscience.
arXiv Detail & Related papers (2021-11-29T15:27:51Z)
- And/or trade-off in artificial neurons: impact on adversarial robustness [91.3755431537592]
The presence of a sufficient number of OR-like neurons in a network can lead to classification brittleness and increased vulnerability to adversarial attacks.
We define AND-like neurons and propose measures to increase their proportion in the network.
Experimental results on the MNIST dataset suggest that our approach holds promise as a direction for further exploration.
arXiv Detail & Related papers (2021-02-15T08:19:05Z)
- Compositional Explanations of Neurons [52.71742655312625]
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts.
We use this procedure to answer several questions on interpretability in models for vision and natural language processing.
arXiv Detail & Related papers (2020-06-24T20:37:05Z)
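The last entry above, Compositional Explanations of Neurons, describes a procedure for matching neurons to compositional logical concepts. The sketch below is a hedged illustration of that general idea rather than the paper's code: it scores a thresholded neuron activation mask against primitive concept masks and simple logical compositions of them by intersection-over-union, with all masks, names, and probabilities synthetic.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Toy setup: boolean masks over 1000 inputs (e.g., image regions) marking
# where each primitive concept is present, plus one neuron's thresholded
# activation mask.
n_inputs = 1000
concepts = {
    "water": rng.random(n_inputs) < 0.30,
    "blue":  rng.random(n_inputs) < 0.40,
    "river": rng.random(n_inputs) < 0.10,
}
# A neuron that, in this toy, mostly fires for "water AND NOT river".
neuron_mask = concepts["water"] & ~concepts["river"]
neuron_mask ^= rng.random(n_inputs) < 0.05  # flip ~5% of entries as noise

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    return (a & b).sum() / max((a | b).sum(), 1)

# Candidate explanations: primitives plus simple two-term compositions.
candidates = dict(concepts)
for (name1, m1), (name2, m2) in combinations(concepts.items(), 2):
    candidates[f"({name1} AND {name2})"] = m1 & m2
    candidates[f"({name1} OR {name2})"] = m1 | m2
    candidates[f"({name1} AND NOT {name2})"] = m1 & ~m2
    candidates[f"({name2} AND NOT {name1})"] = m2 & ~m1

# The "explanation" is the candidate whose mask best overlaps the neuron's.
best = max(candidates, key=lambda c: iou(neuron_mask, candidates[c]))
print(best, round(float(iou(neuron_mask, candidates[best])), 3))
# Expected (up to noise): "(water AND NOT river)" with the highest IoU.
```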
This list is automatically generated from the titles and abstracts of the papers on this site.