NEAT: Concept driven Neuron Attribution in LLMs
- URL: http://arxiv.org/abs/2508.15875v1
- Date: Thu, 21 Aug 2025 10:36:00 GMT
- Title: NEAT: Concept driven Neuron Attribution in LLMs
- Authors: Vivek Hruday Kavuri, Gargi Shroff, Rahul Mishra
- Abstract summary: Locating the neurons responsible for final predictions is important for opening up black-box large language models.
We propose a method for locating the significant neurons responsible for representing certain concepts, and we term these neurons concept neurons.
- Score: 2.436631469537453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Locating the neurons responsible for final predictions is important for opening up black-box large language models and understanding their internal mechanisms. Previous studies have sought mechanisms that operate at the neuron level, but these methods fail to represent concepts and leave room for reducing the compute required. In this paper, with the help of concept vectors, we propose a method for locating the significant neurons responsible for representing certain concepts, and we term these neurons concept neurons. If the number of neurons is n and the number of examples is m, we reduce the number of forward passes required from O(n*m) to just O(n), thereby cutting the time and computation needed relative to previous works. We also compare our method with several baselines and previous methods; our results demonstrate better performance than most of them and are more efficient than the state-of-the-art method. As part of our ablation studies, we further optimize the search for concept neurons using clustering methods. Finally, we apply our method to locate concept neurons, turn them off, and analyze the implications for hate speech and bias in LLMs, evaluating the bias analysis in the Indian context. Our methodology, analysis, and explanations facilitate understanding of neuron-level responsibility for broader, human-like concepts and lay a path for future research on finding concept neurons and intervening on them.
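The attribution idea in the abstract can be made concrete. Below is a minimal, hedged sketch of concept-vector-based neuron attribution, assuming a GPT-2-style model loaded via Hugging Face transformers: the concept vector is built as a mean hidden state over a handful of concept examples, and each feed-forward neuron is scored by the cosine similarity between its output (value) vector and that concept vector. The example sentences, the mean-pooling construction, and the cosine scoring rule are illustrative assumptions, not the paper's exact procedure (the paper's own method still uses on the order of n forward passes); the sketch only illustrates why per-neuron scoring need not scale with the number of examples.

```python
# Hedged sketch: locating candidate "concept neurons" with a concept vector.
# Assumptions (not from the paper): GPT-2 via Hugging Face transformers, a
# concept vector built as a mean hidden state over example sentences, and
# neurons scored by the cosine similarity between the concept vector and each
# FFN neuron's output (value) vector.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

concept_examples = ["The dog barked.", "A puppy chased the ball."]  # toy "dog" concept

# Step 1: one forward pass per example (m passes total, independent of the
# number of neurons n) to build the concept vector.
with torch.no_grad():
    states = []
    for text in concept_examples:
        ids = tok(text, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[-1].mean(dim=1).squeeze(0))  # mean over tokens
concept_vec = torch.stack(states).mean(dim=0)  # (hidden_size,)

# Step 2: score every FFN neuron against the concept vector with no further
# forward passes; row i of c_proj.weight is neuron i's value vector in the
# residual stream.
scores = []
for layer_idx, block in enumerate(model.h):
    value_vectors = block.mlp.c_proj.weight  # (n_inner, hidden_size)
    sims = torch.nn.functional.cosine_similarity(
        value_vectors, concept_vec.unsqueeze(0), dim=1
    )
    for neuron_idx, s in enumerate(sims):
        scores.append((s.item(), layer_idx, neuron_idx))

for s, layer, idx in sorted(scores, reverse=True)[:10]:
    print(f"layer {layer}, neuron {idx}: cosine={s:.3f}")
```

A natural follow-up, in the spirit of the abstract's intervention study, would be to zero out the top-scoring neurons' activations with forward hooks and observe how generations about the concept change.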
Related papers
- In Search of Grandmother Cells: Tracing Interpretable Neurons in Tabular Representations [1.503974529275767]
We show that some neurons exhibit moderate, statistically significant saliency and selectivity for high-level concepts.
These findings suggest that interpretable neurons can emerge naturally and can, in some cases, be identified without resorting to more complex interpretability techniques.
arXiv Detail & Related papers (2026-01-07T07:13:01Z)
- Understanding Gated Neurons in Transformers from Their Input-Output Functionality [48.91500104957796]
We look at the cosine similarity between the input and output weights of a neuron; a minimal sketch of this comparison follows below.
We find that enrichment neurons dominate in early-middle layers, whereas later layers tend more towards depletion.
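As a rough illustration of that input-output comparison, here is a hedged sketch assuming GPT-2-style feed-forward neurons loaded via Hugging Face transformers. Treating a column of c_fc as a neuron's read (input) direction and the matching row of c_proj as its write (output) direction is an assumption about the weight layout, not necessarily the cited paper's exact setup; under that reading, a positive cosine suggests an enrichment neuron and a negative one a depletion neuron.

```python
# Hedged sketch of the input-output weight comparison described above,
# assuming GPT-2 FFN neurons: w_in is a neuron's read direction (a column
# of c_fc) and w_out its write direction (a row of c_proj).
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("gpt2").eval()
for layer_idx, block in enumerate(model.h):
    w_in = block.mlp.c_fc.weight     # (hidden_size, n_inner)
    w_out = block.mlp.c_proj.weight  # (n_inner, hidden_size)
    cos = torch.nn.functional.cosine_similarity(w_in.T, w_out, dim=1)
    print(f"layer {layer_idx}: mean input-output cosine = {cos.mean():.3f}")
```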
arXiv Detail & Related papers (2025-05-23T14:14:17Z)
- Revisiting Large Language Model Pruning using Neuron Semantic Attribution [63.62836612864512]
We conduct evaluations on 24 datasets and 4 tasks using popular pruning methods.
Surprisingly, we find a significant performance drop of existing pruning methods on sentiment classification tasks.
We propose Neuron Semantic Attribution, which learns to associate each neuron with specific semantics.
arXiv Detail & Related papers (2025-03-03T13:52:17Z)
- Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis [19.472889262384818]
We find arithmetic ability resides within a limited number of attention heads, with each head specializing in distinct operations.
We introduce the Comparative Neuron Analysis (CNA) method, which identifies an internal logic chain consisting of four distinct stages from input to prediction.
arXiv Detail & Related papers (2024-09-21T13:46:54Z)
- Growing Deep Neural Network Considering with Similarity between Neurons [4.32776344138537]
We explore a novel approach of progressively increasing neuron numbers in compact models during training phases.
We propose a method that reduces feature extraction biases and neuronal redundancy by introducing constraints based on neuron similarity distributions.
Results on the CIFAR-10 and CIFAR-100 datasets demonstrate accuracy improvements.
arXiv Detail & Related papers (2024-08-23T11:16:37Z)
- Neuron-Level Knowledge Attribution in Large Language Models [19.472889262384818]
We propose a static method for pinpointing significant neurons.
Compared to seven other methods, our approach demonstrates superior performance across three metrics.
We also apply our methods to analyze six types of knowledge across both attention and feed-forward network layers.
arXiv Detail & Related papers (2023-12-19T13:23:18Z)
- Training Feedback Spiking Neural Networks by Implicit Differentiation on the Equilibrium State [66.2457134675891]
Spiking neural networks (SNNs) are brain-inspired models that enable energy-efficient implementation on neuromorphic hardware.
Most existing methods imitate the backpropagation framework and feedforward architectures for artificial neural networks.
We propose a novel training method that does not rely on the exact reverse of the forward computation.
arXiv Detail & Related papers (2021-09-29T07:46:54Z)
- Dynamic Neural Diversification: Path to Computationally Sustainable Neural Networks [68.8204255655161]
Small neural networks with a constrained number of trainable parameters can be suitable resource-efficient candidates for many simple tasks.
We explore the diversity of the neurons within the hidden layer during the learning process.
We analyze how the diversity of the neurons affects predictions of the model.
arXiv Detail & Related papers (2021-09-20T15:12:16Z)
- And/or trade-off in artificial neurons: impact on adversarial robustness [91.3755431537592]
The presence of a sufficient number of OR-like neurons in a network can lead to classification brittleness and increased vulnerability to adversarial attacks.
We define AND-like neurons and propose measures to increase their proportion in the network.
Experimental results on the MNIST dataset suggest that our approach holds promise as a direction for further exploration.
arXiv Detail & Related papers (2021-02-15T08:19:05Z)
- The Neural Coding Framework for Learning Generative Models [91.0357317238509]
We propose a novel neural generative model inspired by the theory of predictive processing in the brain.
In a similar way, artificial neurons in our generative model predict what neighboring neurons will do, and adjust their parameters based on how well the predictions matched reality.
arXiv Detail & Related papers (2020-12-07T01:20:38Z)
- Factorized Neural Processes for Neural Processes: $K$-Shot Prediction of Neural Responses [9.792408261365043]
We develop a Factorized Neural Process to infer a neuron's tuning function from a small set of stimulus-response pairs.
We show on simulated responses that the predictions and reconstructed receptive fields from the Neural Process approach the ground truth as the number of trials increases.
We believe this novel deep learning systems identification framework will facilitate better real-time integration of artificial neural network modeling into neuroscience experiments.
arXiv Detail & Related papers (2020-10-22T15:43:59Z)
- Compositional Explanations of Neurons [52.71742655312625]
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts.
We use this procedure to answer several questions on interpretability in models for vision and natural language processing.
arXiv Detail & Related papers (2020-06-24T20:37:05Z)