The Knowledge Microscope: Features as Better Analytical Lenses than Neurons
- URL: http://arxiv.org/abs/2502.12483v1
- Date: Tue, 18 Feb 2025 03:09:55 GMT
- Title: The Knowledge Microscope: Features as Better Analytical Lenses than Neurons
- Authors: Yuheng Chen, Pengfei Cao, Kang Liu, Jun Zhao
- Abstract summary: Previous studies primarily utilize neurons as units of analysis for understanding the mechanisms of factual knowledge in Language Models (LMs).
In this paper, we first conduct preliminary experiments to validate that Sparse Autoencoders (SAE) can effectively decompose neurons into features, which serve as alternative analytical units.
- Score: 15.883209651151155
- Abstract: Previous studies primarily utilize MLP neurons as units of analysis for understanding the mechanisms of factual knowledge in Language Models (LMs); however, neurons suffer from polysemanticity, leading to limited knowledge expression and poor interpretability. In this paper, we first conduct preliminary experiments to validate that Sparse Autoencoders (SAE) can effectively decompose neurons into features, which serve as alternative analytical units. With this established, our core findings reveal three key advantages of features over neurons: (1) Features exhibit stronger influence on knowledge expression and superior interpretability. (2) Features demonstrate enhanced monosemanticity, showing distinct activation patterns between related and unrelated facts. (3) Features achieve better privacy protection than neurons, demonstrated through our proposed FeatureEdit method, which significantly outperforms existing neuron-based approaches in erasing privacy-sensitive information from LMs. Code and dataset will be available.
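To make the feature-vs-neuron distinction concrete, the sketch below shows how a sparse autoencoder might decompose an MLP activation vector into an overcomplete set of sparse features, and how one feature could be ablated before re-decoding. The architecture, dimensions, and the `ablate_feature` helper are illustrative assumptions for this listing, not the paper's FeatureEdit implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: maps d_model neuron activations to an
    overcomplete set of n_features sparse features and back."""
    def __init__(self, d_model=768, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations):
        # ReLU encourages sparse, non-negative feature activations.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

def ablate_feature(sae, activations, feature_idx):
    """Hypothetical helper: zero out one feature and re-decode,
    analogous in spirit to erasing that feature's contribution."""
    features, _ = sae(activations)
    features[..., feature_idx] = 0.0
    return sae.decoder(features)

# Usage sketch
sae = SparseAutoencoder()
acts = torch.randn(1, 768)            # stand-in for MLP neuron activations
features, recon = sae(acts)
edited_acts = ablate_feature(sae, acts, feature_idx=123)
```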
Related papers
- Single-neuron deep generative model uncovers underlying physics of neuronal activity in Ca imaging data [0.0]
We propose a novel framework for single-neuron representation learning using autoregressive variational autoencoders (AVAEs).
Our approach embeds individual neurons' signals into a reduced-dimensional space without the need for spike inference algorithms.
The AVAE excels over traditional linear methods by generating more informative and discriminative latent representations.
arXiv Detail & Related papers (2025-01-24T16:33:52Z) - The unbearable lightness of Restricted Boltzmann Machines: Theoretical Insights and Biological Applications [0.0]
We review the role that activation functions, which describe the input-output relationship of single neurons in RBMs, play in the functionality of these models.
We discuss recent theoretical results on the benefits and limitations of different activation functions.
We also review applications to biological data analysis: neural data analysis, where RBM units are mostly taken to have sigmoid activation functions and binary units, and protein data analysis and immunology, where non-binary units and non-sigmoid activation functions have recently been shown to yield important insights into the data.
arXiv Detail & Related papers (2025-01-08T09:57:08Z) - Neuron Empirical Gradient: Discovering and Quantifying Neurons Global Linear Controllability [14.693407823048478]
Our study first investigates the numerical relationship between neuron activations and model output.
We introduce NeurGrad, an accurate and efficient method for computing neuron empirical gradient (NEG).
arXiv Detail & Related papers (2024-12-24T00:01:24Z) - Brain-like Functional Organization within Large Language Models [58.93629121400745]
The human brain has long inspired the pursuit of artificial intelligence (AI).
Recent neuroimaging studies provide compelling evidence of alignment between the computational representation of artificial neural networks (ANNs) and the neural responses of the human brain to stimuli.
In this study, we bridge this gap by directly coupling sub-groups of artificial neurons with functional brain networks (FBNs).
This framework links the artificial neuron sub-groups to FBNs, enabling the delineation of brain-like functional organization within large language models (LLMs).
arXiv Detail & Related papers (2024-10-25T13:15:17Z) - Neuron-based Personality Trait Induction in Large Language Models [115.08894603023712]
Large language models (LLMs) have become increasingly proficient at simulating various personality traits.
We present a neuron-based approach for personality trait induction in LLMs.
arXiv Detail & Related papers (2024-10-16T07:47:45Z) - A generative framework to bridge data-driven models and scientific theories in language neuroscience [84.76462599023802]
We present generative explanation-mediated validation, a framework for generating concise explanations of language selectivity in the brain.
We show that explanatory accuracy is closely related to the predictive power and stability of the underlying statistical models.
arXiv Detail & Related papers (2024-10-01T15:57:48Z) - Identification of Knowledge Neurons in Protein Language Models [0.0]
We identify and characterize knowledge neurons, components that express understanding of key information.
We show that there is a high density of knowledge neurons in the key vector prediction networks of self-attention modules.
In the future, the types of knowledge captured by each neuron could be characterized.
arXiv Detail & Related papers (2023-12-17T17:23:43Z) - Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers [24.936419036304855]
We propose a novel method to identify key neurons for interpretability.
Our method improves on conventional work in efficiency and applicable range by removing the need for costly gradient computation.
Based on the identified neurons, we further design a multi-modal knowledge editing method, which helps mitigate sensitive words and hallucinations.
arXiv Detail & Related papers (2023-11-13T17:03:02Z) - Overcoming the Domain Gap in Contrastive Learning of Neural Action Representations [60.47807856873544]
A fundamental goal in neuroscience is to understand the relationship between neural activity and behavior.
We generated a new multimodal dataset consisting of spontaneous behaviors exhibited by fruit flies.
This dataset and our new set of augmentations promise to accelerate the application of self-supervised learning methods in neuroscience.
arXiv Detail & Related papers (2021-11-29T15:27:51Z) - CogAlign: Learning to Align Textual Neural Representations to Cognitive Language Processing Signals [60.921888445317705]
We propose a CogAlign approach to integrate cognitive language processing signals into natural language processing models.
We show that CogAlign achieves significant improvements with multiple cognitive features over state-of-the-art models on public datasets.
arXiv Detail & Related papers (2021-06-10T07:10:25Z) - Compositional Explanations of Neurons [52.71742655312625]
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts.
We use this procedure to answer several questions on interpretability in models for vision and natural language processing.
arXiv Detail & Related papers (2020-06-24T20:37:05Z)