Understanding Gated Neurons in Transformers from Their Input-Output Functionality
- URL: http://arxiv.org/abs/2505.17936v1
- Date: Fri, 23 May 2025 14:14:17 GMT
- Title: Understanding Gated Neurons in Transformers from Their Input-Output Functionality
- Authors: Sebastian Gerstner, Hinrich Schütze
- Abstract summary: We look at the cosine similarity between input and output weights of a neuron. We find that enrichment neurons dominate in early-middle layers whereas later layers tend more towards depletion.
- Score: 48.91500104957796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interpretability researchers have attempted to understand MLP neurons of language models based on both the contexts in which they activate and their output weight vectors. They have paid little attention to a complementary aspect: the interactions between input and output. For example, when neurons detect a direction in the input, they might add much the same direction to the residual stream ("enrichment neurons") or reduce its presence ("depletion neurons"). We address this aspect by examining the cosine similarity between input and output weights of a neuron. We apply our method to 12 models and find that enrichment neurons dominate in early-middle layers whereas later layers tend more towards depletion. To explain this finding, we argue that enrichment neurons are largely responsible for enriching concept representations, one of the first steps of factual recall. Our input-output perspective is a complement to activation-dependent analyses and to approaches that treat input and output separately.
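To make the core measurement concrete, here is a minimal sketch (not the authors' code) that computes the per-neuron cosine similarity between MLP input and output weights in GPT-2, one of the simpler architectures. For gated models such as LLaMA, the input weight would instead come from the gate/up projection rows, and the ±0.5 thresholds below are illustrative rather than the paper's definitions.
```python
# Minimal sketch: per-neuron cosine similarity between MLP input and
# output weights in GPT-2 (assumption: Hugging Face transformers).
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

for layer_idx, block in enumerate(model.transformer.h):
    # GPT-2 stores MLP weights as Conv1D: c_fc.weight is (d_model, d_mlp),
    # so column i is neuron i's input weight; c_proj.weight is
    # (d_mlp, d_model), so row i is neuron i's output weight.
    w_in = block.mlp.c_fc.weight.T    # (d_mlp, d_model)
    w_out = block.mlp.c_proj.weight   # (d_mlp, d_model)
    cos = torch.nn.functional.cosine_similarity(w_in, w_out, dim=1)
    # Illustrative thresholds; the paper's exact enrichment/depletion
    # criteria may differ.
    frac_enrich = (cos > 0.5).float().mean().item()
    frac_deplete = (cos < -0.5).float().mean().item()
    print(f"layer {layer_idx:2d}: mean cos = {cos.mean().item():+.3f}, "
          f"enrichment-like = {frac_enrich:.1%}, depletion-like = {frac_deplete:.1%}")
```
Under the paper's finding, the mean cosine should skew positive in early-middle layers and shift towards negative values in later layers.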
Related papers
- NeurFlow: Interpreting Neural Networks through Neuron Groups and Functional Interactions [16.00223741620103]
We propose a novel framework that transitions the focus from analyzing individual neurons to investigating groups of neurons. Our automated framework, NeurFlow, first identifies core neurons and clusters them into groups based on shared functional relationships.
arXiv Detail & Related papers (2025-02-22T06:01:03Z) - Modeling Dynamic Neural Activity by combining Naturalistic Video Stimuli and Stimulus-independent Latent Factors [5.967290675400836]
We propose a probabilistic model that predicts the joint distribution of the neuronal responses from video stimuli and stimulus-independent latent factors. We find that it outperforms video-only models in terms of log-likelihood and achieves improvements in likelihood and correlation when conditioned on responses from other neurons.
arXiv Detail & Related papers (2024-10-21T16:01:39Z) - Confidence Regulation Neurons in Language Models [91.90337752432075]
This study investigates the mechanisms by which large language models represent and regulate uncertainty in next-token predictions.
Entropy neurons are characterized by an unusually high weight norm and influence the final layer normalization (LayerNorm) scale to effectively scale down the logits.
Token frequency neurons, which we describe here for the first time, boost or suppress each token's logit proportionally to its log frequency, thereby shifting the output distribution towards or away from the unigram distribution.
arXiv Detail & Related papers (2024-06-24T01:31:03Z) - Interpreting the Second-Order Effects of Neurons in CLIP [73.54377859089801]
- Interpreting the Second-Order Effects of Neurons in CLIP [73.54377859089801]
We interpret the function of individual neurons in CLIP by automatically describing them using text. We present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output. Our results indicate that an automated interpretation of neurons can be used for model deception and for introducing new model capabilities.
arXiv Detail & Related papers (2024-06-06T17:59:52Z) - Wasserstein Distances, Neuronal Entanglement, and Sparsity [32.403833231587846]
We study how disentanglement can be used to understand performance, particularly under weight sparsity. We show the existence of a small number of highly entangled "Wasserstein Neurons" in each linear layer of an LLM. Our framework separates each layer's inputs to create a mixture of experts where each neuron's output is computed by a mixture of neurons of lower Wasserstein distance.
arXiv Detail & Related papers (2024-05-24T17:51:39Z) - And/or trade-off in artificial neurons: impact on adversarial robustness [91.3755431537592]
- And/or trade-off in artificial neurons: impact on adversarial robustness [91.3755431537592]
The presence of a sufficient number of OR-like neurons in a network can lead to classification brittleness and increased vulnerability to adversarial attacks.
We define AND-like neurons and propose measures to increase their proportion in the network.
Experimental results on the MNIST dataset suggest that our approach holds promise as a direction for further exploration.
arXiv Detail & Related papers (2021-02-15T08:19:05Z) - Compositional Explanations of Neurons [52.71742655312625]
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts.
We use this procedure to answer several questions on interpretability in models for vision and natural language processing.
arXiv Detail & Related papers (2020-06-24T20:37:05Z) - Optimal Learning with Excitatory and Inhibitory synapses [91.3755431537592]
- Optimal Learning with Excitatory and Inhibitory synapses [91.3755431537592]
I study the problem of storing associations between analog signals in the presence of correlations.
I characterize the typical learning performance in terms of the power spectrum of random input and output processes.
arXiv Detail & Related papers (2020-05-25T18:25:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.