Understanding How Value Neurons Shape the Generation of Specified Values in LLMs
- URL: http://arxiv.org/abs/2505.17712v1
- Date: Fri, 23 May 2025 10:30:09 GMT
- Title: Understanding How Value Neurons Shape the Generation of Specified Values in LLMs
- Authors: Yi Su, Jiayi Zhang, Shu Yang, Xinhai Wang, Lijie Hu, Di Wang
- Abstract summary: Integration of large language models into societal applications has intensified concerns about their alignment with universal ethical principles. Current approaches struggle to systematically interpret how values are encoded in neural architectures. We introduce ValueLocate, a mechanistic interpretability framework grounded in the Schwartz Values Survey.
- Score: 31.185636385067152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Rapid integration of large language models (LLMs) into societal applications has intensified concerns about their alignment with universal ethical principles, as their internal value representations remain opaque despite behavioral alignment advancements. Current approaches struggle to systematically interpret how values are encoded in neural architectures, limited by datasets that prioritize superficial judgments over mechanistic analysis. We introduce ValueLocate, a mechanistic interpretability framework grounded in the Schwartz Values Survey, to address this gap. Our method first constructs ValueInsight, a dataset that operationalizes four dimensions of universal value through behavioral contexts in the real world. Leveraging this dataset, we develop a neuron identification method that calculates activation differences between opposing value aspects, enabling precise localization of value-critical neurons without relying on computationally intensive attribution methods. Our proposed validation method demonstrates that targeted manipulation of these neurons effectively alters model value orientations, establishing causal relationships between neurons and value representations. This work advances the foundation for value alignment by bridging psychological value frameworks with neuron analysis in LLMs.
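The core neuron-identification step is easy to picture. Below is a minimal sketch of the activation-difference idea, assuming a GPT-2-style Hugging Face model; the `mlp.act` hook point and both behavioral prompt lists are illustrative stand-ins for the paper's ValueInsight contexts, and this is not the authors' released code.

```python
# Hedged sketch of activation-difference neuron localization, in the spirit of
# ValueLocate. Model choice, hook point, and prompts are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in for a larger LLM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_mlp_activations(prompts):
    """Average each MLP's hidden-neuron activations over a prompt set."""
    sums, handles = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            # output: (batch, seq, d_mlp) -> mean over batch and positions
            sums[name] = sums.get(name, 0.0) + output.detach().float().mean(dim=(0, 1))
        return hook

    for name, module in model.named_modules():
        if name.endswith("mlp.act"):                   # GPT-2's MLP nonlinearity
            handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        for p in prompts:
            model(**tok(p, return_tensors="pt"))
    for h in handles:
        h.remove()
    return {k: v / len(prompts) for k, v in sums.items()}

# Hypothetical behavioral contexts for one value aspect and its opposite.
pro = ["She reported the billing error although it cost her the bonus."]
anti = ["She hid the billing error so she would still get the bonus."]
pro_acts, anti_acts = mean_mlp_activations(pro), mean_mlp_activations(anti)
gaps = {name: (pro_acts[name] - anti_acts[name]).abs() for name in pro_acts}
# Neurons with the largest activation gap are candidate "value neurons".
```

The paper's validation step would then manipulate exactly these neurons and check whether generated text shifts its value orientation; a forward hook that overwrites the selected coordinates (as sketched under the ValueExploration entry below) is one way to realize that.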
Related papers
- Concept-Guided Interpretability via Neural Chunking [54.73787666584143]
We show that neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We propose three methods to extract these emerging entities, complementing each other based on label availability and dimensionality. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data.
arXiv Detail & Related papers (2025-05-16T13:49:43Z)
- Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs [2.761261381839981]
We propose a novel framework called ValueExploration to explore the behavior-driven mechanisms of Chinese Social Values within large language models. We first identify and locate the neurons responsible for encoding these values in large language models. By deactivating these neurons, we analyze shifts in model behavior, uncovering the internal mechanism by which values influence LLM decision-making.
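The deactivation step can be pictured with a generic hook-based sketch (not the paper's code); the layer name and neuron indices below are placeholders for whatever the identification stage returns.

```python
# Hedged sketch of neuron deactivation via PyTorch forward hooks, not the
# ValueExploration implementation. Layer name and indices are placeholders.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
value_neurons = {"transformer.h.5.mlp.act": [17, 903]}   # hypothetical indices

def deactivate(model, neurons):
    """Zero the selected hidden units whenever their layer runs."""
    handles = []
    for name, module in model.named_modules():
        if name in neurons:
            idx = torch.tensor(neurons[name])

            def hook(_module, _inputs, output, idx=idx):
                output = output.clone()
                output[..., idx] = 0.0    # silence the chosen neurons
                return output             # returned tensor replaces the output

            handles.append(module.register_forward_hook(hook))
    return handles                        # h.remove() on each restores the model
```

Generating with and without the hooks installed on the same prompts, then comparing value-laden choices, is the behavioral comparison the summary describes.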
arXiv Detail & Related papers (2025-04-07T12:23:59Z)
- Global Convergence and Rich Feature Learning in $L$-Layer Infinite-Width Neural Networks under $μ$P Parametrization [66.03821840425539]
In this paper, we investigate the training dynamics of $L$-layer neural networks using the tensor program (TP) framework. We show that stochastic gradient descent (SGD) enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum.
arXiv Detail & Related papers (2025-03-12T17:33:13Z)
- Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution [16.460751105639623]
We show that even highly salient neurons consistently exhibit polysemantic behavior. This observation motivates a shift from neuron attribution to range-based interpretation. We introduce NeuronLens, a novel range-based interpretation and manipulation framework.
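Range-based interpretation can be caricatured in a few lines. The sketch below uses synthetic activations and simple quantile intervals; NeuronLens's actual statistics and manipulation scheme are more involved, and the concept names are invented.

```python
# Toy illustration of range-based neuron interpretation (not the NeuronLens
# implementation). Activation samples per concept are synthetic.
import numpy as np

rng = np.random.default_rng(0)
acts = {                                  # one neuron's activations per concept
    "politeness": rng.normal(1.0, 0.2, 500),
    "toxicity": rng.normal(1.5, 0.2, 500),
}

# A concept's "range" = a central interval of its activation distribution.
ranges = {c: (np.quantile(a, 0.05), np.quantile(a, 0.95)) for c, a in acts.items()}

def concepts_for(activation):
    """Attribute an activation to every concept whose range contains it."""
    return [c for c, (lo, hi) in ranges.items() if lo <= activation <= hi]

print(concepts_for(1.25))   # overlapping ranges expose polysemantic behavior
```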
arXiv Detail & Related papers (2025-02-04T03:33:55Z)
- Neural network interpretability with layer-wise relevance propagation: novel techniques for neuron selection and visualization [0.49478969093606673]
We present a novel approach that improves the parsing of selected neurons during LRP backward propagation, using the Visual Geometry Group 16 (VGG16) architecture as a case study. Our approach enhances interpretability and supports the development of more transparent artificial intelligence (AI) systems for computer vision applications.
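As background, the standard epsilon rule that LRP applies to a single linear layer looks as follows; the paper's contribution concerns how neurons are selected and visualized on top of such rules, not the rule itself, and the weights below are made up.

```python
# Standard epsilon-rule LRP for one linear layer y = W @ x + b (background
# illustration only; weights and relevances are invented).
import numpy as np

def lrp_epsilon(W, b, x, relevance_out, eps=1e-6):
    """Redistribute output relevance onto the inputs of y = W @ x + b."""
    z = W @ x + b                                # forward pre-activations
    s = relevance_out / (z + eps * np.sign(z))   # stabilized contribution ratio
    return x * (W.T @ s)                         # relevance of each input

W = np.array([[1.0, -2.0], [0.5, 1.5]])
x = np.array([0.8, 0.3])
print(lrp_epsilon(W, np.zeros(2), x, relevance_out=np.array([1.0, 0.0])))
# Input relevances sum (up to eps) to the output relevance: conservation.
```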
arXiv Detail & Related papers (2024-12-07T15:49:14Z)
- Towards Utilising a Range of Neural Activations for Comprehending Representational Associations [0.6554326244334868]
We show that an approach to label intermediate representations in deep neural networks fails to capture valuable information about their behaviour.
We hypothesise that non-extremal level activations contain complex information worth investigating.
We use our findings to develop a method to curate data from mid-range logit samples for retraining to mitigate spurious correlations.
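The curation step amounts to a filter over model confidences; a toy version, with all thresholds and data illustrative rather than taken from the paper:

```python
# Sketch of selecting mid-range (non-extremal) samples for retraining. The
# quantile cutoffs and random logits are assumptions, not the paper's setup.
import numpy as np

def midrange_indices(logits, lo_q=0.3, hi_q=0.7):
    """Keep samples whose top logit is neither very low nor very high."""
    confidence = logits.max(axis=1)
    lo, hi = np.quantile(confidence, [lo_q, hi_q])
    return np.where((confidence >= lo) & (confidence <= hi))[0]

logits = np.random.default_rng(1).normal(size=(1000, 10))
keep = midrange_indices(logits)   # candidates for inspection and retraining
print(len(keep), "of", len(logits), "samples kept")
```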
arXiv Detail & Related papers (2024-11-15T07:54:14Z)
- TractGeoNet: A geometric deep learning framework for pointwise analysis of tract microstructure to predict language assessment performance [66.43360974979386]
We propose a geometric deep-learning-based framework, TractGeoNet, for performing regression using diffusion magnetic resonance imaging (dMRI) tractography.
To improve regression performance, we propose a novel loss function, the Paired-Siamese Regression loss.
We evaluate the effectiveness of the proposed method by predicting individual performance on two neuropsychological assessments of language.
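The loss can be read as "fit each score, and also fit the score difference within every subject pair." A plausible minimal form, an assumption about its shape rather than the TractGeoNet code:

```python
# Hedged sketch of a paired-Siamese regression loss: pointwise regression plus
# agreement on within-pair score differences (lam is an illustrative weight).
import torch
import torch.nn.functional as F

def paired_siamese_regression_loss(pred_a, pred_b, y_a, y_b, lam=1.0):
    pointwise = F.mse_loss(pred_a, y_a) + F.mse_loss(pred_b, y_b)
    pairwise = F.mse_loss(pred_a - pred_b, y_a - y_b)
    return pointwise + lam * pairwise
```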
arXiv Detail & Related papers (2023-07-08T14:10:37Z)
- Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
The emergence of Large Language Models (LLMs) has made it crucial to align their values with those of humans.
We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z)
- Overcoming the Domain Gap in Contrastive Learning of Neural Action Representations [60.47807856873544]
A fundamental goal in neuroscience is to understand the relationship between neural activity and behavior.
We generated a new multimodal dataset consisting of the spontaneous behaviors generated by fruit flies.
This dataset and our new set of augmentations promise to accelerate the application of self-supervised learning methods in neuroscience.
arXiv Detail & Related papers (2021-11-29T15:27:51Z)
- Interpreting Deep Neural Networks with Relative Sectional Propagation by Analyzing Comparative Gradients and Hostile Activations [37.11665902583138]
We propose a new attribution method, Relative Sectional Propagation (RSP), for decomposing the output predictions of Deep Neural Networks (DNNs).
We define a hostile factor as an element that interferes with finding the attributions of the target, and propagate it in a distinguishable way to overcome the non-suppressed nature of activated neurons.
Our method decomposes the predictions of DNNs with clearer class-discriminativeness and more detailed elucidation of activated neurons than conventional attribution methods.
arXiv Detail & Related papers (2020-12-07T03:11:07Z)
- Provably Efficient Neural Estimation of Structural Equation Model: An Adversarial Approach [144.21892195917758]
We study estimation in a class of generalized structural equation models (SEMs).
We formulate the linear operator equation as a min-max game, where both players are parameterized by neural networks (NNs), and learn the parameters of these networks using gradient descent.
For the first time we provide a tractable estimation procedure for SEMs based on NNs with provable convergence and without the need for sample splitting.
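The min-max formulation can be illustrated with a generic adversarial moment-matching loop. The regularized Lagrangian below is a common choice in this literature and only approximates the paper's objective; the data and architectures are toys.

```python
# Toy adversarial estimation loop: f is the structural function, u an
# adversarial test function enforcing E[u(x) * (y - f(x))] = 0.
import torch

f = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
u = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt_f = torch.optim.Adam(f.parameters(), lr=1e-3)
opt_u = torch.optim.Adam(u.parameters(), lr=1e-3)

x = torch.randn(512, 1)
y = 2.0 * x + 0.1 * torch.randn(512, 1)          # toy structural relation

for _ in range(2000):
    # Inner max: ascend the regularized Lagrangian of the moment condition.
    lag = (u(x) * (y - f(x)).detach()).mean() - 0.5 * (u(x) ** 2).mean()
    opt_u.zero_grad(); (-lag).backward(); opt_u.step()
    # Outer min: update f against the current test function.
    obj = (u(x).detach() * (y - f(x))).mean()
    opt_f.zero_grad(); obj.backward(); opt_f.step()
```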
arXiv Detail & Related papers (2020-07-02T17:55:47Z)