Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs
- URL: http://arxiv.org/abs/2504.04994v2
- Date: Sun, 20 Apr 2025 13:04:42 GMT
- Title: Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs
- Authors: Ling Hu, Yuemei Xu, Xiaoyang Gu, Letao Han
- Abstract summary: We propose a novel framework called ValueExploration to explore the behavior-driven mechanisms of National Social Values within large language models. We first identify and locate the neurons responsible for encoding Chinese Social Values in large language models. By deactivating these neurons, we analyze shifts in model behavior, uncovering the internal mechanism by which values influence LLM decision-making.
- Score: 2.761261381839981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the impressive performance of large language models (LLMs), they can present unintended biases and harmful behaviors driven by encoded values, emphasizing the urgent need to understand the value mechanisms behind them. However, current research primarily evaluates these values through external responses with a focus on AI safety, lacking interpretability and failing to assess social values in real-world contexts. In this paper, we propose a novel framework called ValueExploration, which aims to explore the behavior-driven mechanisms of National Social Values within LLMs at the neuron level. As a case study, we focus on Chinese Social Values and first construct C-voice, a large-scale bilingual benchmark for identifying and evaluating Chinese Social Values in LLMs. By leveraging C-voice, we then identify and locate the neurons responsible for encoding these values according to activation difference. Finally, by deactivating these neurons, we analyze shifts in model behavior, uncovering the internal mechanism by which values influence LLM decision-making. Extensive experiments on four representative LLMs validate the efficacy of our framework. The benchmark and code will be available.
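To make the pipeline described in the abstract more concrete, below is a minimal, hypothetical sketch of the two key steps: locating neurons by comparing mean activations on value-laden versus neutral prompts, then deactivating the located neurons at inference time. The toy two-layer network, the synthetic "prompt" tensors, and the top-k selection rule are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer MLP block; its ReLU outputs play the role of "neurons".
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
hidden_layer = model[1]

def mean_activation(inputs: torch.Tensor) -> torch.Tensor:
    """Average hidden activation over a batch of (toy) encoded prompts."""
    acts = []
    handle = hidden_layer.register_forward_hook(
        lambda _m, _inp, out: acts.append(out.detach())
    )
    with torch.no_grad():
        model(inputs)
    handle.remove()
    return torch.cat(acts).mean(dim=0)

# Stand-ins for encoded value-laden vs. value-neutral prompts (assumed, not real data).
value_prompts = torch.randn(32, 16) + 0.5
neutral_prompts = torch.randn(32, 16)

# "Locate" value neurons by activation difference, a simplified version of the paper's criterion.
activation_diff = mean_activation(value_prompts) - mean_activation(neutral_prompts)
value_neurons = torch.topk(activation_diff, k=8).indices

# Deactivate the located neurons with a forward hook and re-run the model.
def deactivate(_module, _inputs, output):
    output = output.clone()
    output[:, value_neurons] = 0.0
    return output

hook = hidden_layer.register_forward_hook(deactivate)
with torch.no_grad():
    ablated_output = model(value_prompts)  # model behavior with value neurons switched off
hook.remove()
```

In an actual LLM, the same hooks would presumably be attached to the MLP activations of each transformer layer, the value-laden prompts would be drawn from a benchmark such as C-voice, and the resulting behavioral shift would be measured on downstream decision-making tasks.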
Related papers
- Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models [13.513813405118478]
Large Language Models (LLMs) have raised concerns regarding their elusive intrinsic values. This study addresses the gap by introducing the Generative Psycho-Lexical Approach (GPLA). We propose a psychologically grounded five-factor value system tailored for LLMs.
arXiv Detail & Related papers (2025-02-04T16:10:55Z) - Value Compass Leaderboard: A Platform for Fundamental and Validated Evaluation of LLMs Values [76.70893269183684]
As Large Language Models (LLMs) achieve remarkable breakthroughs, aligning their values with humans has become imperative. Existing evaluations focus narrowly on safety risks such as bias and toxicity. Existing benchmarks are prone to data contamination. The pluralistic nature of human values across individuals and cultures is largely ignored in measuring LLMs' value alignment.
arXiv Detail & Related papers (2025-01-13T05:53:56Z) - Beyond Human Norms: Unveiling Unique Values of Large Language Models through Interdisciplinary Approaches [69.73783026870998]
This work proposes a novel framework, ValueLex, to reconstruct Large Language Models' unique value system from scratch.
Based on Lexical Hypothesis, ValueLex introduces a generative approach to elicit diverse values from 30+ LLMs.
We identify three core value dimensions, Competence, Character, and Integrity, each with specific subdimensions, revealing that LLMs possess a structured, albeit non-human, value system.
arXiv Detail & Related papers (2024-04-19T09:44:51Z) - Evaluating Interventional Reasoning Capabilities of Large Language Models [58.52919374786108]
Large language models (LLMs) are used to automate decision-making tasks.
In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention.
We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types.
These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from interventions, as distinct from their ability to memorize facts or find other shortcuts.
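As a rough illustration of the kind of interventional query such a benchmark poses, the sketch below simulates a confounded data-generating process (Z -> X, Z -> Y, X -> Y) and contrasts the observational association with the effect of an intervention do(X := x). The variable names and coefficients are assumptions for illustration, not the paper's benchmark.

```python
import numpy as np
from typing import Optional

rng = np.random.default_rng(0)
n = 100_000

def sample(intervene_x: Optional[float] = None) -> tuple[np.ndarray, np.ndarray]:
    """Draw from the (assumed) DGP Z -> X, Z -> Y, X -> Y; do(X := x) cuts the Z -> X edge."""
    z = rng.normal(size=n)                                    # confounder
    x = z + rng.normal(size=n) if intervene_x is None else np.full(n, intervene_x)
    y = 2.0 * x + 3.0 * z + rng.normal(size=n)                # true causal effect of X on Y is 2.0
    return x, y

# Observational association overstates the causal effect because Z confounds X and Y.
x_obs, y_obs = sample()
naive_slope = np.cov(x_obs, y_obs)[0, 1] / np.var(x_obs)     # comes out near 3.5 here

# Interventional contrast E[Y | do(X=1)] - E[Y | do(X=0)] recovers the true effect (~2.0).
_, y_do1 = sample(intervene_x=1.0)
_, y_do0 = sample(intervene_x=0.0)
causal_effect = y_do1.mean() - y_do0.mean()

print(f"naive slope ~ {naive_slope:.2f}, interventional effect ~ {causal_effect:.2f}")
```

An LLM that merely memorizes the observational correlation would report the inflated slope, whereas correct interventional reasoning yields the smaller causal effect.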
arXiv Detail & Related papers (2024-04-08T14:15:56Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility [62.74405775089802]
We present CValues, the first Chinese human values evaluation benchmark to measure the alignment ability of LLMs.
Specifically, we manually collected adversarial safety prompts across 10 scenarios and induced responsibility prompts from 8 domains.
Our findings suggest that while most Chinese LLMs perform well in terms of safety, there is considerable room for improvement in terms of responsibility.
arXiv Detail & Related papers (2023-07-19T01:22:40Z) - Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
The emergence of Large Language Models (LLMs) has made it crucial to align their values with those of humans.
We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z)