Counterfactual Reasoning for Steerable Pluralistic Value Alignment of Large Language Models
- URL: http://arxiv.org/abs/2510.18526v1
- Date: Tue, 21 Oct 2025 11:12:45 GMT
- Title: Counterfactual Reasoning for Steerable Pluralistic Value Alignment of Large Language Models
- Authors: Hanze Guo, Jing Yao, Xiao Zhou, Xiaoyuan Yi, Xing Xie
- Abstract summary: COUPLE is a COUnterfactual reasoning framework for PLuralistic valuE alignment. It introduces a structural causal model to capture the complex interdependency and prioritization among values, as well as the causal relationship between high-level value dimensions and behaviors. Benefiting from explicit causal modeling, COUPLE also provides better interpretability.
- Score: 43.01088871836861
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As large language models (LLMs) become increasingly integrated into applications serving users across diverse cultures, communities and demographics, it is critical to align LLMs with pluralistic human values beyond average principles (e.g., HHH: helpfulness, honesty, harmlessness). In psychological and social value theories such as Schwartz's Value Theory, pluralistic values are represented by multiple value dimensions paired with various priorities. However, existing methods encounter two challenges when aligning with such fine-grained value objectives: 1) they often treat multiple values as independent and equally important, ignoring their interdependence and relative priorities (value complexity); 2) they struggle to precisely control nuanced value priorities, especially underrepresented ones (value steerability). To handle these challenges, we propose COUPLE, a COUnterfactual reasoning framework for PLuralistic valuE alignment. It introduces a structural causal model (SCM) to capture the complex interdependency and prioritization among values, as well as the causal relationship between high-level value dimensions and behaviors. Moreover, it applies counterfactual reasoning to generate outputs aligned with any desired value objective. Benefiting from explicit causal modeling, COUPLE also provides better interpretability. We evaluate COUPLE on two datasets with different value systems and demonstrate that COUPLE outperforms other baselines across diverse types of value objectives.
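The mechanism the abstract describes (an SCM over interdependent value dimensions, plus abduction-action-prediction counterfactual reasoning to reach a target value profile) can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not COUPLE's actual model: the two Schwartz-style dimensions, the linear structural equations, and all coefficients are invented for exposition.

```python
# Illustrative toy SCM: two interdependent Schwartz-style value dimensions
# drive a behavior score. All structure and coefficients are assumptions.
COEF_T = 0.6          # tradition depends on benevolence
COEF_Y = (0.8, -0.3)  # behavior depends on (benevolence, tradition)

def abduct(obs):
    """Abduction: recover exogenous noise terms consistent with an observation."""
    u_t = obs["tradition"] - COEF_T * obs["benevolence"]
    w_b, w_t = COEF_Y
    u_y = obs["behavior"] - w_b * obs["benevolence"] - w_t * obs["tradition"]
    return u_t, u_y

def counterfactual(obs, benevolence_target):
    """Action + prediction: intervene on one value dimension, then propagate
    through the structural equations to get the counterfactual behavior."""
    u_t, u_y = abduct(obs)
    b = benevolence_target          # do(benevolence := target priority)
    t = COEF_T * b + u_t            # downstream value dimension shifts too
    w_b, w_t = COEF_Y
    y = w_b * b + w_t * t + u_y
    return {"benevolence": b, "tradition": t, "behavior": y}

observed = {"benevolence": 0.2, "tradition": 0.5, "behavior": 0.1}
print(counterfactual(observed, benevolence_target=0.9))
```

The design point is the one the abstract makes: because value dimensions are causally linked, steering one dimension means propagating the intervention through the graph, not adjusting independent, equally weighted knobs.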
Related papers
- Learning the Value Systems of Societies with Preference-based Multi-objective Reinforcement Learning [4.735670734773144]
Value-aware AI should recognise human values and adapt to the value systems (value-based preferences) of different users.
We propose algorithms for learning models of value alignment and value systems for a society of agents.
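As a rough illustration of this preference-based multi-objective framing (the paper's own algorithms are not reproduced here), one simple encoding of a user's value system is a priority-weight vector over per-value rewards, fitted from pairwise preferences with a Bradley-Terry model; everything below is an illustrative assumption.

```python
import numpy as np

def fit_value_weights(pairs, n_values, lr=0.1, steps=500):
    """pairs: list of (r_preferred, r_rejected) multi-objective reward
    vectors, where the user preferred the first trajectory.
    Returns a unit-norm priority vector over the value dimensions."""
    w = np.zeros(n_values)
    for _ in range(steps):
        for r_win, r_lose in pairs:
            diff = np.asarray(r_win, dtype=float) - np.asarray(r_lose, dtype=float)
            p_win = 1.0 / (1.0 + np.exp(-w @ diff))  # Bradley-Terry probability
            w += lr * (1.0 - p_win) * diff           # log-likelihood gradient step
    return w / np.linalg.norm(w)

# Two value dimensions, e.g. (fairness, efficiency): this user consistently
# prefers the fairer outcome, so the learned priorities tilt toward dim 0.
prefs = [([0.9, 0.2], [0.3, 0.8]), ([0.7, 0.1], [0.2, 0.9])]
print(fit_value_weights(prefs, n_values=2))
```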
arXiv Detail & Related papers (2026-02-09T16:06:36Z)
- Growth First, Care Second? Tracing the Landscape of LLM Value Preferences in Everyday Dilemmas [5.1141034187487175]
We examine the value trade-off structure underlying advice seeking using a curated dataset from four advice-oriented subreddits.
We construct value co-occurrence networks to characterize how values co-occur within dilemmas.
We find that, across models and contexts, LLMs consistently prioritize values related to Exploration & Growth over Benevolence & Connection.
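A value co-occurrence network of the kind described can be built in a few lines: nodes are values, and an edge's weight counts how often two values appear in the same dilemma. The dilemma annotations below are invented placeholders.

```python
# Minimal sketch of a value co-occurrence network, assuming each dilemma is
# annotated with the set of values at stake (labels here are illustrative).
from collections import Counter
from itertools import combinations

dilemmas = [
    {"Exploration & Growth", "Benevolence & Connection"},
    {"Exploration & Growth", "Security"},
    {"Benevolence & Connection", "Security"},
    {"Exploration & Growth", "Benevolence & Connection", "Security"},
]

edges = Counter()
for values in dilemmas:
    # every unordered pair of values in a dilemma adds one co-occurrence
    for a, b in combinations(sorted(values), 2):
        edges[(a, b)] += 1

for (a, b), weight in edges.most_common():
    print(f"{a} -- {b}: {weight}")
```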
arXiv Detail & Related papers (2026-02-04T11:41:27Z)
- Rethinking How AI Embeds and Adapts to Human Values: Challenges and Opportunities [0.6113558800822273]
We argue that AI systems should implement long-term reasoning and remain adaptable to evolving values.
Value alignment requires more theories to address the full spectrum of human values.
We identify the challenges associated with value alignment and indicate directions for advancing value alignment research.
arXiv Detail & Related papers (2025-08-23T18:19:05Z)
- Evaluating AI Alignment in Eleven LLMs through Output-Based Analysis and Human Benchmarking [0.0]
Large language models (LLMs) are increasingly used in psychological research and practice, yet traditional benchmarks reveal little about the values they express in real interaction.
We introduce PAPERS, an output-based evaluation of the values LLMs express.
arXiv Detail & Related papers (2025-06-14T20:14:02Z)
- Preference Learning for AI Alignment: a Causal Perspective [55.2480439325792]
We frame this problem in a causal paradigm, providing the rich toolbox of causality to identify persistent challenges.
Drawing on the causal inference literature, we identify key assumptions necessary for reliable generalisation.
We illustrate failure modes of naive reward models and demonstrate how causally-inspired approaches can improve model robustness.
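One failure mode commonly raised in this causal framing is a naive reward model latching onto a spurious proxy (here, response length) that merely correlates with quality in the training preferences. The synthetic-data sketch below is an assumption for exposition, not the paper's exact experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
quality = rng.normal(size=n)                      # latent true quality
length = quality + rng.normal(scale=0.3, size=n)  # proxy correlated with quality
preferred = (quality + rng.normal(scale=0.5, size=n) > 0).astype(float)

# Naive reward model: logistic regression on length alone.
w = 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-w * length))
    w += 0.01 * float(np.mean((preferred - p) * length))

print(f"learned weight on length: {w:.2f}")
# The weight comes out clearly positive, so at deployment simply padding an
# answer raises its reward without raising its quality -- a causally wrong policy.
```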
arXiv Detail & Related papers (2025-06-06T10:45:42Z)
- CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives [3.7931130268412194]
CLASH is a dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values.
CLASH enables the study of critical yet underexplored aspects of value-based decision-making processes.
Even strong proprietary models, such as GPT-5 and Claude-4-Sonnet, struggle with ambivalent decisions.
arXiv Detail & Related papers (2025-04-15T02:54:16Z)
- Value FULCRA: Mapping Large Language Models to the Multidimensional Spectrum of Basic Human Values [47.779186412943076]
Inspired by basic values in humanity and social science across cultures, this work proposes a novel basic value alignment paradigm and a value space spanned by basic value dimensions.
To foster future research, we take Schwartz's Theory of Basic Values as a representative example and construct FULCRA, a dataset consisting of 5k (LLM output, value vector) pairs.
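The value-space idea can be made concrete: each LLM output is scored as a vector over Schwartz's ten basic values, and alignment with a target value profile becomes a geometric comparison. The output text, vector entries, and cosine measure below are illustrative assumptions, not FULCRA's actual annotation scheme or metric.

```python
import numpy as np

SCHWARTZ_DIMS = [
    "self-direction", "stimulation", "hedonism", "achievement", "power",
    "security", "conformity", "tradition", "benevolence", "universalism",
]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# One (LLM output, value vector) pair in the style of FULCRA (values invented).
output = "I would report the error even if it costs me the promotion."
value_vec = np.array([0.1, 0.0, 0.0, -0.3, -0.4, 0.2, 0.1, 0.0, 0.6, 0.7])

# Alignment against a target profile prioritizing benevolence/universalism.
target = np.zeros(len(SCHWARTZ_DIMS))
target[SCHWARTZ_DIMS.index("benevolence")] = 1.0
target[SCHWARTZ_DIMS.index("universalism")] = 1.0
print(f"alignment score: {cosine(value_vec, target):.3f}")
```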
arXiv Detail & Related papers (2023-11-15T10:29:28Z)
- Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties [68.66719970507273]
Value pluralism is the view that multiple correct values may be held in tension with one another.
As statistical learners, AI systems fit to averages by default, washing out potentially irreducible value conflicts.
We introduce ValuePrism, a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations.
arXiv Detail & Related papers (2023-09-02T01:24:59Z)
- From Instructions to Intrinsic Human Values -- A Survey of Alignment Goals for Big Models [48.326660953180145]
We conduct a survey of different alignment goals in existing work and trace their evolution paths to help identify the most essential goal.
Our analysis reveals a goal transformation from fundamental abilities to value orientation, indicating the potential of intrinsic human values as the alignment goal for enhanced LLMs.
arXiv Detail & Related papers (2023-08-23T09:11:13Z)
- Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
The emergent capabilities of Large Language Models (LLMs) have made it crucial to align their values with those of humans.
We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z)