Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment
- URL: http://arxiv.org/abs/2602.12134v1
- Date: Thu, 12 Feb 2026 16:21:22 GMT
- Title: Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment
- Authors: Jiajun Chen, Hua Shen
- Abstract summary: We introduce the Value Alignment Tax (VAT), a framework that measures how alignment-induced changes propagate across values. VAT captures the dynamics of value expression under alignment pressure. Our results show that alignment often produces uneven, structured co-movement among values.
- Score: 16.1422306417719
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing work on value alignment typically characterizes value relations statically, ignoring how interventions - such as prompting, fine-tuning, or preference optimization - reshape the broader value system. We introduce the Value Alignment Tax (VAT), a framework that measures how alignment-induced changes propagate across interconnected values relative to achieved on-target gain. VAT captures the dynamics of value expression under alignment pressure. Using a controlled scenario-action dataset grounded in Schwartz value theory, we collect paired pre-post normative judgments and analyze alignment effects across models, values, and alignment strategies. Our results show that alignment often produces uneven, structured co-movement among values. These effects are invisible under conventional target-only evaluation, revealing systemic, process-level alignment risks and offering new insights into the dynamics of value alignment in LLMs.
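The abstract describes VAT as relating off-target value changes to the achieved on-target gain, but it does not give a formula. The following is only an illustrative sketch under that reading, not the paper's definition: the function name `off_target_ratio`, the use of absolute differences, and the example scores are all assumptions made here for illustration.

```python
from typing import Dict


def off_target_ratio(pre: Dict[str, float], post: Dict[str, float], target: str) -> float:
    """Hypothetical metric: total absolute off-target shift per unit of on-target gain.

    `pre` and `post` map value names to expression scores before and after an
    alignment intervention. This is an assumed formulation, not the VAT definition.
    """
    gain = post[target] - pre[target]
    if gain <= 0:
        # Without a positive on-target gain the ratio is undefined in this sketch.
        raise ValueError("no positive on-target gain")
    # Sum the magnitude of change over every value other than the target.
    spill = sum(abs(post[v] - pre[v]) for v in pre if v != target)
    return spill / gain


# Made-up scores for three Schwartz values; "benevolence" is the alignment target.
pre = {"benevolence": 0.50, "power": 0.40, "security": 0.60}
post = {"benevolence": 0.70, "power": 0.30, "security": 0.65}

ratio = off_target_ratio(pre, post, "benevolence")
print(f"off-target shift per unit gain: {ratio:.2f}")
```

A target-only evaluation would report just the 0.20 gain on the target value; the ratio makes the accompanying movement in the other values visible.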
Related papers
- VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment [24.492954219955788]
We propose a closed-loop framework designed to navigate the trade-off between fine-tuning and aligning Large Language Models (LLMs). VISA features a high-precision value detector, a semantic-to-value translator, and a core value-rewriter. Our experiments demonstrate that this approach enables precise control over a model's value expression while maintaining its factual consistency and general capabilities.
arXiv Detail & Related papers (2026-03-05T05:12:26Z)
- Controllable Value Alignment in Large Language Models through Neuron-Level Editing [87.83756695719667]
We propose NeVA, a neuron-level editing framework for controllable value alignment in large language models. NeVA achieves stronger target value alignment while incurring smaller performance degradation on general capability. NeVA significantly reduces the average leakage, with residual effects largely confined to semantically related value classes.
arXiv Detail & Related papers (2026-02-07T04:35:16Z)
- Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics [81.80010043113445]
Local weight fine-tuning, LoRA-based adaptation, and activation-based interventions are studied in isolation. We present a unified view that frames these interventions as dynamic weight updates induced by a control signal. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility.
arXiv Detail & Related papers (2026-02-02T17:04:36Z)
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
- The Realignment Problem: When Right becomes Wrong in LLMs [6.8304813545377]
The alignment of Large Language Models with human values is central to their safe deployment, yet current models fail to keep pace with evolving norms and policies. Existing unlearning methods act as blunt instruments that erode utility rather than enable precise policy updates. We introduce TRACE, a framework for principled unlearning that reconceives realignment as a programmatic policy problem.
arXiv Detail & Related papers (2025-11-04T14:52:58Z)
- NPO: Learning Alignment and Meta-Alignment through Structured Human Feedback [0.0]
We present NPO, an alignment-aware learning framework that operationalizes feedback-driven adaptation in human-in-the-loop decision systems. NPO introduces a formalization of alignment loss that is measurable, supervisable, and reducible under structured feedback.
arXiv Detail & Related papers (2025-07-22T11:23:18Z)
- Internal Value Alignment in Large Language Models through Controlled Value Vector Activation [70.41805604556058]
We introduce a Controlled Value Vector Activation (ConVA) method to align Large Language Models (LLMs) with human values. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency.
arXiv Detail & Related papers (2025-07-15T13:48:35Z)
- Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values? [11.490681551032502]
"Value-Action Gap" reveals discrepancies between individuals' stated values and their actions in real-world contexts. This study introduces ValueActionLens, an evaluation framework to assess the alignment between LLMs' stated values and their value-informed actions.
arXiv Detail & Related papers (2025-01-26T09:33:51Z)
- Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z)
- Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
Large Language Models (LLMs) have made it crucial to align their values with those of humans.
We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.