Related papers: Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment

URL: http://arxiv.org/abs/2602.12134v1
Date: Thu, 12 Feb 2026 16:21:22 GMT
Title: Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment
Authors: Jiajun Chen, Hua Shen
Abstract summary: We introduce the Value Alignment Tax (VAT), a framework that measures how alignment-induced changes propagate across values. VAT captures the dynamics of value expression under alignment pressure. Our results show that alignment often produces uneven, structured co-movement among values.
Score: 16.1422306417719
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing work on value alignment typically characterizes value relations statically, ignoring how interventions - such as prompting, fine-tuning, or preference optimization - reshape the broader value system. We introduce the Value Alignment Tax (VAT), a framework that measures how alignment-induced changes propagate across interconnected values relative to achieved on-target gain. VAT captures the dynamics of value expression under alignment pressure. Using a controlled scenario-action dataset grounded in Schwartz value theory, we collect paired pre-post normative judgments and analyze alignment effects across models, values, and alignment strategies. Our results show that alignment often produces uneven, structured co-movement among values. These effects are invisible under conventional target-only evaluation, revealing systemic, process-level alignment risks and offering new insights into the dynamics of value alignment in LLMs.

Related papers

VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment [24.492954219955788]
We propose a closed-loop framework designed to navigate the trade-off between fine-tuning and aligning Large Language Models (LLMs). VISA features a high-precision value detector, a semantic-to-value translator, and a core value-rewriter. Our experiments demonstrate that this approach enables precise control over a model's value expression while maintaining its factual consistency and general capabilities.
arXiv Detail & Related papers (2026-03-05T05:12:26Z)
Controllable Value Alignment in Large Language Models through Neuron-Level Editing [87.83756695719667]
We propose NeVA, a neuron-level editing framework for controllable value alignment in large language models. NeVA achieves stronger target value alignment while incurring smaller performance degradation on general capability. NeVA significantly reduces the average leakage, with residual effects largely confined to semantically related value classes.
arXiv Detail & Related papers (2026-02-07T04:35:16Z)
Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics [81.80010043113445]
Local weight fine-tuning, LoRA-based adaptation, and activation-based interventions are studied in isolation. We present a unified view that frames these interventions as dynamic weight updates induced by a control signal. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility.
arXiv Detail & Related papers (2026-02-02T17:04:36Z)
MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
The Realignment Problem: When Right becomes Wrong in LLMs [6.8304813545377]
The alignment of Large Language Models with human values is central to their safe deployment, yet current models fail to keep pace with evolving norms and policies. Existing unlearning methods act as blunt instruments that erode utility rather than enable precise policy updates. We introduce TRACE, a framework for principled unlearning that reconceives realignment as a programmatic policy problem.
arXiv Detail & Related papers (2025-11-04T14:52:58Z)
NPO: Learning Alignment and Meta-Alignment through Structured Human Feedback [0.0]
We present NPO, an alignment-aware learning framework that operationalizes feedback-driven adaptation in human-in-the-loop decision systems. NPO introduces a formalization of alignment loss that is measurable, supervisable, and reducible under structured feedback.
arXiv Detail & Related papers (2025-07-22T11:23:18Z)
Internal Value Alignment in Large Language Models through Controlled Value Vector Activation [70.41805604556058]
We introduce a Controlled Value Vector Activation (ConVA) method to align Large Language Models (LLMs) with human values. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency.
arXiv Detail & Related papers (2025-07-15T13:48:35Z)
Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values? [11.490681551032502]
"Value-Action Gap" reveals discrepancies between individuals' stated values and their actions in real-world contexts. This study introduces ValueActionLens, an evaluation framework to assess the alignment between LLMs' stated values and their value-informed actions.
arXiv Detail & Related papers (2025-01-26T09:33:51Z)
Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values. Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z)
Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
Large Language Models (LLMs) have made it crucial to align their values with those of humans. We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.