Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation
- URL: http://arxiv.org/abs/2511.17579v1
- Date: Sat, 15 Nov 2025 13:33:26 GMT
- Title: Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation
- Authors: Hefei Xu, Le Wu, Chen Cheng, Hao Liu
- Abstract summary: We propose a novel framework called Multi-Value Alignment (MVA). It mitigates alignment degradation caused by parameter interference among diverse human values by minimizing their mutual information. MVA consistently outperforms existing baselines in aligning LLMs with multiple human values.
- Score: 23.41040153806061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement of large language models (LLMs), aligning them with human values for safety and ethics has become a critical challenge. This problem is especially challenging when multiple, potentially conflicting human values must be considered and balanced. Although several variants of existing alignment methods (such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO)) have been proposed to address multi-value alignment, they suffer from notable limitations: 1) they are often unstable and inefficient in multi-value optimization; and 2) they fail to effectively handle value conflicts. As a result, these approaches typically struggle to achieve optimal trade-offs when aligning multiple values. To address this challenge, we propose a novel framework called Multi-Value Alignment (MVA). It mitigates alignment degradation caused by parameter interference among diverse human values by minimizing their mutual information. Furthermore, we propose a value extrapolation strategy to efficiently explore the Pareto frontier, thereby constructing a set of LLMs with diverse value preferences. Extensive experiments demonstrate that MVA consistently outperforms existing baselines in aligning LLMs with multiple human values.
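The abstract refers to exploring the Pareto frontier over multiple values. As a minimal illustration of that concept only (not the MVA method, whose details are not given here), the following Python sketch keeps the candidates that no other candidate dominates. The function names, the two value dimensions, and the scores are all hypothetical, and higher scores are assumed to be better.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` on every value and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Return indices of points that no other point dominates (higher is better)."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical (helpfulness, harmlessness) scores for four candidate models.
scores = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.2, 0.9)]
front = pareto_front(scores)
print(front)  # indices of the non-dominated candidates
```

In this toy data the third candidate is dominated by the second on both values, so it drops out; the remaining candidates each represent a different trade-off between the two values.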
Related papers
- Reward-free Alignment for Conflicting Objectives [12.275610380458119]
  We propose a Reward-free Alignment framework for Conflicted Objectives (RACO). RACO directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting.
  (2026-02-02T18:59:52Z)
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
  Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
  (2026-01-12T05:02:48Z)
- Pareto Multi-Objective Alignment for Language Models [7.9051473654430655]
  Large language models (LLMs) are increasingly deployed in real-world applications that require careful balancing of multiple, often conflicting, objectives. We propose Pareto Multi-Objective Alignment (PAMA), a principled and computationally efficient algorithm designed explicitly for multi-objective alignment (MOA) in LLMs. PAMA transforms multi-objective RLHF into a convex optimization with a closed-form solution, significantly enhancing scalability.
  (2025-08-11T08:54:14Z)
- PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization [33.60097751620483]
  In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, an approach known as In-Context Alignment (ICA). LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions. We propose PICACO, a novel pluralistic ICA method to address this problem.
  (2025-07-22T15:14:56Z)
- Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time [52.230936493691985]
  We propose SITAlign, an inference framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach.
  (2025-05-29T17:56:05Z)
- UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality [52.49062565901046]
  Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models with human values. Existing approaches struggle to capture the multi-dimensional, distributional nuances of human preferences. We introduce Utility-Conditioned Multi-Objective Alignment (UC-MOA), a novel framework that overcomes these limitations.
  (2025-03-10T09:52:42Z)
- Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
  Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
  (2024-12-18T16:14:59Z)
- MAP: Multi-Human-Value Alignment Palette [22.74688073469946]
  We develop a novel, first-principle approach called Multi-Human-Value Alignment Palette (MAP). MAP navigates the alignment across multiple human values in a structured and reliable way. We conduct a detailed theoretical analysis of MAP by quantifying the trade-offs between values, the sensitivity to constraints, the fundamental connection between multi-value alignment and sequential alignment, and proving that linear weighted rewards are sufficient for multi-value alignment.
  (2024-10-24T23:16:39Z)
- Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment [58.049113055986375]
  We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy. The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms. We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
  (2024-06-11T01:20:53Z)
- Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
  Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values. Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
  (2024-02-29T12:12:30Z)