VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment
- URL: http://arxiv.org/abs/2603.04822v1
- Date: Thu, 05 Mar 2026 05:12:26 GMT
- Title: VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment
- Authors: Jiawei Chen, Tianzhuo Yang, Guoxi Zhang, Jiaming Ji, Yaodong Yang, Juntao Dai
- Abstract summary: We propose VISA, a closed-loop framework designed to navigate the trade-off between fine-grained value alignment and preserving a model's original knowledge when fine-tuning Large Language Models (LLMs). VISA features a high-precision value detector, a semantic-to-value translator, and a core value-rewriter. Our experiments demonstrate that this approach enables precise control over a model's value expression while maintaining its factual consistency and general capabilities.
- Score: 24.492954219955788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning Large Language Models (LLMs) with nuanced human values remains a critical challenge, as existing methods like Reinforcement Learning from Human Feedback (RLHF) often handle only coarse-grained attributes. In practice, fine-tuning LLMs on task-specific datasets to optimize value alignment inevitably incurs an alignment tax: the model's pre-calibrated value system drifts significantly due to latent bias absorption from training data, while the fine-tuning process also causes severe hallucinations and semantic information loss in generated responses. To address this, we propose VISA (Value Injection via Shielded Adaptation), a closed-loop framework designed to navigate this trade-off. VISA's architecture features a high-precision value detector, a semantic-to-value translator, and a core value-rewriter. The value-rewriter is trained via Group Relative Policy Optimization (GRPO) with a composite reward function that simultaneously optimizes for fine-grained value precision and the preservation of semantic integrity. By learning an optimal policy to balance these competing objectives, VISA effectively mitigates the alignment tax while remaining faithful to the model's original knowledge. Our experiments demonstrate that this approach enables precise control over a model's value expression while maintaining its factual consistency and general capabilities, significantly outperforming both standard fine-tuning methods and prompting-based baselines, including GPT-4o.
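The abstract describes a value-rewriter trained with GRPO under a composite reward that balances fine-grained value precision against semantic preservation, but the paper's reward implementation is not given here. The sketch below is a minimal, hypothetical illustration of how such a composite reward and a group-relative advantage could be wired together; `value_precision_score` and `semantic_similarity_score` are placeholder hooks, and the weighting `alpha` is an assumed scalarization, not VISA's actual code.

```python
# Hypothetical sketch of a GRPO-style composite reward; the scoring hooks
# and the fixed weighting are illustrative assumptions, not VISA's method.
from typing import Callable, List
import statistics

def composite_reward(
    rewrite: str,
    original: str,
    target_value: str,
    value_precision_score: Callable[[str, str], float],     # placeholder for a value detector
    semantic_similarity_score: Callable[[str, str], float],  # placeholder, e.g. embedding cosine
    alpha: float = 0.5,
) -> float:
    """Blend value precision and semantic preservation into one scalar reward."""
    value_term = value_precision_score(rewrite, target_value)
    semantic_term = semantic_similarity_score(rewrite, original)
    return alpha * value_term + (1.0 - alpha) * semantic_term

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO's group baseline: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

In this toy form, `alpha` is the knob that trades value precision against semantic integrity; the abstract indicates VISA instead learns a policy that balances the two objectives rather than committing to one fixed scalarization.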
Related papers
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
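As a rough illustration of the dynamic scalarization idea in this summary, the hypothetical snippet below conditions the mixing weights over several reward components on the policy's terminal hidden state via a small "Conductor" module; the class layout and dimensions are assumptions, not the paper's implementation.

```python
# Illustrative only: a tiny "Conductor" mapping the terminal hidden state
# to softmax-normalized scalarization weights over K reward components.
import torch
import torch.nn as nn

class Conductor(nn.Module):
    def __init__(self, hidden_dim: int, num_rewards: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_rewards)

    def forward(self, terminal_hidden: torch.Tensor) -> torch.Tensor:
        # terminal_hidden: (batch, hidden_dim) -> weights: (batch, num_rewards)
        return torch.softmax(self.proj(terminal_hidden), dim=-1)

def scalarize(rewards: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    # rewards: (batch, num_rewards) -> one scalar reward per sample
    return (weights * rewards).sum(dim=-1)
```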
arXiv Detail & Related papers (2026-01-12T05:02:48Z) - Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values [53.72318444646282]
We propose Reinforcement Learning with Explicit Human Values (RLEV). RLEV aligns Large Language Model (LLM) optimization directly with quantifiable human value signals. We show RLEV consistently outperforms correctness-only baselines across multiple RL algorithms and model scales.
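The summary contrasts RLEV with "correctness-only" baselines, which suggests scaling the optimization signal by an explicit value attached to each question. The one-liner below is a guess at the spirit of that idea, not RLEV's actual reward definition.

```python
# Assumed form: weight a correctness-based reward by the human-assigned
# value of the question; illustrative, not the paper's exact formulation.
def value_weighted_reward(is_correct: bool, human_value: float) -> float:
    return human_value * (1.0 if is_correct else 0.0)
```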
arXiv Detail & Related papers (2025-10-23T04:15:22Z) - Internal Value Alignment in Large Language Models through Controlled Value Vector Activation [70.41805604556058]
We introduce a Controlled Value Vector Activation (ConVA) method to align Large Language Models (LLMs) with human values. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency.
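The "gated value vector activation" mentioned above can be pictured as adding a learned value direction to a hidden state, scaled by a gate. The sketch below captures that general mechanism under my own assumptions about shapes and parameterization; it is not ConVA's exact formulation.

```python
# Illustrative gated value-vector activation: h' = h + g(h) * v, where v is a
# learned direction for one value and g is a scalar gate per position.
import torch
import torch.nn as nn

class GatedValueActivation(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.value_vector = nn.Parameter(torch.zeros(hidden_dim))  # learned value direction
        self.gate = nn.Linear(hidden_dim, 1)                       # scalar gate from the hidden state

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_dim)
        g = torch.sigmoid(self.gate(hidden))       # (batch, seq, 1)
        return hidden + g * self.value_vector       # broadcast over hidden_dim
```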
arXiv Detail & Related papers (2025-07-15T13:48:35Z) - Evolutionary Guided Decoding: Iterative Value Refinement for LLMs [41.56764640311065]
Iterative Value Refinement is a novel framework designed to bridge this gap. It employs Value Exploration to provide a more comprehensive and robust training signal. Iterative Self-Refinement uses the improved value function from one iteration to guide the generation of higher-quality data.
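One generic way a value function guides generation and is then refined on the data it selects is sketched below; the sampling and training hooks are placeholders, and this loop is an assumed reading of the summary rather than the paper's algorithm.

```python
# Generic value-guided selection plus an iterative refit loop; purely illustrative.
from typing import Callable, List, Tuple

def value_guided_select(candidates: List[str], value_fn: Callable[[str], float]) -> str:
    """Return the candidate the current value function scores highest."""
    return max(candidates, key=value_fn)

def iterative_refinement(
    prompts: List[str],
    sample_fn: Callable[[str], List[str]],                       # placeholder: draw candidates per prompt
    value_fn: Callable[[str], float],                            # placeholder: current value function
    train_value_fn: Callable[[List[str]], Callable[[str], float]],  # placeholder: refit on selected data
    rounds: int = 3,
) -> Tuple[Callable[[str], float], List[str]]:
    """Each round: keep value-best samples, then refit the value function on them."""
    data: List[str] = []
    for _ in range(rounds):
        best = [value_guided_select(sample_fn(p), value_fn) for p in prompts]
        data.extend(best)
        value_fn = train_value_fn(data)  # the improved value guides the next round
    return value_fn, data
```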
arXiv Detail & Related papers (2025-03-04T07:49:10Z) - Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance [52.65461207786633]
Policy-based Reinforcement Learning from Human Feedback is essential for aligning large language models with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. We propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM).
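The summary says DVPO drops the jointly trained critic and instead takes token-level guidance from a pretrained, frozen global value model. The snippet below shows one plausible way advantages could be derived from such per-token value estimates; the one-step value-difference form is my assumption, not the paper's definition.

```python
# Assumed pattern: token-level guidance from a frozen value model's per-token
# value estimates, with no critic trained alongside the policy.
import torch

@torch.no_grad()
def advantages_from_value_model(token_values: torch.Tensor) -> torch.Tensor:
    """token_values: (batch, seq) values from a frozen global value model.
    One simple choice of token-level signal: one-step value differences."""
    next_values = torch.cat(
        [token_values[:, 1:], torch.zeros_like(token_values[:, :1])], dim=1
    )
    return next_values - token_values
```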
arXiv Detail & Related papers (2025-02-24T08:11:33Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [80.32171988565999]
We introduce value-incentivized preference optimization (VPO), a unified approach to online and offline RLHF. VPO regularizes the maximum-likelihood estimate of the reward function with the corresponding value function. Experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
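Read literally, "regularizes the maximum-likelihood estimate of the reward function with the corresponding value function" suggests an objective of roughly the following shape; the signs and notation are my paraphrase of that sentence, not quoted from the paper.

```latex
\hat{r} \;\in\; \arg\max_{r}\;
  \underbrace{\textstyle\sum_{(x,\,y^{+},\,y^{-})}
    \log \sigma\big(r(x, y^{+}) - r(x, y^{-})\big)}_{\text{maximum-likelihood reward fit}}
  \;\pm\; \alpha\, \mathbb{E}_{x}\big[V^{*}_{r}(x)\big]
```

Here $V^{*}_{r}(x)$ denotes the optimal (KL-regularized) value attainable under reward $r$; one plausible reading is that the $+$ sign (optimism) corresponds to the online setting and the $-$ sign (pessimism) to the offline setting.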
arXiv Detail & Related papers (2024-05-29T17:51:42Z) - $i$REPO: $i$mplicit Reward Pairwise Difference based Empirical Preference Optimization [12.266207199002604]
Large Language Models (LLM) can sometimes produce outputs that deviate from human expectations.
We propose a novel framework named $i$REPO, which utilizes implicit Reward pairwise difference regression for Empirical Preference Optimization.
We show that $i$REPO effectively achieves self-alignment using soft-label, self-generated responses and the logit of empirical AI annotators.
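The phrase "implicit Reward pairwise difference regression" with soft labels suggests regressing the policy's implicit reward gap onto annotator soft labels. The snippet below is an assumed reading of that idea using the standard DPO-style implicit reward $\beta \log(\pi_\theta/\pi_{\text{ref}})$; the loss form is illustrative, not $i$REPO's published objective.

```python
# Assumed reading of "implicit reward pairwise difference regression":
# fit sigmoid(implicit-reward gap) to a soft preference label in [0, 1].
import torch

def irepo_style_loss(
    logp_a: torch.Tensor,      # log pi_theta(y_a | x)
    logp_b: torch.Tensor,      # log pi_theta(y_b | x)
    ref_logp_a: torch.Tensor,  # log pi_ref(y_a | x)
    ref_logp_b: torch.Tensor,  # log pi_ref(y_b | x)
    soft_label: torch.Tensor,  # annotator probability that y_a is preferred
    beta: float = 0.1,
) -> torch.Tensor:
    implicit_gap = beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))
    return torch.mean((torch.sigmoid(implicit_gap) - soft_label) ** 2)
```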
arXiv Detail & Related papers (2024-05-24T05:42:11Z)