Related papers: On-the-fly Preference Alignment via Principle-Guided Decoding

URL: http://arxiv.org/abs/2502.14204v1
Date: Thu, 20 Feb 2025 02:23:09 GMT
Title: On-the-fly Preference Alignment via Principle-Guided Decoding
Authors: Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo, Zhendong Mao
Abstract summary: We introduce On-the-fly Preference Alignment via Principle-Guided Decoding (OPAD) to align model outputs with human preferences during inference. OPAD achieves competitive or superior performance in both general and personalized alignment tasks.
Score: 27.50204023448716
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the rapidly expanding landscape of large language models, aligning model generations with human values and preferences is becoming increasingly important. Popular alignment methods, such as Reinforcement Learning from Human Feedback, have shown significant success in guiding models with greater control. However, these methods require considerable computational resources, which is inefficient, and substantial collection of training data to accommodate the diverse and pluralistic nature of human preferences, which is impractical. These limitations significantly constrain the scope and efficacy of both task-specific and general preference alignment methods. In this work, we introduce On-the-fly Preference Alignment via Principle-Guided Decoding (OPAD) to directly align model outputs with human preferences during inference, eliminating the need for fine-tuning. Our approach involves first curating a surrogate solution to an otherwise infeasible optimization problem and then designing a principle-guided reward function based on this surrogate. The final aligned policy is derived by maximizing this customized reward, which exploits the discrepancy between the constrained policy and its unconstrained counterpart. OPAD directly modifies the model's predictions during inference, ensuring principle adherence without incurring the computational overhead of retraining or fine-tuning. Experiments show that OPAD achieves competitive or superior performance in both general and personalized alignment tasks, demonstrating its efficiency and effectiveness compared to state-of-the-art baselines.
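
Illustrative sketch (not from the paper): the abstract describes a reward that exploits the discrepancy between a principle-constrained policy and its unconstrained counterpart, applied to the model's predictions at decoding time. The snippet below shows one way such a step could look for a single token position. The function names, the `scale` parameter, and the exact reward form (the per-token log-probability gap) are assumptions made for illustration; the paper's actual reward and update rule may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def principle_guided_step(constrained_logits, unconstrained_logits, scale=1.0):
    """Reweight next-token probabilities toward a principle (hypothetical form).

    constrained_logits: logits when the principle is included in the prompt.
    unconstrained_logits: logits for the same position without the principle.
    scale: assumed knob for how strongly the reward shifts the distribution.
    """
    logp_c = log_softmax(constrained_logits)
    logp_u = log_softmax(unconstrained_logits)
    # Assumed reward: how much more likely each token is under the principle.
    reward = [c - u for c, u in zip(logp_c, logp_u)]
    # Tilt the constrained distribution by the reward, then renormalize.
    adjusted = [c + scale * r for c, r in zip(logp_c, reward)]
    return [math.exp(x) for x in log_softmax(adjusted)]


# Toy three-token vocabulary: the principle favors token 0.
probs = principle_guided_step([2.0, 1.0, 0.0], [1.0, 1.0, 1.0])
baseline = [math.exp(x) for x in log_softmax([2.0, 1.0, 0.0])]
```

In this toy case the reweighted distribution puts more mass on token 0 than the constrained distribution alone, and setting `scale=0.0` recovers the constrained distribution unchanged. No model weights are updated, which matches the abstract's point that alignment happens during inference rather than through fine-tuning.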

Related papers

SGPO: Self-Generated Preference Optimization based on Self-Improver [6.528083376369728]
Large language models (LLMs) require alignment to human preferences for practical and reliable deployment. We propose Self-Generated Preference Optimization based on Self-Improver (SGPO). The improver refines responses from a policy model to self-generate preference data for direct preference optimization (DPO) of the policy model. Experimental results on AlpacaEval 2.0 and Arena-Hard show that the proposed SGPO significantly improves performance over DPO and baseline self-improving methods.
arXiv Detail & Related papers (2025-07-27T08:55:40Z)
Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling [13.917799959981185]
Direct Alignment Algorithms (DAAs) have emerged as alternatives to the standard Reinforcement Learning from Human Feedback (RLHF). These methods are more susceptible to over-optimization, in which the model drifts away from the reference policy, leading to degraded performance as training progresses. This paper proposes a novel importance-sampling approach to mitigate the over-optimization problem of offline DAAs.
arXiv Detail & Related papers (2025-06-10T10:45:26Z)
Token-Importance Guided Direct Preference Optimization [2.230951739798399]
We propose a Token-Importance Guided Direct Preference Optimization (TI-DPO) to ensure that large language models generate outputs aligned with human preferences. Experimental results show that TI-DPO achieves higher accuracy and stronger generative diversity, providing more stable and computationally efficient solutions.
arXiv Detail & Related papers (2025-05-26T08:11:24Z)
Latent Embedding Adaptation for Human Preference Alignment in Diffusion Planners [16.863492060519157]
This work addresses the challenge of personalizing trajectories generated in automated decision-making systems. We propose a resource-efficient approach that enables rapid adaptation to individual users' preferences.
arXiv Detail & Related papers (2025-03-24T05:11:58Z)
Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z)
Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment [58.049113055986375]
We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy. The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms. We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values. We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO). Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in one single inference step. Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
Self-Supervised Primal-Dual Learning for Constrained Optimization [19.965556179096385]
This paper studies how to train machine-learning models that directly approximate the optimal solutions of constrained optimization problems. It proposes the idea of Primal-Dual Learning (PDL), a self-supervised training method that does not require a set of pre-solved instances or an optimization solver for training and inference.
arXiv Detail & Related papers (2022-08-18T20:07:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.