Active Learning for Direct Preference Optimization
- URL: http://arxiv.org/abs/2503.01076v1
- Date: Mon, 03 Mar 2025 00:36:31 GMT
- Title: Active Learning for Direct Preference Optimization
- Authors: Branislav Kveton, Xintong Li, Julian McAuley, Ryan Rossi, Jingbo Shang, Junda Wu, Tong Yu
- Abstract summary: Direct preference optimization (DPO) is a form of reinforcement learning from human feedback. We propose an active learning framework for DPO, which can be applied to collect human feedback online or to choose the most informative subset of already collected feedback offline.
- Score: 59.84525302418018
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Direct preference optimization (DPO) is a form of reinforcement learning from human feedback (RLHF) where the policy is learned directly from preferential feedback. Although many models of human preferences exist, the critical task of selecting the most informative feedback for training them is under-explored. We propose an active learning framework for DPO, which can be applied to collect human feedback online or to choose the most informative subset of already collected feedback offline. We propose efficient algorithms for both settings. The key idea is to linearize the DPO objective at the last layer of the neural network representation of the optimized policy and then compute the D-optimal design to collect preferential feedback. We prove that the errors in our DPO logit estimates diminish with more feedback. We show the effectiveness of our algorithms empirically in the setting that matches our theory and also on large language models.
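The core procedure described in the abstract (linearize the DPO objective at the last layer, then pick preference pairs by a D-optimal design) can be read as a greedy log-determinant maximization over last-layer feature differences. Below is a minimal sketch under that reading; the names (greedy_d_optimal, features, budget, reg) are illustrative assumptions, and the paper's actual algorithm may weight or regularize the design differently.

```python
import numpy as np


def greedy_d_optimal(features: np.ndarray, budget: int, reg: float = 1e-3) -> list:
    """Greedily select `budget` candidate pairs maximizing the log-determinant of
    the regularized information matrix A = reg * I + sum_i phi_i phi_i^T.

    features[i] is assumed to be the last-layer feature difference between the
    two responses of candidate preference pair i (an illustrative interface).
    """
    n, d = features.shape
    a_inv = np.eye(d) / reg            # inverse of the current information matrix
    selected = []
    remaining = set(range(n))
    for _ in range(min(budget, n)):
        # By the matrix determinant lemma, adding phi increases log det(A) by
        # log(1 + phi^T A^{-1} phi), so pick the largest quadratic form.
        best = max(remaining, key=lambda i: features[i] @ a_inv @ features[i])
        selected.append(best)
        remaining.remove(best)
        phi = features[best]
        av = a_inv @ phi
        # Sherman-Morrison rank-one update of A^{-1}.
        a_inv -= np.outer(av, av) / (1.0 + phi @ av)
    return selected
```

The selected indices would then be sent for human preference labeling (the online setting) or kept as the training subset (the offline setting), matching the two uses described in the abstract.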
Related papers
- Sharpe Ratio-Guided Active Learning for Preference Optimization in RLHF [67.48004037550064]
We propose an active learning approach to efficiently select prompt and preference pairs.
Our method evaluates the gradients of all potential preference annotations to assess their impact on model updates.
Experimental results demonstrate that our method outperforms the baseline by up to 5% in win rates against the chosen completion.
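A rough sketch of the gradient-evaluation idea above, assuming a last-layer (logistic) view of the DPO loss: score each candidate pair by the expected magnitude of the loss gradient over the two possible preference labels. The names and the simplified scoring rule are assumptions for illustration, not the paper's Sharpe-ratio criterion.

```python
import numpy as np


def expected_grad_norm_scores(margins, feat_diffs, beta=0.1):
    """Score candidate pairs by the expected magnitude of the DPO loss gradient
    over the two possible preference labels (a simplified stand-in).

    margins[i]    : current implicit-reward margin of pair i (response a vs. b)
    feat_diffs[i] : last-layer feature difference of pair i
    """
    margins = np.asarray(margins, dtype=float)
    norms = np.linalg.norm(np.asarray(feat_diffs, dtype=float), axis=1)
    p_a = 1.0 / (1.0 + np.exp(-beta * margins))   # model's P(a preferred)
    # The DPO gradient scale is sigmoid(-beta*margin) if a wins and
    # sigmoid(+beta*margin) if b wins; average them under the model's belief.
    expected_scale = p_a * (1.0 - p_a) + (1.0 - p_a) * p_a
    return expected_scale * norms
```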
arXiv Detail & Related papers (2025-03-28T04:22:53Z)
- A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities.
Their alignment with human values remains critical for ensuring helpful and harmless deployments.
Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z)
- PILAF: Optimal Human Preference Sampling for Reward Modeling [14.336058926701432]
We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling. PILAF explicitly aligns preference learning with maximizing the underlying oracle reward.
arXiv Detail & Related papers (2025-02-06T18:09:00Z)
- Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
- Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback [110.16220825629749]
Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models.
In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts.
Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements.
arXiv Detail & Related papers (2024-06-13T16:17:21Z)
- Active Preference Learning for Large Language Models [12.093302163058436]
We develop an active learning strategy for DPO to make better use of preference labels.
We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model.
We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.
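The predictive-entropy acquisition mentioned above can be illustrated with a short sketch: score each prompt by the mean per-token entropy of the model's next-token distributions along a sampled completion, and query preferences for the highest-entropy prompts first. The interface (token_dists, select_prompts_by_entropy) is an assumption for illustration, not the paper's implementation.

```python
import numpy as np


def mean_predictive_entropy(token_dists):
    """Mean per-token predictive entropy of one sampled completion.

    token_dists: array of shape (seq_len, vocab_size); each row is the model's
    next-token distribution at that position (illustrative interface only).
    """
    token_dists = np.asarray(token_dists, dtype=float)
    eps = 1e-12
    entropy = -(token_dists * np.log(token_dists + eps)).sum(axis=1)
    return float(entropy.mean())


def select_prompts_by_entropy(dists_per_prompt, k):
    """Return indices of the k prompts whose completions have the highest
    mean predictive entropy; these would be queried for preference labels."""
    scores = [mean_predictive_entropy(d) for d in dists_per_prompt]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```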
arXiv Detail & Related papers (2024-02-12T23:09:00Z)
- Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of this problem, leads to a compromised mean-seeking approximation of that solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
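One way to read "mean-seeking" above is through the direction of the KL divergence: the alignment objective regularizes with a reverse KL, which is mode-seeking, while a forward-KL fit spreads probability mass over all modes of the target. As a reminder (with $\pi^*$ the optimal aligned policy and $\pi_\theta$ the learned policy; this framing is an aside, not a quote from the paper):

```latex
% Reverse KL (mode-seeking): concentrates \pi_\theta on high-probability modes of \pi^*
\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi^*\right)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[\log \frac{\pi_\theta(y \mid x)}{\pi^*(y \mid x)}\right]

% Forward KL (mean-seeking): forces \pi_\theta to cover all of \pi^*'s support
\mathrm{KL}\!\left(\pi^* \,\|\, \pi_\theta\right)
  = \mathbb{E}_{y \sim \pi^*(\cdot \mid x)}\!\left[\log \frac{\pi^*(y \mid x)}{\pi_\theta(y \mid x)}\right]
```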
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
- Sample Efficient Preference Alignment in LLMs via Active Exploration [63.84454768573154]
We take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy.
We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a worst-case regret bound.
Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets.
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- Beyond Reward: Offline Preference-guided Policy Optimization [18.49648170835782]
Offline preference-based reinforcement learning (PbRL) is a variant of conventional reinforcement learning that dispenses with the need for online interaction.
This study focuses on the topic of offline preference-guided policy optimization (OPPO).
OPPO models offline trajectories and preferences in a one-step process, eliminating the need for separately learning a reward function.
arXiv Detail & Related papers (2023-05-25T16:24:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.