VPO: Leveraging the Number of Votes in Preference Optimization
- URL: http://arxiv.org/abs/2410.22891v1
- Date: Wed, 30 Oct 2024 10:39:34 GMT
- Title: VPO: Leveraging the Number of Votes in Preference Optimization
- Authors: Jae Hyeon Cho, Minkyung Park, Byung-Jun Lee
- Abstract summary: We introduce a technique that leverages user voting data to better align with diverse subjective preferences.
We develop the Vote-based Preference Optimization framework, which incorporates the number of votes on both sides to distinguish between controversial and obvious generation pairs.
- Score: 5.200545764106177
- Abstract: Direct Preference Optimization (DPO) trains a language model using human preference data, bypassing the explicit reward modeling phase of Reinforcement Learning from Human Feedback (RLHF). By iterating over sentence pairs in a preference dataset, DPO enhances generation quality by increasing the likelihood of producing preferred sentences over less favored ones. Preference datasets are typically created by selecting preferred sentences through a voting process involving multiple individuals, as opinions can vary due to the subjective nature of human preferences. While the number of votes offers insight into whether a sentence pair is clearly preferable or controversial, current methods do not fully leverage this information. In this paper, we introduce a technique that leverages user voting data to better align with diverse subjective preferences. We employ the Bayesian Minimum Mean Square Error (Bayesian MMSE) estimator to model the probability that one generation is preferable to another. Using this estimated probability as a target, we develop the Vote-based Preference Optimization (VPO) framework, which incorporates the number of votes on both sides to distinguish between controversial and obvious generation pairs. We show that previous algorithms, such as DPO and Identity Preference Optimization (IPO), can be extended using the proposed framework, termed VDPO and VIPO. Our experiments demonstrate that these proposed algorithms outperform various existing methods, including their base algorithms.
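The vote-based target described in the abstract can be illustrated with a minimal sketch. It assumes each vote is an independent Bernoulli draw and places a symmetric Beta(c, c) prior on the preference probability, so that the Bayesian MMSE estimate is the posterior mean. The prior, the hyperparameter `c`, the function names, and the soft-label form of the DPO loss are assumptions made for illustration; the abstract does not specify them, and this is not presented as the authors' implementation.

```python
import math


def vote_target(votes_a: int, votes_b: int, c: float = 1.0) -> float:
    """Posterior-mean (Bayesian MMSE) estimate of P(a is preferred over b).

    Assumption of this sketch: a symmetric Beta(c, c) prior, which gives the
    posterior Beta(c + votes_a, c + votes_b) and the mean returned below.
    """
    if votes_a < 0 or votes_b < 0 or c <= 0:
        raise ValueError("vote counts must be >= 0 and c must be > 0")
    return (votes_a + c) / (votes_a + votes_b + 2.0 * c)


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def soft_dpo_loss(logratio_a: float, logratio_b: float,
                  target: float, beta: float = 0.1) -> float:
    """Cross-entropy between `target` and sigmoid(beta * margin).

    logratio_x stands for log pi_theta(x | prompt) - log pi_ref(x | prompt).
    With target = 1.0 this reduces to the standard DPO loss for the pair.
    """
    margin = beta * (logratio_a - logratio_b)
    return -(target * log_sigmoid(margin)
             + (1.0 - target) * log_sigmoid(-margin))


if __name__ == "__main__":
    obvious = vote_target(9, 1)        # 10/12, about 0.833
    controversial = vote_target(5, 5)  # exactly 0.5
    print(obvious, controversial)
    # Same policy margin, different targets: the controversial pair is
    # pulled toward a zero margin rather than toward a large one.
    print(soft_dpo_loss(2.0, -1.0, obvious))
    print(soft_dpo_loss(2.0, -1.0, controversial))
```

Under these assumptions a 9-to-1 vote and a 1-to-0 vote yield different targets (10/12 versus 2/3), so the loss separates well-supported preferences from thinly supported ones, while an even split yields a target of 0.5.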
Related papers
- Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models.
CaPO incorporates the general preference from multiple reward models without human-annotated data.
Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z)
- Personalized Preference Fine-tuning of Diffusion Models [75.22218338096316]
We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences.
With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way.
Our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
arXiv Detail & Related papers (2025-01-11T22:38:41Z)
- ComPO: Community Preferences for Language Model Personalization [122.54846260663922]
ComPO is a method to personalize preference optimization in language models.
We collect and release ComPRed, a question answering dataset with community-level preferences from Reddit.
arXiv Detail & Related papers (2024-10-21T14:02:40Z)
- TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees [14.84379332031731]
We introduce Tree Preference Optimization (TPO), which does not sample paired preference responses from the preference tree.
TPO formulates the language model alignment as a Preference List Ranking problem, where the policy can learn more effectively from a ranked preference list of responses.
arXiv Detail & Related papers (2024-10-10T22:22:05Z)
- Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z)
- Ordinal Preference Optimization: Aligning Human Preferences via NDCG [28.745322441961438]
We develop an end-to-end preference optimization algorithm by approximating NDCG with a differentiable surrogate loss.
OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval.
arXiv Detail & Related papers (2024-10-06T03:49:28Z)
- Preference Optimization as Probabilistic Inference [21.95277469346728]
We propose a method that can leverage unpaired preferred or dis-preferred examples, and works even when only one type of feedback is available.
This flexibility allows us to apply it in scenarios with varying forms of feedback and models, including training generative language models.
arXiv Detail & Related papers (2024-10-05T14:04:03Z)
- Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
- Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
We propose a new axis based on eliciting preferences jointly over instruction-response pairs.
Joint preferences over instruction and response pairs can significantly enhance the alignment of large language models.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
- Preference as Reward, Maximum Preference Optimization with Importance Sampling [3.7040071165219595]
We propose a simple and intuitive off-policy preference optimization algorithm from an importance sampling view, which we call Maximum Preference Optimization (MPO).
MPO achieves the best of both worlds by combining the objectives of RLHF and IPO while being an off-policy algorithm.
arXiv Detail & Related papers (2023-12-27T06:34:54Z)