$β$-DPO: Direct Preference Optimization with Dynamic $β$
- URL: http://arxiv.org/abs/2407.08639v2
- Date: Sun, 13 Oct 2024 08:53:00 GMT
- Title: $β$-DPO: Direct Preference Optimization with Dynamic $β$
- Authors: Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He
- Abstract summary: Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences.
We analyze the impact of $\beta$ and data quality on DPO, uncovering that optimal $\beta$ values vary with the informativeness of pairwise data.
We introduce a novel framework that dynamically calibrates $\beta$ at the batch level, informed by data quality considerations.
- Score: 45.63597733177275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the fine-tuning of its trade-off parameter $\beta$, as well as to the quality of the preference data. We analyze the impact of $\beta$ and data quality on DPO, uncovering that optimal $\beta$ values vary with the informativeness of pairwise data. Addressing the limitations of static $\beta$ values, we introduce a novel framework that dynamically calibrates $\beta$ at the batch level, informed by data quality considerations. Additionally, our method incorporates $\beta$-guided data filtering to safeguard against the influence of outliers. Through empirical evaluation, we demonstrate that our dynamic $\beta$ adjustment technique significantly improves DPO's performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback. The code is available at \url{https://github.com/junkangwu/beta-DPO}.
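To make the batch-level idea in the abstract concrete, here is a minimal Python sketch. It is not taken from the paper or its repository. The linear scaling rule, the hard z-score filter, and every name and default in it (`dynamic_beta`, `filter_outliers`, `beta0`, `alpha`, `baseline`, `max_z`) are assumptions made for illustration. Only the DPO loss form, $-\log \sigma(\beta \cdot \text{gap})$, is standard.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_gaps(pol_chosen, pol_rejected, ref_chosen, ref_rejected):
    # Per-pair implicit reward gap, before scaling by beta:
    # (log pi(y_w|x) - log pi_ref(y_w|x)) - (log pi(y_l|x) - log pi_ref(y_l|x))
    return [
        (pc - rc) - (pr - rr)
        for pc, pr, rc, rr in zip(pol_chosen, pol_rejected, ref_chosen, ref_rejected)
    ]


def filter_outliers(gaps, baseline, std, max_z=2.0):
    # ASSUMPTION: a hard z-score cut-off stands in for beta-guided filtering.
    # Returns the indices of pairs whose gap lies close to the running baseline.
    return [i for i, g in enumerate(gaps) if abs(g - baseline) <= max_z * std]


def dynamic_beta(gaps, beta0=0.1, alpha=0.6, baseline=0.0, min_beta=1e-3):
    # ASSUMPTION: beta grows linearly with the batch's mean gap relative to a
    # baseline, so batches that are already well separated are updated more
    # conservatively. The result is clamped to stay positive.
    mean_gap = sum(gaps) / len(gaps)
    return max(min_beta, beta0 * (1.0 + alpha * (mean_gap - baseline)))


def dpo_loss(gaps, beta):
    # Standard DPO objective averaged over the batch.
    return -sum(log_sigmoid(beta * g) for g in gaps) / len(gaps)


if __name__ == "__main__":
    # Toy sequence log-probabilities for two preference pairs (made-up numbers).
    gaps = reward_gaps([-1.0, -2.0], [-3.0, -2.5], [-1.5, -2.0], [-2.5, -2.0])
    kept = filter_outliers(gaps, baseline=0.0, std=1.0)
    batch = [gaps[i] for i in kept]
    beta = dynamic_beta(batch)
    print(beta, dpo_loss(batch, beta))
```

In a real training loop the baseline and spread would typically be tracked as running statistics across batches; here they are passed in as fixed values to keep the sketch short.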
Related papers
- $f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization [91.43730624072226]
$f$-PO is a novel framework that generalizes and extends existing approaches.
We conduct experiments on state-of-the-art language models using benchmark datasets.
arXiv Detail & Related papers (2024-10-29T02:11:45Z)
- Aligning Large Language Models via Self-Steering Optimization [78.42826116686435]
We introduce Self-Steering Optimization ($SSO$), an algorithm that autonomously generates high-quality preference signals.
$SSO$ maintains the accuracy of signals by ensuring a consistent gap between chosen and rejected responses.
We validate the effectiveness of $SSO$ with two foundation models, Qwen2 and Llama3.1, indicating that it provides accurate, on-policy preference signals.
arXiv Detail & Related papers (2024-10-22T16:04:03Z)
- The Crucial Role of Samplers in Online Direct Preference Optimization [36.68862142959827]
Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment.
We provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting.
Our results offer insights into the theoretical standing of DPO and also pave the way for potential algorithm designs.
arXiv Detail & Related papers (2024-09-29T07:53:50Z)
- Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization [45.6430987775264]
This study addresses the challenge of noise in training datasets for Direct Preference Optimization (DPO).
We categorize noise into pointwise noise, which includes low-quality data points, and pairwise noise, which encompasses erroneous data pair associations.
We introduce Distributionally Robustifying DPO, which integrates pairwise robustness by optimizing against worst-case pairwise scenarios.
arXiv Detail & Related papers (2024-07-10T17:48:25Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Provably Robust DPO: Aligning Language Models with Noisy Feedback [10.523790076060171]
We introduce a general framework for policy optimization in the presence of random preference flips.
We design a novel loss function that de-biases the effect of noise on average, making a policy trained by minimizing that loss robust to the noise.
Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO.
arXiv Detail & Related papers (2024-03-01T09:55:18Z)
- Active Preference Optimization for Sample Efficient RLHF [27.772423917657626]
Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models with human preferences.
Current methods rely on uniformly picking prompt-generation pairs from a dataset of prompt-generations.
We develop an active-learning algorithm, APO, which enhances model alignment by querying preference data.
arXiv Detail & Related papers (2024-02-16T08:19:34Z)
- Reinforcement Learning from Human Feedback with Active Queries [67.27150911254155]
Current reinforcement learning approaches often require a large amount of human-labelled preference data.
We propose query-efficient RLHF methods, inspired by the success of active learning.
Our experiments show that ADPO, while making only about half as many queries for human preferences, matches the performance of the state-of-the-art DPO method.
arXiv Detail & Related papers (2024-02-14T18:58:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.