The Crucial Role of Samplers in Online Direct Preference Optimization
- URL: http://arxiv.org/abs/2409.19605v2
- Date: Sun, 6 Oct 2024 02:22:10 GMT
- Title: The Crucial Role of Samplers in Online Direct Preference Optimization
- Authors: Ruizhe Shi, Runlong Zhou, Simon S. Du
- Abstract summary: Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment.
We provide a rigorous analysis of DPO's $\textit{convergence rates}$ with different sampling strategies under the exact gradient setting.
Our results offer insights into the theoretical standing of DPO and also pave the way for potential algorithm designs.
- Score: 36.68862142959827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, the $\textit{optimization}$ properties, particularly the impact of samplers on its convergence rates, remain underexplored. In this paper, we provide a rigorous analysis of DPO's $\textit{convergence rates}$ with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves $\textit{linear}$ convergence, while our proposed online sampler achieves $\textit{quadratic}$ convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and $\textit{logit mixing}$, demonstrating significant improvements over previous approaches. On the Safe-RLHF dataset, our method exhibits a $4.5$% improvement over vanilla DPO and a $3.0$% improvement over on-policy DPO; on Iterative-Prompt, our approach outperforms vanilla DPO, on-policy DPO, and Hybrid GSHF by over $4.2$%. Our results not only offer insights into the theoretical standing of DPO but also pave the way for potential algorithm designs in the future.
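The following is a minimal PyTorch-style sketch of the two ingredients the abstract refers to: the vanilla DPO loss and a logit-mixing sampler that draws responses from a convex combination of policy and reference logits. The function names, the mixing weight `alpha`, and the overall structure are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Vanilla DPO objective on a batch of (chosen, rejected) response pairs."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

@torch.no_grad()
def logit_mix_sample(policy_logits, ref_logits, alpha=0.5):
    """Sample next tokens from softmax(alpha * policy + (1 - alpha) * reference).

    A stand-in for the logit-mixing idea mentioned in the abstract; `alpha`
    is an assumed mixing hyperparameter.
    """
    mixed = alpha * policy_logits + (1.0 - alpha) * ref_logits
    probs = torch.softmax(mixed, dim=-1)                         # [batch, vocab]
    return torch.multinomial(probs, num_samples=1).squeeze(-1)   # [batch]
```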
Related papers
- Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization [66.67988187816185]
We aim to \emph{scale up} the number of on-policy samples via repeated random sampling to improve alignment performance.
Our experiments reveal that this strategy leads to a \emph{decline} in performance as the sample size increases.
We introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
arXiv Detail & Related papers (2025-02-24T04:22:57Z) - Direct Distributional Optimization for Provable Alignment of Diffusion Models [39.048284342436666]
We introduce a novel alignment method for diffusion models from distribution optimization perspectives.
We first formulate the problem as a generic regularized loss minimization over probability distributions.
We enable sampling from the learned distribution by approximating its score function via Doob's $h$-transform technique.
arXiv Detail & Related papers (2025-02-05T07:35:15Z) - Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models.
CaPO incorporates the general preference from multiple reward models without human annotated data.
Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z) - Distributionally Robust Direct Preference Optimization [15.328510632723505]
A major challenge in aligning large language models with human preferences is the issue of distribution shift.
We develop two novel distributionally robust direct preference optimization algorithms, namely, Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO)
Our empirical experiments demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift.
arXiv Detail & Related papers (2025-02-04T02:03:19Z) - Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking [50.325021634589596]
We propose a Tailored Optimization Preference (TailorPO) framework for aligning diffusion models with human preference.
Our approach directly ranks intermediate noisy samples based on their step-wise reward, and effectively resolves the gradient direction issues.
Experimental results demonstrate that our method significantly improves the model's ability to generate aesthetically pleasing and human-preferred images.
arXiv Detail & Related papers (2025-02-01T16:08:43Z) - SWEPO: Simultaneous Weighted Preference Optimization for Group Contrastive Alignment [16.230186347702737]
We propose Simultaneous Weighted Preference Optimization (SWEPO)
SWEPO incorporates multiple responses per query and prioritizes those that deviate most from the average reward.
We prove that such multi-preference sampling lowers alignment bias, bounding the expected deviation from the true acceptable-response distribution at a rate of $\mathcal{O}(\tfrac{1}{\sqrt{k}})$.
arXiv Detail & Related papers (2024-12-05T21:50:22Z) - $f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization [91.43730624072226]
$f$-PO is a novel framework that generalizes and extends existing approaches.
We conduct experiments on state-of-the-art language models using benchmark datasets.
arXiv Detail & Related papers (2024-10-29T02:11:45Z) - Preference Optimization with Multi-Sample Comparisons [53.02717574375549]
We introduce a novel approach that extends post-training to include multi-sample comparisons.
Existing single-sample comparison approaches fail to capture critical characteristics such as generative diversity and bias.
We demonstrate that multi-sample comparison is more effective in optimizing collective characteristics than single-sample comparison.
arXiv Detail & Related papers (2024-10-16T00:59:19Z) - $α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs [45.46582930202524]
$\alpha$-DPO is an adaptive preference optimization algorithm for large language models.
It balances the policy model and the reference model to achieve personalized reward margins.
It consistently outperforms DPO and SimPO across various model settings.
arXiv Detail & Related papers (2024-10-14T04:29:57Z) - Minor DPO reject penalty to increase training robustness [8.971332948872185]
Learning from human preferences is a paradigm used in the fine-tuning stage of large-scale language models (LLMs) to better align a pretrained LLM with human preferences for downstream tasks.
Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method.
In this article, we analyze the working mechanism of $\beta$ in DPO, discuss the syntactic difference between the RL algorithm and DPO, and examine the potential shortcomings introduced by the DPO simplification.
arXiv Detail & Related papers (2024-08-19T09:29:31Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
We present a new offline alignment algorithm, $\chi^2$-Preference Optimization ($\chi$PO).
$\chi$PO implements the principle of pessimism in the face of uncertainty via regularization.
It is provably robust to overoptimization and achieves sample-complexity guarantees based on single-policy concentrability.
arXiv Detail & Related papers (2024-07-18T11:08:40Z) - $β$-DPO: Direct Preference Optimization with Dynamic $β$ [45.63597733177275]
Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences.
We analyze the impact of $\beta$ and data quality on DPO, uncovering that optimal $\beta$ values vary with the informativeness of pairwise data.
We introduce a novel framework that dynamically calibrates $\beta$ at the batch level, informed by data quality considerations.
arXiv Detail & Related papers (2024-07-11T16:21:18Z) - D2PO: Discriminator-Guided DPO with Response Evaluation Models [63.71853401569461]
We propose D2PO, discriminator-guided DPO, for the online setting where preferences are being collected throughout learning.
As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training.
We show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.
arXiv Detail & Related papers (2024-05-02T17:44:41Z) - Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling.
We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z) - Sample Dropout: A Simple yet Effective Variance Reduction Technique in Deep Policy Optimization [18.627233013208834]
We show that the use of importance sampling could introduce high variance in the objective estimate.
We propose a technique called sample dropout to bound the estimation variance by dropping out samples when their ratio deviation is too high.
arXiv Detail & Related papers (2023-02-05T04:44:35Z)
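For the last entry above (sample dropout), a minimal sketch of the ratio-based dropout idea under stated assumptions: samples whose importance ratio deviates too far from 1 are masked out before the policy surrogate is averaged. The threshold `max_dev` and all names are illustrative, not the paper's code.
```python
import torch

def sample_dropout_mask(new_logp, old_logp, max_dev=0.5):
    """Keep a sample only if its importance ratio stays within 1 +/- max_dev."""
    ratio = torch.exp(new_logp - old_logp)
    return (ratio - 1.0).abs() <= max_dev

def masked_surrogate_loss(new_logp, old_logp, advantages, max_dev=0.5):
    """Importance-weighted policy surrogate averaged over retained samples only."""
    ratio = torch.exp(new_logp - old_logp)
    mask = sample_dropout_mask(new_logp, old_logp, max_dev).float()
    surrogate = ratio * advantages
    # clamp guards against a batch where every sample was dropped
    return -(surrogate * mask).sum() / mask.sum().clamp(min=1.0)
```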