Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator
- URL: http://arxiv.org/abs/2502.04567v1
- Date: Thu, 06 Feb 2025 23:45:08 GMT
- Title: Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator
- Authors: Zhuotong Chen, Fang Liu, Xuan Zhu, Yanjun Qi, Mohammad Ghavamzadeh
- Abstract summary: We develop a novel PO framework that provides theoretical guidance to effectively sample dispreferred completions.
We then select contrastive divergence (CD) as the sampling strategy, and propose a novel MC-PO algorithm.
MC-PO outperforms existing SOTA baselines, and OnMC-PO leads to further improvement.
- Score: 32.05337749590184
- Abstract: Existing studies on preference optimization (PO) have centered on constructing pairwise preference data following simple heuristics, such as maximizing the margin between preferred and dispreferred completions based on human (or AI) ranked scores. However, none of these heuristics has a full theoretical justification. In this work, we develop a novel PO framework that provides theoretical guidance to effectively sample dispreferred completions. To achieve this, we formulate PO as minimizing the negative log-likelihood (NLL) of a probability model and propose to estimate its normalization constant via a sampling strategy. As we will demonstrate, these estimative samples can act as dispreferred completions in PO. We then select contrastive divergence (CD) as the sampling strategy, and propose a novel MC-PO algorithm that applies the Monte Carlo (MC) kernel from CD to sample hard negatives w.r.t. the parameterized reward model. Finally, we propose the OnMC-PO algorithm, an extension of MC-PO to the online setting. On popular alignment benchmarks, MC-PO outperforms existing SOTA baselines, and OnMC-PO leads to further improvement.
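The NLL formulation in the abstract can be illustrated with a minimal sketch. It assumes the probability model is p(y|x) proportional to exp(r(x, y)), so the NLL of a preferred completion is log Z(x) - r(x, y_w), and it estimates Z(x) from the preferred completion plus a few sampled negatives. The function names `sampled_nll` and `sample_hard_negative` are hypothetical, and drawing a negative with probability softmax(reward) over a candidate pool is an illustrative stand-in for "hard negatives w.r.t. the reward model", not the paper's confirmed MC kernel.

```python
import numpy as np


def sampled_nll(reward_pos, reward_negs):
    """NLL of the preferred completion, with the normalizer log Z estimated
    from the preferred completion plus sampled negatives (assumed model:
    p(y|x) proportional to exp(reward))."""
    logits = np.concatenate(([reward_pos], np.asarray(reward_negs, dtype=float)))
    m = logits.max()  # subtract the max for a numerically stable log-sum-exp
    log_z = m + np.log(np.exp(logits - m).sum())
    return float(log_z - reward_pos)


def sample_hard_negative(candidate_rewards, rng):
    """Pick one candidate index with probability softmax(reward), so that
    completions the current reward model scores highly are drawn more often.
    This selection rule is an assumption made for illustration."""
    r = np.asarray(candidate_rewards, dtype=float)
    p = np.exp(r - r.max())
    p /= p.sum()
    return int(rng.choice(len(r), p=p))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    candidates = [0.2, 1.5, -0.3, 0.9]  # hypothetical reward scores
    idx = sample_hard_negative(candidates, rng)
    print("chosen negative:", idx)
    print("loss:", sampled_nll(2.0, [candidates[idx]]))
```

With a single negative, `sampled_nll` reduces to -log sigmoid(r_w - r_l), the familiar pairwise logistic loss, which is one way to see why the samples used to estimate the normalizer can play the role of dispreferred completions.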
Related papers
- Federated Fine-Tuning of Large Language Models: Kahneman-Tversky vs. Direct Preference Optimization [49.88778604259453]
We evaluate Kahneman-Tversky Optimization (KTO) as a fine-tuning method for large language models (LLMs) in federated learning (FL) settings.
In both its original (KTOO) and redistributed (KTOR) configurations, KTO consistently outperforms DPO across all benchmarks.
These findings establish KTO as a robust and scalable fine-tuning method for FL, motivating its adoption for privacy-preserving, decentralized, and heterogeneous environments.
arXiv Detail & Related papers (2025-02-20T01:44:21Z)
- Step-level Value Preference Optimization for Mathematical Reasoning [6.318873143509028]
We introduce a novel algorithm called Step-level Value Preference Optimization (SVPO).
Our method achieves state-of-the-art performance on both in-domain and out-of-domain mathematical reasoning benchmarks.
arXiv Detail & Related papers (2024-06-16T09:06:17Z)
- Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Language model (LM) post-training involves maximizing a reward function that is derived from preference annotations.
DPO is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning.
We analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Soft Preference Optimization: Aligning Language Models to Expert Distributions [40.84391304598521]
SPO is a method for aligning generative models, such as Large Language Models (LLMs), with human preferences.
SPO integrates preference loss with a regularization term across the model's entire output distribution.
We showcase SPO's methodology, its theoretical foundation, and its comparative advantages in simplicity, computational efficiency, and alignment precision.
arXiv Detail & Related papers (2024-04-30T19:48:55Z)
- Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime [59.27851754647913]
Predictive optimization is the precise modeling of many real-world applications, including energy cost-aware scheduling and budget allocation on advertising.
We develop a modular framework to benchmark 11 existing PtO/PnO methods on 8 problems, including a new industrial dataset for advertising.
Our study shows that PnO approaches are better than PtO on 7 out of 8 benchmarks, but there is no silver bullet found for the specific design choices of PnO.
arXiv Detail & Related papers (2023-11-13T13:19:34Z)
- Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling.
We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z)
- Estimate-Then-Optimize versus Integrated-Estimation-Optimization versus Sample Average Approximation: A Stochastic Dominance Perspective [15.832111591654293]
We show that a reverse behavior appears when the model class is well-specified and there is sufficient data.
We also demonstrate that standard sample average approximation (SAA) performs the worst in terms of regret when the model class is well-specified.
arXiv Detail & Related papers (2023-04-13T21:54:53Z)
- Optimization of Annealed Importance Sampling Hyperparameters [77.34726150561087]
Annealed Importance Sampling (AIS) is a popular algorithm used to estimate the intractable marginal likelihood of deep generative models.
We present a parametric AIS process with flexible intermediary distributions and optimize the bridging distributions to use fewer sampling steps.
We assess the performance of our optimized AIS for marginal likelihood estimation of deep generative models and compare it to other estimators.
arXiv Detail & Related papers (2022-09-27T07:58:25Z)
- Low-variance estimation in the Plackett-Luce model via quasi-Monte Carlo sampling [58.14878401145309]
We develop a novel approach to producing more sample-efficient estimators of expectations in the PL model.
We illustrate our findings both theoretically and empirically using real-world recommendation data from Amazon Music and the Yahoo learning-to-rank challenge.
arXiv Detail & Related papers (2022-05-12T11:15:47Z)