On Extending Direct Preference Optimization to Accommodate Ties
- URL: http://arxiv.org/abs/2409.17431v2
- Date: Tue, 04 Nov 2025 15:41:06 GMT
- Title: On Extending Direct Preference Optimization to Accommodate Ties
- Authors: Jinghong Chen, Guangyu Yang, Weizhe Lin, Jingbiao Mei, Bill Byrne
- Abstract summary: Two DPO variants explicitly model the possibility of declaring a tie in pair-wise comparisons. We replace the Bradley-Terry model in DPO with two well-known modeling extensions. Experiments in neural machine translation and summarization show that explicitly labeled ties can be added to datasets.
- Score: 27.23138831535272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We derive and investigate two DPO variants that explicitly model the possibility of declaring a tie in pair-wise comparisons. We replace the Bradley-Terry model in DPO with two well-known modeling extensions, by Rao and Kupper and by Davidson, that assign probability to ties as alternatives to clear preferences. Our experiments in neural machine translation and summarization show that explicitly labeled ties can be added to the datasets for these DPO variants without the degradation in task performance that is observed when the same tied pairs are presented to DPO. We find empirically that the inclusion of ties leads to stronger regularization with respect to the reference policy as measured by KL divergence, and we see this even for DPO in its original form. We provide a theoretical explanation for this regularization effect using ideal DPO policy theory. We further show performance improvements over DPO in translation and mathematical reasoning using our DPO variants. We find it can be beneficial to include ties in preference optimization rather than simply discard them, as is done in common practice.
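The two tie-aware models named in the abstract can be sketched numerically. Below is a minimal Python sketch of the Rao-Kupper and Davidson extensions of Bradley-Terry, plus a tie-aware DPO negative log-likelihood built on them. The parameterization of item strengths as exp(beta * log-ratio) follows standard DPO; the hyperparameter names and default values (theta, nu, beta) are illustrative assumptions, not the paper's settings.

```python
import math

def rao_kupper(p_i, p_j, theta=1.5):
    """Rao-Kupper extension of Bradley-Terry: theta >= 1 widens the tie band.

    theta = 1 recovers plain Bradley-Terry (tie probability zero).
    """
    win = p_i / (p_i + theta * p_j)
    loss = p_j / (p_j + theta * p_i)
    tie = (theta ** 2 - 1) * p_i * p_j / ((p_i + theta * p_j) * (p_j + theta * p_i))
    return win, loss, tie

def davidson(p_i, p_j, nu=0.5):
    """Davidson extension: nu >= 0 scales the tie mass via the geometric mean."""
    denom = p_i + p_j + nu * math.sqrt(p_i * p_j)
    return p_i / denom, p_j / denom, nu * math.sqrt(p_i * p_j) / denom

def dpo_tie_nll(logratio_w, logratio_l, label, beta=0.1, model=rao_kupper, **kw):
    """Negative log-likelihood for one pair under a tie-aware DPO variant.

    logratio_* = log pi_theta(y|x) - log pi_ref(y|x); label in {"win", "tie"}.
    Strengths are parameterized as exp(beta * logratio), as in standard DPO.
    """
    p_w = math.exp(beta * logratio_w)
    p_l = math.exp(beta * logratio_l)
    win, _, tie = model(p_w, p_l, **kw)
    return -math.log(win if label == "win" else tie)
```

Both models produce a proper three-way distribution over {win, loss, tie}, so explicitly tied pairs contribute a well-defined likelihood term instead of being discarded.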
Related papers
- Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO [51.22869332661607]
We decompose the performance gap between reinforcement learning from human feedback and direct preference optimization under a representation gap. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specifications.
arXiv Detail & Related papers (2025-05-26T09:54:02Z)
- Towards Self-Improvement of Diffusion Models via Group Preference Optimization [10.6096255671291]
Group Preference Optimization (GPO) is an effective self-improvement method that enhances performance without requiring external data. GPO improves the accurate counting and text rendering capabilities of Stable Diffusion 3.5 Medium by 20 percentage points. As a plug-and-play method, it introduces no extra overhead during inference.
arXiv Detail & Related papers (2025-05-16T10:04:57Z)
- Inducing Robustness in a 2 Dimensional Direct Preference Optimization Paradigm [16.66633426354087]
Direct Preference Optimization (DPO) has emerged as a powerful method for aligning Large Language Models with human preferences. We investigate the performance of DPO using open-source preference datasets. We propose incorporating segment-level score-noise robustness into the 2D-DPO algorithm.
arXiv Detail & Related papers (2025-05-03T05:59:13Z)
- A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities. Their alignment with human values remains critical for ensuring helpful and harmless deployments. Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z)
- Preference-Based Alignment of Discrete Diffusion Models [14.874943508610857]
We introduce Discrete Diffusion DPO (D2-DPO), the first adaptation of Direct Preference Optimization (DPO) to discrete diffusion models formulated as continuous-time Markov chains.
Our approach derives a novel loss function that directly fine-tunes the generative process using preference data while preserving fidelity to a reference distribution.
Our results highlight that D2-DPO enables controlled fine-tuning without requiring explicit reward models, making it a practical alternative to reinforcement learning-based approaches.
arXiv Detail & Related papers (2025-03-11T11:07:35Z)
- C2-DPO: Constrained Controlled Direct Preference Optimization [22.730518243326394]
Direct Preference Optimization (DPO) has emerged as a promising approach for solving the alignment problem in AI. We show that the DPO loss can be derived by starting from an alternative optimization problem that only defines the KL guardrail on in-sample responses.
arXiv Detail & Related papers (2025-02-22T00:38:44Z)
- Federated Fine-Tuning of Large Language Models: Kahneman-Tversky vs. Direct Preference Optimization [49.88778604259453]
We evaluate Kahneman-Tversky Optimization (KTO) as a fine-tuning method for large language models (LLMs) in federated learning (FL) settings.
In both its original (KTOO) and redistributed (KTOR) configurations, KTO consistently outperforms DPO across all benchmarks.
These findings establish KTO as a robust and scalable fine-tuning method for FL, motivating its adoption for privacy-preserving, decentralized, and heterogeneous environments.
arXiv Detail & Related papers (2025-02-20T01:44:21Z)
- Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models.
CaPO incorporates the general preference from multiple reward models without human annotated data.
Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z)
- Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
- Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both [6.102274021710727]
This paper introduces DRDO (Direct Reward Distillation and policy-Optimization), which simultaneously models rewards and preferences to avoid degeneracy.
Results on the Ultrafeedback and TL;DR datasets demonstrate that DRDO-trained policies surpass methods such as DPO and e-DPO in terms of expected rewards.
arXiv Detail & Related papers (2024-10-11T02:19:11Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Understanding Reference Policies in Direct Preference Optimization [50.67309013764383]
Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs).
This work explores an under-investigated aspect of DPO - its dependency on the reference model or policy.
arXiv Detail & Related papers (2024-07-18T17:08:10Z)
- Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence [31.03305638930844]
Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct and robust alignment of Large Language Models with human preferences.
Despite its promising efficacy, DPO faces a notable drawback: "verbosity".
We propose that the issue also stems from an inherent algorithmic length reliance in DPO.
arXiv Detail & Related papers (2024-06-16T14:24:30Z)
- Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Language model (LM) post-training involves maximizing a reward function that is derived from preference annotations.
DPO is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning.
We analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- D2PO: Discriminator-Guided DPO with Response Evaluation Models [63.71853401569461]
We propose D2PO, discriminator-guided DPO, for the online setting where preferences are being collected throughout learning.
As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training.
We show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.
arXiv Detail & Related papers (2024-05-02T17:44:41Z)
- Human Alignment of Large Language Models through Online Preference Optimisation [50.52545798589968]
We show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD).
This equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model.
We introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm.
arXiv Detail & Related papers (2024-03-13T15:47:26Z)
- Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive [15.066029556877721]
We show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples.
We design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode.
Surprisingly, we find that DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks.
arXiv Detail & Related papers (2024-02-20T18:42:34Z)
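The failure mode described in the Smaug entry above, where training reduces the model's likelihood of the preferred response, can be countered with a hinge penalty on drops below the reference model. The sketch below shows one plausible realization of such a DPO-Positive-style loss; the penalty weight `lam` and its placement inside the sigmoid are illustrative assumptions, not the paper's exact formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpop_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, lam=5.0):
    """Sketch of a DPO-Positive-style loss for one preference pair.

    A hinge penalty fires whenever the policy's log-likelihood of the
    preferred response y_w falls below the reference model's, which is the
    failure mode DPOP is designed to avoid.
    """
    ratio_w = logp_w - ref_logp_w   # log pi_theta(y_w) - log pi_ref(y_w)
    ratio_l = logp_l - ref_logp_l
    penalty = max(0.0, -ratio_w)    # > 0 only if pi_theta(y_w) < pi_ref(y_w)
    return -math.log(sigmoid(beta * (ratio_w - ratio_l - lam * penalty)))
```

With `lam = 0` this reduces to the standard DPO loss on the log-ratio margin; a positive `lam` makes any drop in the preferred response's likelihood below the reference directly increase the loss.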