Intelligently Weighting Multiple Reference Models for Direct Preference Optimization of LLMs
- URL: http://arxiv.org/abs/2512.10040v1
- Date: Wed, 10 Dec 2025 19:45:20 GMT
- Title: Intelligently Weighting Multiple Reference Models for Direct Preference Optimization of LLMs
- Authors: Skyler Wu, Aymen Echarghaoui, et al.
- Abstract summary: Multiple-Reference Preference Optimization (MRPO) builds on Direct Preference Optimization (DPO). Current methods for setting the reference weights are ad-hoc and statistically unsound, leading to unreliable performance. We introduce four new weighting strategies: two offline methods that leverage held-out validation signal; one online method that uses a sliding-window estimator to reduce overfitting; and one online method that treats reference weighting as a $K$-armed bandit via Thompson Sampling. Experiments using Qwen2.5-0.5B as the policy model and seven reference models from the Llama, Mistral, Qwen, Yi, and Phi families (0.5B-14B each) show that all 4 of our strategies outperform the current MRPO weighting methods on UltraFeedback and SafeRLHF.
- Score: 2.0411082897313984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning is integral for aligning large language models (LLMs) with human preferences. Multiple-Reference Preference Optimization (MRPO) builds on Direct Preference Optimization (DPO) by fine-tuning LLMs on preference datasets while regularizing the policy towards a mixture of reference models to leverage their collective desirable properties. However, current methods for setting the reference weights are ad-hoc and statistically unsound, leading to unreliable performance. To address this, we introduce four new weighting strategies: two offline methods that leverage held-out validation signal; one online method that uses a sliding-window estimator to reduce overfitting; and an online method that treats reference weighting as a $K$-armed bandit via Thompson Sampling. Experiments using Qwen2.5-0.5B as the policy model and seven reference models from the Llama, Mistral, Qwen, Yi, and Phi families (0.5B-14B each) show that all 4 of our strategies outperform the current MRPO weighting methods on UltraFeedback and SafeRLHF in preference accuracy. More thought-provokingly, however, we find that single-reference DPO, using any of 6 out of 7 references, consistently outperforms all tested multiple-reference approaches -- calling into question the practical appeal of multiple-reference approaches.
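For intuition, here is a minimal sketch of the two ingredients the abstract describes: a DPO-style loss regularized toward a weighted mixture of reference models, and a Thompson Sampling bandit over the $K$ reference weights. This is an illustrative reconstruction, not the authors' code; the log-linear mixture of references, the Beta-Bernoulli reward definition, and all function and variable names below are assumptions.

```python
# Illustrative sketch only (not the paper's released code): a DPO loss whose
# reference term is a weighted log-linear mixture of K reference models, plus a
# Beta-Bernoulli Thompson Sampling bandit for choosing which reference to trust.
# All names, shapes, and the 0/1 validation reward are assumptions.
import torch
import torch.nn.functional as F


def multi_reference_dpo_loss(policy_logp_w, policy_logp_l,
                             ref_logps_w, ref_logps_l, weights, beta=0.1):
    """DPO loss with the single reference replaced by a weighted mixture.

    policy_logp_w, policy_logp_l: (batch,) summed log-probs of the chosen /
        rejected responses under the policy being trained.
    ref_logps_w, ref_logps_l: (batch, K) summed log-probs under each of the
        K reference models.
    weights: (K,) non-negative mixture weights summing to 1.
    """
    mix_w = (ref_logps_w * weights).sum(dim=-1)  # mixed reference log-prob, chosen
    mix_l = (ref_logps_l * weights).sum(dim=-1)  # mixed reference log-prob, rejected
    margin = (policy_logp_w - mix_w) - (policy_logp_l - mix_l)
    return -F.logsigmoid(beta * margin).mean()


class ThompsonReferenceBandit:
    """K-armed Beta-Bernoulli Thompson Sampling over reference models."""

    def __init__(self, k):
        self.alpha = torch.ones(k)  # posterior successes + 1
        self.beta = torch.ones(k)   # posterior failures + 1

    def sample_weights(self):
        """Sample a plausible success rate per reference and put all weight on
        the best draw (a one-hot weighting; softer schemes are also possible)."""
        draws = torch.distributions.Beta(self.alpha, self.beta).sample()
        weights = torch.zeros_like(draws)
        weights[draws.argmax()] = 1.0
        return weights

    def update(self, arm, reward):
        """reward = 1 if the selected reference led to a correctly ranked
        held-out pair, 0 otherwise (one possible reward definition)."""
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```

A toy training loop would sample weights from the bandit, take a gradient step on the weighted loss, score one held-out preference pair, and feed the 0/1 outcome back via `update`. The two offline strategies in the abstract would instead fix the weights from held-out validation signal before training; the sliding-window variant would estimate per-reference performance from only the most recent batches.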
Related papers
- InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization [18.988527161000203]
We propose Intrinsic Self-reflective Preference Optimization (InSPO), deriving a globally optimal policy conditioning on both context and alternative responses. InSPO serves as a plug-and-play enhancement without architectural changes or inference overhead.
arXiv Detail & Related papers (2025-12-29T00:59:23Z)
- Bootstrapping LLMs via Preference-Based Policy Optimization [11.796630967998544]
Bootstrapping large language models (LLMs) through preference-based policy optimization offers a promising direction for aligning model behavior with human preferences. We propose a novel preference-based policy optimization framework that formulates the learning process as a min-max game between the main policy and a reward model. Our approach consistently outperforms existing state-of-the-art preference optimization techniques.
arXiv Detail & Related papers (2025-11-17T01:41:14Z)
- Reverse Preference Optimization for Complex Instruction Following [61.39734201711077]
We propose a simple yet effective method called Reverse Preference Optimization (RPO). It mitigates noise in preference pairs by dynamically reversing the constraints within the instruction to ensure the chosen response is perfect. RPO scales effectively across model sizes, with the 70B RPO model surpassing GPT-4o.
arXiv Detail & Related papers (2025-05-28T09:44:27Z)
- Calibrated Multi-Preference Optimization for Aligning Diffusion Models [90.15024547673785]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models. CaPO incorporates the general preference from multiple reward models without human annotated data. Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- WPO: Enhancing RLHF with Weighted Preference Optimization [40.07940023654452]
Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values.
Off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization.
We propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data.
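As a rough illustration of the idea in this entry (not the authors' exact formulation): one plausible reading is to weight each preference pair's DPO loss by how likely its responses are under the current policy, so off-policy pairs the policy would rarely produce contribute less. The weighting scheme and all names below are assumptions.

```python
# Hypothetical sketch of probability-weighted preference optimization: scale each
# pair's DPO loss by the (length-normalized) likelihood of its responses under
# the current policy. This is an assumed reading of "simulating on-policy
# learning with off-policy preference data", not WPO's exact recipe.
import torch
import torch.nn.functional as F


def weighted_dpo_loss(policy_logp_w, policy_logp_l,
                      ref_logp_w, ref_logp_l,
                      len_w, len_l, beta=0.1):
    """policy_logp_* / ref_logp_*: (batch,) summed log-probs of the chosen (w)
    and rejected (l) responses; len_*: (batch,) response lengths in tokens."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    per_pair = -F.logsigmoid(beta * margin)
    # Length-normalized policy probability of producing both responses:
    # pairs the current policy would rarely generate are down-weighted.
    weight = torch.exp(policy_logp_w / len_w + policy_logp_l / len_l).detach()
    return (weight * per_pair).sum() / weight.sum()
```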
arXiv Detail & Related papers (2024-06-17T17:59:13Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.