Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
- URL: http://arxiv.org/abs/2402.13228v2
- Date: Wed, 3 Jul 2024 13:46:33 GMT
- Title: Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
- Authors: Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White
- Abstract summary: We show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples.
We design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode.
Surprisingly, we find that DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks.
- Score: 15.066029556877721
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct Preference Optimisation (DPO) is effective at significantly improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we find that DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. Furthermore, we find that the DPOP-tuned model outperforms the DPO-tuned model (all else equal) on benchmarks independent of the fine-tuning data, such as MT-Bench. Finally, using DPOP, we create and open-source Smaug-34B and Smaug-72B, with the latter becoming the first open-source LLM to surpass an average accuracy of 80% on the HuggingFace Open LLM Leaderboard.
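The mechanism described in the abstract is simple enough to sketch. Below is a minimal PyTorch illustration of the two losses it contrasts: the standard DPO loss, which depends only on the difference of log-ratios and can therefore keep decreasing even while the preferred completion's log-likelihood falls, and a DPOP-style loss that adds a penalty whenever the policy's log-likelihood of the preferred completion drops below the reference model's. The function names, the beta and lambda values, and the toy numbers are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (assumed hyperparameters), contrasting DPO with a DPOP-style loss.
import torch
import torch.nn.functional as F


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss from per-sequence log-probabilities.

    The loss depends only on the *difference* of log-ratios, so it can fall
    even while the absolute log-likelihood of the preferred completion
    (pi_chosen) also falls, as long as the dispreferred one falls faster.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()


def dpop_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
              beta=0.1, lam=5.0):
    """DPOP-style loss: the DPO margin minus a penalty that activates whenever
    the policy assigns the preferred completion a lower log-likelihood than
    the reference model does, discouraging the failure mode above."""
    penalty = torch.clamp(ref_chosen - pi_chosen, min=0.0)  # > 0 only if pi_theta(y_w) < pi_ref(y_w)
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected) - lam * penalty
    return -F.logsigmoid(beta * margin).mean()


if __name__ == "__main__":
    # Toy example: the policy has pushed BOTH completions down relative to the
    # reference, but the rejected one further, so plain DPO is already content.
    pi_chosen = torch.tensor([-12.0])    # policy log p(y_w | x)
    pi_rejected = torch.tensor([-20.0])  # policy log p(y_l | x)
    ref_chosen = torch.tensor([-10.0])   # reference log p(y_w | x)
    ref_rejected = torch.tensor([-15.0])
    print("DPO  loss:", dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected).item())
    print("DPOP loss:", dpop_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected).item())
```

In the toy example both completions are less likely under the policy than under the reference, but the rejected one has fallen further, so the plain DPO loss is already small; the DPOP penalty term keeps the loss high until the preferred completion's log-likelihood is pushed back above the reference's.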
Related papers
- Less is More: Improving LLM Alignment via Preference Data Selection [46.9163802899686]
Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences.
We propose a novel margin-maximization principle for dataset curation in DPO training.
By using just 10% of the Ultrafeedback dataset, our approach achieves 3% to 8% improvements across various Llama and Mistral series models (see the selection sketch after this list).
arXiv Detail & Related papers (2025-02-20T13:45:17Z)
- SPRec: Self-Play to Debias LLM-based Recommendation [23.875509546540904]
Large language models (LLMs) have attracted significant attention in recommendation systems.
We propose SPRec, a novel self-play framework designed to mitigate over-recommendation and improve fairness without requiring additional data or manual intervention.
arXiv Detail & Related papers (2024-12-12T12:53:30Z)
- SoPo: Text-to-Motion Generation Using Semi-Online Preference Optimization [82.83603957387442]
We focus on fine-tuning text-to-motion models to consistently favor high-quality, human-preferred motions.
In this work, we theoretically investigate DPO under both online and offline settings.
We introduce Semi-online Preference Optimization (SoPo), a DPO-based method for training text-to-motion models.
arXiv Detail & Related papers (2024-12-06T14:50:38Z)
- Modulated Intervention Preference Optimization (MIPO): Keep the Easy, Refine the Difficult [0.48951183832371004]
We propose Modulated Intervention Preference Optimization (MIPO).
MIPO modulates the degree of intervention from the reference model based on how well the given data is aligned with it.
We compare the performance of MIPO and DPO using Mistral-7B and Llama3-8B in Alpaca Eval 2.0 and MT-Bench.
arXiv Detail & Related papers (2024-09-26T05:24:14Z)
- ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood [14.512464277772194]
Aligned Supervised Fine-Tuning (ASFT) is an effective approach that better aligns Large Language Models with pair-wise datasets.
ASFT mitigates the issue where the DPO loss function decreases the probability of generating human-dispreferred data faster than it increases the probability of generating preferred data.
Extensive experiments demonstrate that ASFT is an effective alignment approach, consistently outperforming existing methods.
arXiv Detail & Related papers (2024-09-14T11:39:13Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
We propose a simple, effective, and data-efficient method called Step-DPO.
Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically.
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
arXiv Detail & Related papers (2024-06-26T17:43:06Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better across various preference datasets, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- DavIR: Data Selection via Implicit Reward for Large Language Models [62.59514469369608]
DavIR is a model-based data selection method for post-training Large Language Models.
We show that 6% of the Alpaca dataset selected with DavIR can steer both the LLaMA and Gemma model families to produce superior performance compared to the same models trained on the full 52K dataset.
arXiv Detail & Related papers (2023-10-16T07:26:24Z)
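The "Less is More" entry above describes curating DPO training data by a margin-maximization principle. The sketch below shows one generic way such a criterion could be applied: score each preference pair by the reward gap between the chosen and rejected completions and keep only the top fraction. The ranking rule, the keep_fraction value, and the reward_fn interface are illustrative assumptions, not that paper's exact procedure.

```python
# Hedged sketch of margin-based preference-data selection (assumed criterion).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def select_by_margin(pairs: List[PreferencePair],
                     reward_fn: Callable[[str, str], float],
                     keep_fraction: float = 0.1) -> List[PreferencePair]:
    """Keep the pairs whose chosen-vs-rejected reward margin is largest."""
    scored = [
        (reward_fn(p.prompt, p.chosen) - reward_fn(p.prompt, p.rejected), p)
        for p in pairs
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [p for _, p in scored[:n_keep]]


if __name__ == "__main__":
    # Toy reward: prefer longer responses, purely for demonstration.
    toy_reward = lambda prompt, response: float(len(response))
    data = [
        PreferencePair("Q1", "a detailed answer", "short"),
        PreferencePair("Q2", "ok", "also ok"),
    ]
    print(select_by_margin(data, toy_reward, keep_fraction=0.5))
```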