Triple Preference Optimization: Achieving Better Alignment with Less Data in a Single Step Optimization
- URL: http://arxiv.org/abs/2405.16681v1
- Date: Sun, 26 May 2024 20:18:11 GMT
- Title: Triple Preference Optimization: Achieving Better Alignment with Less Data in a Single Step Optimization
- Authors: Amir Saeidi, Shivanshu Verma, Aswin RRV, Chitta Baral
- Abstract summary: Triple Preference Optimization (TPO) is designed to align large language models with three preferences without requiring a separate Supervised Fine-Tuned (SFT) model.
We show that TPO achieves superior results compared to models aligned through other methods such as SFT, DPO, KTO, IPO, CPO, and ORPO.
- Score: 35.36615140853107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) perform well across diverse tasks, but aligning them with human demonstrations is challenging. Recently, Reinforcement Learning (RL)-free methods like Direct Preference Optimization (DPO) have emerged, offering improved stability and scalability while retaining competitive performance relative to RL-based methods. However, while RL-free methods deliver satisfactory performance, they require significant data to develop a robust Supervised Fine-Tuned (SFT) model and an additional step to fine-tune this model on a preference dataset, which constrains their utility and scalability. In this paper, we introduce Triple Preference Optimization (TPO), a new preference learning method designed to align an LLM with three preferences without requiring a separate SFT step and using considerably less data. Through a combination of practical experiments and theoretical analysis, we show the efficacy of TPO as a single-step alignment strategy. Specifically, we fine-tuned the Phi-2 (2.7B) and Mistral (7B) models using TPO directly on the UltraFeedback dataset, achieving superior results compared to models aligned through other methods such as SFT, DPO, KTO, IPO, CPO, and ORPO. Moreover, the performance of TPO without the SFT component led to notable improvements in the MT-Bench score, with increases of +1.27 and +0.63 over SFT and DPO, respectively. Additionally, TPO showed higher average accuracy, surpassing DPO and SFT by 4.2% and 4.97% on the Open LLM Leaderboard benchmarks. Our code is publicly available at https://github.com/sahsaeedi/triple-preference-optimization .
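The abstract describes a single-step objective that aligns a model with three preferences without a separate SFT stage. The following is a minimal sketch of one way such an objective can be written, assuming each example carries a gold, a preferred, and a rejected response; the weighting `alpha`, the scale `beta`, and the optional reference log-probabilities are illustrative assumptions, not the exact TPO formulation from the paper.

```python
# Minimal sketch (not the exact TPO objective): combine an SFT-like term on a
# "gold" response with a DPO-style preference term over a (preferred, rejected)
# pair, so alignment happens in a single optimization step. alpha, beta, and
# the optional reference log-probs are illustrative assumptions.
import torch.nn.functional as F

def triple_preference_loss(logp_gold, logp_pref, logp_rej,
                           ref_logp_pref=None, ref_logp_rej=None,
                           alpha=1.0, beta=0.1):
    """All inputs are sequence-level log-probabilities (summed over tokens)."""
    # SFT-like term: push up the likelihood of the highest-quality (gold) response.
    sft_term = -logp_gold
    # DPO-style term: prefer `pref` over `rej`; the reference correction is
    # optional so the sketch also covers reference-free variants.
    margin = logp_pref - logp_rej
    if ref_logp_pref is not None and ref_logp_rej is not None:
        margin = margin - (ref_logp_pref - ref_logp_rej)
    pref_term = -F.logsigmoid(beta * margin)
    return (sft_term + alpha * pref_term).mean()
```

In this reading, the gold-response term plays the role that a separate SFT stage would otherwise play, which is why no separately fine-tuned SFT model is needed before preference optimization.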
Related papers
- Less is More: Improving LLM Alignment via Preference Data Selection [46.9163802899686]
Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences.
We propose a novel margin-maximization principle for dataset curation in DPO training.
By using just 10% of the UltraFeedback dataset, our approach achieves 3% to 8% improvements across various Llama and Mistral series models (see the selection sketch below).
arXiv Detail & Related papers (2025-02-20T13:45:17Z)
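A minimal sketch of margin-based curation of a preference dataset, under the assumption that the margin is the score gap between the chosen and rejected response from some scorer (e.g., a reward model); the `scorer` interface and the 10% keep fraction are illustrative, not the paper's exact criterion.

```python
# Hedged sketch of margin-based preference-data selection: keep only the pairs
# with the largest chosen-vs-rejected score margin. `scorer` (e.g. a reward
# model) and the 10% keep fraction are illustrative assumptions.
def select_by_margin(pairs, scorer, keep_fraction=0.10):
    """pairs: iterable of (prompt, chosen, rejected) strings."""
    scored = [(scorer(p, c) - scorer(p, r), (p, c, r)) for p, c, r in pairs]
    scored.sort(key=lambda item: item[0], reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return [pair for _, pair in scored[:k]]
```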
- Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data [51.62162460809116]
We introduce Dynamic Noise Preference Optimization (DNPO) to ensure consistent improvements across iterations.
In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6%.
DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations.
arXiv Detail & Related papers (2025-02-08T01:20:09Z)
- FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings [40.605411087380226]
We introduce FocalPO, a DPO variant that prioritizes enhancing the model's understanding of pairs that it can already rank correctly.
Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor that dynamically scales the DPO loss (see the sketch below).
arXiv Detail & Related papers (2025-01-11T21:41:27Z)
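A minimal sketch of a focal-style modulating factor on the DPO loss, assuming the factor is a power of the model's implicit probability of ranking the pair correctly (which up-weights pairs it already ranks correctly, as described above); the exact factor used by FocalPO is defined in the paper.

```python
# Hedged sketch of a focal-style modulation of the DPO loss (illustrative form,
# not necessarily FocalPO's exact factor). p_correct is the model's implicit
# probability of ranking the pair correctly; scaling by p_correct**gamma
# up-weights pairs the model already ranks correctly.
import torch
import torch.nn.functional as F

def focal_style_dpo_loss(policy_margin, ref_margin, beta=0.1, gamma=2.0):
    """Margins are logp(chosen) - logp(rejected) under the policy / reference model."""
    logits = beta * (policy_margin - ref_margin)
    p_correct = torch.sigmoid(logits)
    return (-(p_correct ** gamma) * F.logsigmoid(logits)).mean()
```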
- Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses (see the sketch below).
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
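A minimal sketch of one simple way to attenuate the DPO loss (and hence its gradient) for uncertain pairs, assuming a per-pair uncertainty estimate such as reward-ensemble disagreement; the exponential down-weighting is an illustrative choice and not necessarily one of the paper's penalization schemes.

```python
# Hedged sketch of uncertainty-aware attenuation of the DPO loss: down-weight
# pairs with high uncertainty (e.g. reward-ensemble disagreement) so their loss
# gradient is attenuated. The exponential weight and `penalty_weight` are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def uncertainty_attenuated_dpo_loss(policy_margin, ref_margin, uncertainty,
                                    beta=0.1, penalty_weight=1.0):
    """Margins are logp(chosen) - logp(rejected); uncertainty >= 0 per pair."""
    logits = beta * (policy_margin - ref_margin)
    weight = torch.exp(-penalty_weight * uncertainty)  # smaller for uncertain pairs
    return (weight * -F.logsigmoid(logits)).mean()
```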
- Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z)
- TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights [73.9088920210495]
We propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward.
TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks (see the sketch below).
arXiv Detail & Related papers (2024-10-06T04:03:00Z)
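A minimal sketch of a token-weighted DPO objective, assuming per-token importance weights are applied inside the sequence log-probabilities before forming the preference margin; how the weights are estimated from token rewards is the paper's contribution and is not reproduced here, and uniform weights recover standard sequence-level log-probabilities.

```python
# Hedged sketch of token-weighted DPO: weight each token's log-probability by an
# importance weight before forming the DPO margin. The weight-estimation step
# (TIS-DPO's main contribution) is not shown here.
import torch.nn.functional as F

def weighted_seq_logprob(token_logps, weights, mask):
    """token_logps, weights, mask: (batch, seq_len) tensors; mask is 1 on response tokens."""
    return (token_logps * weights * mask).sum(dim=-1)

def token_weighted_dpo_loss(chosen_logps, chosen_w, chosen_mask,
                            rejected_logps, rejected_w, rejected_mask,
                            ref_chosen, ref_rejected, beta=0.1):
    margin = (weighted_seq_logprob(chosen_logps, chosen_w, chosen_mask)
              - weighted_seq_logprob(rejected_logps, rejected_w, rejected_mask))
    ref_margin = ref_chosen - ref_rejected  # precomputed reference log-prob margin
    return (-F.logsigmoid(beta * (margin - ref_margin))).mean()
```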
- Length Desensitization in Direct Preference Optimization [26.664176443756773]
It has been observed that DPO tends to over-optimize for verbosity, which can detrimentally affect both performance and user experience.
We propose a length-desensitization improvement method for DPO, termed LD-DPO.
The proposed method aims to desensitize DPO to data length by decoupling explicit length preference, which is relatively insignificant, from the other implicit preferences.
arXiv Detail & Related papers (2024-09-10T10:49:38Z)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
We propose a simple, effective, and data-efficient method called Step-DPO.
Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically.
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
arXiv Detail & Related papers (2024-06-26T17:43:06Z)
- 3D-Properties: Identifying Challenges in DPO and Charting a Path Forward [17.27880657597116]
We revisit DPO, analyzing its theoretical foundations and empirical performance.
We identify three key properties, termed 3D properties, that emerge from DPO's learning process.
We propose simple regularization techniques that improve training stability and performance.
arXiv Detail & Related papers (2024-06-11T14:59:24Z)