Related papers: Filtered Direct Preference Optimization

Filtered Direct Preference Optimization

URL: http://arxiv.org/abs/2404.13846v3
Date: Thu, 4 Jul 2024 07:40:53 GMT
Title: Filtered Direct Preference Optimization
Authors: Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu,
Abstract summary: Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. This paper addresses the issue of text quality within the preference dataset by focusing on direct preference optimization (DPO) We propose an extension of DPO, termed filtered direct preference optimization (fDPO)
Score: 7.060398061192042
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the RLHF framework, to our knowledge, have been limited. This paper addresses the issue of text quality within the preference dataset by focusing on direct preference optimization (DPO), an increasingly adopted reward-model-free RLHF method. We confirm that text quality significantly influences the performance of models optimized with DPO more than those optimized with reward-model-based RLHF. Building on this new insight, we propose an extension of DPO, termed filtered direct preference optimization (fDPO). fDPO uses a trained reward model to monitor the quality of texts within the preference dataset during DPO training. Samples of lower quality are discarded based on comparisons with texts generated by the model being optimized, resulting in a more accurate dataset. Experimental results demonstrate that fDPO enhances the final model performance. Our code is available at https://github.com/CyberAgentAILab/filtered-dpo.

Related papers

Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO [51.22869332661607]
We decompose the performance gap between reinforcement learning from human feedback and direct preference optimization under a representation gap.<n>We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specifications.
arXiv Detail & Related papers (2025-05-26T09:54:02Z)
Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model [20.623037493149507]
We propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO.
arXiv Detail & Related papers (2025-04-22T12:39:30Z)
Active Learning for Direct Preference Optimization [59.84525302418018]
Direct preference optimization (DPO) is a form of reinforcement learning from human feedback. We propose an active learning framework for DPO, which can be applied to collect human feedback online or to choose the most informative subset of already collected feedback offline.
arXiv Detail & Related papers (2025-03-03T00:36:31Z)
Less is More: Improving LLM Alignment via Preference Data Selection [46.9163802899686]
Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences.<n>We propose a novel margin-maximization principle for dataset curation in DPO training.<n>We show that our approach seamlessly extends to iterative DPO, yielding a roughly 3% improvement with 25% online data.
arXiv Detail & Related papers (2025-02-20T13:45:17Z)
Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning [59.11519451499754]
Direct Preference Optimization (DPO) has emerged as a de-facto approach for aligning language models with human preferences. Recent work has shown DPO's effectiveness relies on training data quality. We discover that reference model probability space naturally detects high-quality training samples.
arXiv Detail & Related papers (2025-01-25T07:21:50Z)
CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation [20.959453238159863]
Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging. Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. We propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning.
arXiv Detail & Related papers (2025-01-23T18:59:47Z)
MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples [22.521746860874305]
This study introduces the MPPO algorithm, which leverages the average likelihood of model responses to fit the reward function. Through a comparison of Point-wise, Pair-wise, and List-wise implementations, we found that the Pair-wise approach achieves the best performance. Experimental results demonstrate MPPO's outstanding performance across various benchmarks.
arXiv Detail & Related papers (2024-12-13T14:18:58Z)
Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization [14.50339880957898]
We aim to improve the preference optimization pipeline by taking a closer look at preference data generation and training regularization techniques. For preference data generation, we propose an iterative pairwise ranking mechanism that derives preference ranking of completions using pairwise comparison signals. For training regularization, we observe that preference optimization tends to achieve better convergence when the LLM predicted likelihood of preferred samples gets slightly reduced.
arXiv Detail & Related papers (2024-11-07T23:03:11Z)
Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective [4.548047308860141]
This study investigates the impact of different type of preference data on model performance. It aims to reduce their dependency on extensive amounts of preference data, which is expensive to collect.
arXiv Detail & Related papers (2024-10-22T00:11:41Z)
How to Evaluate Reward Models for RLHF [51.31240621943791]
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback) We build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks. We launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth.
arXiv Detail & Related papers (2024-10-18T21:38:21Z)
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset. We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood [14.512464277772194]
Aligned Supervised Fine-Tuning (ASFT) is an effective approach that better aligns Large Language Models with pair-wise datasets. ASFT mitigates the issue where the DPO loss function decreases the probability of generating human-dispreferred data. Extensive experiments demonstrate that ASFT is an effective alignment approach, consistently outperforming existing methods.
arXiv Detail & Related papers (2024-09-14T11:39:13Z)
Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level [50.897438358317686]
We show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a $50.5%$ length-controlled win rate against $texttGPT-4 Preview$ on AlpacaEval 2.0.
arXiv Detail & Related papers (2024-06-17T17:55:38Z)
Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [80.32171988565999]
We introduce a unified approach to online and offline RLHF -- value-incentivized preference optimization (VPO) VPO regularizes the maximum-likelihood estimate of the reward function with the corresponding value function. Experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
arXiv Detail & Related papers (2024-05-29T17:51:42Z)
Policy Optimization in RLHF: The Impact of Out-of-preference Data [17.126977660436225]
This paper examines two popular alignment methods: Direct Preference Optimization (DPO) and Reward-Model-Based Policy Optimization (RMB-PO) A variant of RMB-PO, referred to as RMB-PO+ is also considered. In particular, compared with DPO, RMB-PO additionally uses policy-generated data, and RMB-PO+ further leverages new, preference-free data.
arXiv Detail & Related papers (2023-12-17T02:14:15Z)
Diffusion Model Alignment Using Direct Preference Optimization [103.2238655827797]
Diffusion-DPO is a method to align diffusion models to human preferences by directly optimizing on human comparison data. We fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. We also develop a variant that uses AI feedback and has comparable performance to training on human preferences.
arXiv Detail & Related papers (2023-11-21T15:24:05Z)
Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling. We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.