Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
- URL: http://arxiv.org/abs/2406.09279v2
- Date: Mon, 07 Oct 2024 21:24:59 GMT
- Title: Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
- Authors: Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi,
- Abstract summary: Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models.
In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts.
Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements.
- Score: 110.16220825629749
- License:
- Abstract: Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts, systematically investigate the impact of these components on downstream model performance, and suggest a recipe for strong learning for preference feedback. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and finally the use of additional unlabeled prompts for policy training. Notably, PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains. High-quality preference data leads to improvements of up to 8% in instruction following and truthfulness. Despite significant gains of up to 5% in mathematical evaluation when scaling up reward models, we surprisingly observe marginal improvements in other categories. We publicly release the code used for training (https://github.com/hamishivi/EasyLM) and evaluating (https://github.com/allenai/open-instruct) our models, along with the models and datasets themselves (https://huggingface.co/collections/allenai/tulu-v25-suite-66676520fd578080e126f618).
Related papers
- CURE4Rec: A Benchmark for Recommendation Unlearning with Deeper Influence [55.21518669075263]
CURE4Rec is the first comprehensive benchmark for recommendation unlearning evaluation.
We consider the deeper influence of unlearning on recommendation fairness and robustness towards data with varying impact levels.
arXiv Detail & Related papers (2024-08-26T16:21:50Z) - Data-Centric Human Preference Optimization with Rationales [23.243583332894737]
Reinforcement learning from human feedback plays a crucial role in aligning language models towards human preferences.
This work shifts focus to improving preference learning through a data-centric approach.
We propose enriching existing preference datasets with machine-generated rationales that explain the reasons behind choices.
arXiv Detail & Related papers (2024-07-19T17:27:52Z) - Preference Learning Algorithms Do Not Learn Preference Rankings [62.335733662381884]
We study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs.
We find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets.
arXiv Detail & Related papers (2024-05-29T21:29:44Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding [9.112203072394648]
Power-law scaling indicates that large-scale training with uniform sampling is prohibitively slow.
Active learning methods aim to increase data efficiency by prioritizing learning on the most relevant examples.
arXiv Detail & Related papers (2023-12-08T19:26:13Z) - Recommendation Unlearning via Influence Function [42.4931807753579]
We propose a new Influence Function-based Recommendation Unlearning (IFRU) framework, which efficiently updates the model without retraining.
IFRU achieves more than 250 times acceleration compared to retraining-based methods with recommendation performance comparable to full retraining.
arXiv Detail & Related papers (2023-07-05T09:42:51Z) - Robust Learning with Progressive Data Expansion Against Spurious
Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z) - Generalized Reinforcement Meta Learning for Few-Shot Optimization [3.7675996866306845]
We present a generic and flexible Reinforcement Learning (RL) based meta-learning framework for the problem of few-shot learning.
Our framework could be easily extended to do network architecture search.
arXiv Detail & Related papers (2020-05-04T03:21:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.