Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
- URL: http://arxiv.org/abs/2406.09279v2
- Date: Mon, 07 Oct 2024 21:24:59 GMT
- Title: Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
- Authors: Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi
- Abstract summary: Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models.
In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts.
Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements.
- Abstract: Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts, systematically investigate the impact of these components on downstream model performance, and suggest a recipe for strong learning for preference feedback. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and finally the use of additional unlabeled prompts for policy training. Notably, PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains. High-quality preference data leads to improvements of up to 8% in instruction following and truthfulness. Despite significant gains of up to 5% in mathematical evaluation when scaling up reward models, we surprisingly observe marginal improvements in other categories. We publicly release the code used for training (https://github.com/hamishivi/EasyLM) and evaluating (https://github.com/allenai/open-instruct) our models, along with the models and datasets themselves (https://huggingface.co/collections/allenai/tulu-v25-suite-66676520fd578080e126f618).
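The abstract compares DPO and PPO without stating either objective. As a rough illustration only, and not taken from the paper's released code, the standard DPO loss for a single preference pair can be sketched as below. The function name, argument names, and the default `beta=0.1` are assumptions made for this example; the inputs are summed sequence log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch, hypothetical names)."""
    # Implicit reward of each response: beta * log(pi(y|x) / pi_ref(y|x)).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy matches the reference, the margin is zero.
loss_at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# When the policy favors the chosen response more than the reference does,
# the margin is positive and the loss is smaller.
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Unlike PPO, this objective needs no separately trained reward model and no sampling from the policy during training, which is why the reward model and policy training prompts listed in the abstract matter only for the PPO setting.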
Related papers
- Clear Preferences Leave Traces: Reference Model-Guided Sampling for Preference Learning [59.11519451499754]
Direct Preference Optimization (DPO) has emerged as a de-facto approach for aligning language models with human preferences.
Recent work has shown DPO's effectiveness relies on training data quality.
We discover that reference model probability space naturally detects high-quality training samples.
arXiv Detail & Related papers (2025-01-25T07:21:50Z)
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- CURE4Rec: A Benchmark for Recommendation Unlearning with Deeper Influence [55.21518669075263]
CURE4Rec is the first comprehensive benchmark for recommendation unlearning evaluation.
We consider the deeper influence of unlearning on recommendation fairness and robustness towards data with varying impact levels.
arXiv Detail & Related papers (2024-08-26T16:21:50Z)
- Data-Centric Human Preference Optimization with Rationales [23.243583332894737]
Reinforcement learning from human feedback plays a crucial role in aligning language models towards human preferences.
This work shifts focus to improving preference learning through a data-centric approach.
We propose enriching existing preference datasets with machine-generated rationales that explain the reasons behind choices.
arXiv Detail & Related papers (2024-07-19T17:27:52Z)
- Preference Learning Algorithms Do Not Learn Preference Rankings [62.335733662381884]
We study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs.
We find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets.
arXiv Detail & Related papers (2024-05-29T21:29:44Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding [9.112203072394648]
Power-law scaling indicates that large-scale training with uniform sampling is prohibitively slow.
Active learning methods aim to increase data efficiency by prioritizing learning on the most relevant examples.
arXiv Detail & Related papers (2023-12-08T19:26:13Z)
- Recommendation Unlearning via Influence Function [42.4931807753579]
We propose a new Influence Function-based Recommendation Unlearning (IFRU) framework, which efficiently updates the model without retraining.
IFRU achieves more than 250 times acceleration compared to retraining-based methods with recommendation performance comparable to full retraining.
arXiv Detail & Related papers (2023-07-05T09:42:51Z)