DPO Meets PPO: Reinforced Token Optimization for RLHF
- URL: http://arxiv.org/abs/2404.18922v3
- Date: Tue, 11 Feb 2025 17:23:13 GMT
- Title: DPO Meets PPO: Reinforced Token Optimization for RLHF
- Authors: Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, Liwei Wang
- Abstract summary: We introduce an algorithm that learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal.
Experiments demonstrate that RTO performs better than PPO and other direct preference learning algorithms.
- Score: 35.638723885233475
- Abstract: In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards, a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of large language models, its open-source implementation remains largely sub-optimal. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Under this framework, we introduce an algorithm, Reinforced Token Optimization (RTO), which learns a token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, RTO is proven to find a near-optimal policy sample-efficiently. For its practical implementation, RTO innovatively integrates Direct Preference Optimization (DPO) and PPO: DPO, originally derived from sparse sentence rewards, surprisingly provides a token-wise characterization of response quality, which is seamlessly incorporated into the subsequent PPO training stage. Extensive experiments demonstrate that RTO performs better than PPO and other direct preference learning algorithms. In particular, RTO outperforms PPO by 7.5 points on the AlpacaEval 2 benchmark and by 4.1 points on Arena-Hard. Our code and models are available at https://github.com/zkshan2002/RTO.
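The practical observation behind this recipe is that a DPO-trained policy implicitly carries a dense, token-wise reward. In the KL-regularized token-level MDP (the standard setup in this line of work, sketched here rather than quoted from the paper), the per-token log-ratio between the learned policy and the reference policy equals an optimal advantage:

$$\beta \log \frac{\pi^{*}(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} = Q^{*}(s_t, a_t) - V^{*}(s_t), \qquad V^{*}(s_t) = \beta \log \sum_{a} \pi_{\mathrm{ref}}(a \mid s_t)\, \exp\bigl(Q^{*}(s_t, a)/\beta\bigr).$$

Treating this log-ratio as a per-token reward and handing it to PPO is the integration the abstract describes. The PyTorch snippet below is a minimal sketch of that computation only, assuming Hugging Face-style causal language models whose forward pass returns `.logits`; the function name, the value of `beta`, and the single-prompt-length masking are illustrative assumptions, and details such as reward scaling or mixing with a sequence-level reward should be taken from the official repository linked above.

```python
import torch

@torch.no_grad()
def dpo_tokenwise_rewards(dpo_model, ref_model, input_ids, attention_mask,
                          prompt_len, beta=0.05):
    """Hypothetical sketch: per-token reward for response tokens,
    r_t = beta * (log pi_dpo(a_t | s_t) - log pi_ref(a_t | s_t)),
    intended to be consumed by a standard PPO trainer as a dense reward."""
    def token_logps(model):
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        # log-probability the model assigns to each realized next token
        logps = torch.log_softmax(logits[:, :-1], dim=-1)
        return logps.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

    rewards = beta * (token_logps(dpo_model) - token_logps(ref_model))
    # keep credit only on response tokens (positions after the prompt)
    mask = attention_mask[:, 1:].clone().float()
    mask[:, : prompt_len - 1] = 0.0
    return rewards * mask
```

In this reading, the dense rewards replace or augment the single sequence-level reward that vanilla PPO places on the final token, so the value function and advantage estimates receive per-token credit.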
Related papers
- Federated Fine-Tuning of Large Language Models: Kahneman-Tversky vs. Direct Preference Optimization [49.88778604259453]
We evaluate Kahneman-Tversky Optimization (KTO) as a fine-tuning method for large language models (LLMs) in federated learning (FL) settings.
In both its original (KTOO) and redistributed (KTOR) configurations, KTO consistently outperforms DPO across all benchmarks.
These findings establish KTO as a robust and scalable fine-tuning method for FL, motivating its adoption for privacy-preserving, decentralized, and heterogeneous environments.
arXiv Detail & Related papers (2025-02-20T01:44:21Z)
- MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples [22.521746860874305]
This study introduces the MPPO algorithm, which leverages the average likelihood of model responses to fit the reward function.
Through a comparison of Point-wise, Pair-wise, and List-wise implementations, we found that the Pair-wise approach achieves the best performance.
Experimental results demonstrate MPPO's outstanding performance across various benchmarks.
arXiv Detail & Related papers (2024-12-13T14:18:58Z)
- Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation, with performance that is stronger than or comparable to PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function [50.812404038684505]
We show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation.
We discuss applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
arXiv Detail & Related papers (2024-04-18T17:37:02Z)
- Token-level Direct Preference Optimization [8.249403373337024]
Fine-tuning pre-trained Large Language Models is essential to align them with human values and intentions.
We introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level.
arXiv Detail & Related papers (2024-04-18T08:49:38Z)
- Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study [16.99550556866219]
Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences.
In academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO).
We show that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions.
arXiv Detail & Related papers (2024-04-16T16:51:53Z)
- Preference as Reward, Maximum Preference Optimization with Importance Sampling [3.7040071165219595]
We propose a simple and intuitive off-policy preference optimization algorithm from an importance sampling view, which we call Maximum Preference Optimization (MPO).
MPO achieves the best of both worlds by combining the objectives of RLHF and IPO while being an off-policy algorithm.
arXiv Detail & Related papers (2023-12-27T06:34:54Z)
- Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment [37.52249093928251]
This paper proposes a new framework, reinforcement learning with relative feedback, and a novel trajectory-wise policy gradient algorithm, Pairwise Proximal Policy Optimization (P3O).
We show theoretically that P3O is invariant to equivalent rewards and avoids the complexity of PPO.
Empirical evaluations demonstrate that P3O outperforms PPO in the KL-Reward trade-off and can align with human preferences as well as or better than prior methods.
arXiv Detail & Related papers (2023-09-30T01:23:22Z)
- Secrets of RLHF in Large Language Models Part I: PPO [81.01936993929127]
Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence.
Reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit.
In this report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising PPO algorithms impact policy agent training.
arXiv Detail & Related papers (2023-07-11T01:55:24Z)
- Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO).
OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)