COPR: Continual Human Preference Learning via Optimal Policy
Regularization
- URL: http://arxiv.org/abs/2402.14228v2
- Date: Tue, 27 Feb 2024 08:47:37 GMT
- Title: COPR: Continual Human Preference Learning via Optimal Policy
Regularization
- Authors: Han Zhang, Lin Gui, Yu Lei, Yuanzhao Zhai, Yehong Zhang, Yulan He, Hui
Wang, Yue Yu, Kam-Fai Wong, Bin Liang, Ruifeng Xu
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) is commonly utilized to improve the alignment of Large Language Models (LLMs) with human preferences.
We propose the Continual Optimal Policy Regularization (COPR) method, which draws inspiration from optimal policy theory.
- Score: 56.1193256819677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) is commonly utilized to
improve the alignment of Large Language Models (LLMs) with human preferences.
Given the evolving nature of human preferences, continual alignment becomes
more crucial and practical in comparison to traditional static alignment.
Nevertheless, making RLHF compatible with Continual Learning (CL) is
challenging due to its complex process. Meanwhile, directly learning new human
preferences may lead to Catastrophic Forgetting (CF) of historical preferences,
resulting in unhelpful or harmful outputs. To overcome these challenges, we
propose the Continual Optimal Policy Regularization (COPR) method, which draws
inspiration from optimal policy theory. COPR utilizes a sampling
distribution both as a demonstration and as regularization constraints for CL. It
adopts the Lagrangian Duality (LD) method to dynamically regularize the current
policy based on the historically optimal policy, which prevents CF and avoids
over-emphasizing unbalanced objectives. We also provide formal proof for the
learnability of COPR. The experimental results show that COPR outperforms
strong CL baselines on our proposed benchmark in terms of reward-based
evaluation, GPT-4 evaluation, and human assessment. Furthermore, we validate the robustness of
COPR under various CL settings, including different backbones, replay memory
sizes, and learning orders.
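To make the regularization idea above concrete, the following is a minimal sketch of a COPR-style continual-alignment step; it is not the paper's implementation. It assumes each past task's optimal policy is summarized by stored target log-probabilities on replayed responses, uses a squared-error drift term as a stand-in for the paper's divergence measure, and updates one Lagrange multiplier per past task by plain dual ascent; all names and hyperparameters are illustrative.

```python
# Illustrative sketch only: Lagrangian-dual regularization toward historically
# optimal policies, with one multiplier per past task (assumptions, not the
# paper's exact objective).
import torch
import torch.nn.functional as F

def copr_step(policy_logprobs_new,      # log pi_theta(y|x) on current-task data
              target_logprobs_new,      # target (approx. optimal-policy) log-probs, current task
              policy_logprobs_replay,   # list per past task: log pi_theta on replayed data
              target_logprobs_replay,   # list per past task: stored optimal-policy log-probs
              lambdas,                  # list of scalar tensors, one multiplier per past task
              eps=0.05, dual_lr=0.1):
    # Primal objective: fit the current policy to the (approximate) optimal
    # sampling distribution derived from the new preference task.
    new_task_loss = F.mse_loss(policy_logprobs_new, target_logprobs_new)

    # Constraint terms: keep the policy close to each historically optimal
    # policy on replayed prompts; violations are weighted by their multipliers.
    penalty = policy_logprobs_new.new_zeros(())
    for i, (lp, tp) in enumerate(zip(policy_logprobs_replay, target_logprobs_replay)):
        drift = F.mse_loss(lp, tp)                 # stand-in for a KL-type divergence
        penalty = penalty + lambdas[i] * (drift - eps)

    loss = new_task_loss + penalty                 # Lagrangian; the primal step minimizes this

    # Dual ascent: grow a multiplier when its constraint is violated, shrink it
    # otherwise, and keep it non-negative.
    with torch.no_grad():
        for i, (lp, tp) in enumerate(zip(policy_logprobs_replay, target_logprobs_replay)):
            drift = F.mse_loss(lp, tp)
            lambdas[i] = torch.clamp(lambdas[i] + dual_lr * (drift - eps), min=0.0)

    return loss, lambdas
```

In a training loop, one would backpropagate through the returned loss for the primal update and carry the updated multipliers into the next step.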
Related papers
- Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
- Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (PCRM) training method to mitigate uncontrolled reward score scaling.
PCRM incorporates prior constraints, specifically the length ratio and the cosine similarity between the outputs of each comparison pair, during reward model training to regulate the optimization magnitude and control score margins.
Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
arXiv Detail & Related papers (2024-04-01T07:49:11Z)
- CLHA: A Simple yet Effective Contrastive Learning Framework for Human Alignment [42.71324708567498]
Reinforcement learning from human feedback (RLHF) is a crucial technique in aligning large language models (LLMs) with human preferences.
We present a simple yet effective Contrastive Learning Framework for Human Alignment (CLHA) to align LLMs with human preferences directly.
arXiv Detail & Related papers (2024-03-25T11:37:15Z)
- Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles [26.955375398765085]
Reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs).
In this paper, we observe a weakness of the KL regularization commonly employed in existing RLHF methods to address overoptimization.
We propose uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty regularization during RL-finetuning.
arXiv Detail & Related papers (2023-12-30T14:14:14Z)
- Nash Learning from Human Feedback [86.09617990412941]
We introduce an alternative pipeline for the fine-tuning of large language models using pairwise human feedback.
We term this approach Nash learning from human feedback (NLHF).
We present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent.
arXiv Detail & Related papers (2023-12-01T19:26:23Z)
- COPR: Continual Learning Human Preference through Optimal Policy Regularization [32.54658750353585]
We propose a new method called Continual Optimal Policy Regularization (COPR).
COPR involves a single learning phase and doesn't necessitate complex reinforcement learning.
Our experimental results show that COPR outperforms strong Continual Learning (CL) baselines.
arXiv Detail & Related papers (2023-10-24T10:05:32Z)
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
- Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling.
We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint (the vanilla DPO objective is sketched after this list for reference).
arXiv Detail & Related papers (2023-09-13T01:07:25Z)
- Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism [91.52263068880484]
We study offline Reinforcement Learning with Human Feedback (RLHF).
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices.
RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift.
arXiv Detail & Related papers (2023-05-29T01:18:39Z)
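For reference, the Statistical Rejection Sampling entry above builds on the DPO loss. The sketch below shows only the standard (vanilla) DPO objective, not that paper's unified or enhanced losses; inputs are assumed to be per-sequence summed token log-probabilities under the policy and a frozen reference model, and all names are illustrative.

```python
# Vanilla DPO objective (for reference; not the unified losses proposed in the
# Statistical Rejection Sampling paper).
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are scaled log-probability ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry logistic loss on the margin between preferred and
    # dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```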