Stabilizing RLHF through Advantage Model and Selective Rehearsal
- URL: http://arxiv.org/abs/2309.10202v1
- Date: Mon, 18 Sep 2023 23:06:32 GMT
- Title: Stabilizing RLHF through Advantage Model and Selective Rehearsal
- Authors: Baolin Peng and Linfeng Song and Ye Tian and Lifeng Jin and Haitao Mi
and Dong Yu
- Abstract summary: Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences remains a significant challenge.
This challenge is characterized by various instabilities, such as reward hacking and catastrophic forgetting.
We propose two innovations to stabilize RLHF training: 1) Advantage Model, which directly models the advantage score and regulates score distributions across tasks to prevent reward hacking; and 2) Selective Rehearsal, which mitigates catastrophic forgetting by strategically selecting data for PPO training and knowledge rehearsal.
- Score: 57.504894664689
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have revolutionized natural language processing,
yet aligning these models with human values and preferences using RLHF remains
a significant challenge. This challenge is characterized by various
instabilities, such as reward hacking and catastrophic forgetting. In this
technical report, we propose two innovations to stabilize RLHF training: 1)
Advantage Model, which directly models the advantage score, i.e., the extra reward
compared to the expected reward, and regulates score distributions across tasks
to prevent reward hacking; and 2) Selective Rehearsal, which mitigates catastrophic
forgetting by strategically selecting data for PPO training and knowledge
rehearsal. Our experimental analysis on public and proprietary datasets
reveals that the proposed methods not only increase stability in RLHF training
but also achieve higher reward scores and win rates.
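The abstract describes the two mechanisms only at a high level, so the Python sketch below is an assumption-laden illustration rather than the paper's implementation: a per-task baseline turns raw reward-model scores into advantage-style scores with comparable distributions across tasks, and a simple filter picks SFT examples to rehearse during PPO. All function names, heuristics, and thresholds are hypothetical.

```python
# Hypothetical sketch of the two stabilization ideas described in the abstract.
# Names and heuristics are illustrative assumptions, not the paper's method.
from collections import defaultdict
from statistics import mean, pstdev


def advantage_scores(samples):
    """Turn raw reward scores into advantage-style scores.

    Each sample is a dict with a 'task' label and a raw 'reward'.
    The advantage is the extra reward relative to the per-task expectation,
    scaled by the per-task spread so that score distributions stay
    comparable across tasks (the "regulates score distributions" idea).
    """
    by_task = defaultdict(list)
    for s in samples:
        by_task[s["task"]].append(s["reward"])

    # Per-task mean and standard deviation; fall back to 1.0 for degenerate tasks.
    baselines = {t: (mean(r), pstdev(r) or 1.0) for t, r in by_task.items()}
    return [
        (s["reward"] - baselines[s["task"]][0]) / baselines[s["task"]][1]
        for s in samples
    ]


def select_rehearsal_data(sft_examples, score_fn, keep_ratio=0.2):
    """Select a subset of SFT examples to mix back into PPO batches.

    `score_fn` ranks examples by whatever selection criterion is used
    (the abstract does not specify one, so it is left as a parameter);
    the top fraction is rehearsed so PPO updates do not erase the skills
    those examples represent.
    """
    ranked = sorted(sft_examples, key=score_fn, reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]
```

In training, the selected rehearsal subset would be mixed into each PPO batch alongside newly sampled rollouts; the selection criterion itself is left unspecified here because the abstract does not state it.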
Related papers
- Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards [26.40009657912622]
Reinforcement learning from human feedback (RLHF) is the mainstream paradigm used to align large language models (LLMs) with human preferences.
Yet existing RLHF heavily relies on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources.
In this work, we improve the effectiveness of the reward model by introducing a penalty term on the reward, named contrastive rewards.
arXiv Detail & Related papers (2024-03-12T14:51:57Z)
- Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
arXiv Detail & Related papers (2024-01-30T00:17:37Z)
- Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF [79.98542868281471]
Reinforcement Learning from Human Feedback (RLHF) is a technique that aligns language models closely with human-centric values.
It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective.
This paper delves into these issues, leveraging theoretical insights to design an improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS).
arXiv Detail & Related papers (2024-01-29T17:43:42Z)
- WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions (a minimal weight-averaging sketch appears after this list).
arXiv Detail & Related papers (2024-01-22T18:27:08Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles [26.955375398765085]
Reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs).
In this paper, we observe the weakness of the KL regularization commonly employed in existing RLHF methods to address overoptimization.
We propose uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty regularization during RL-finetuning (an ensemble-based penalty sketch appears after this list).
arXiv Detail & Related papers (2023-12-30T14:14:14Z)
- SuperHF: Supervised Iterative Learning from Human Feedback [20.22920163075946]
We focus on two prevalent methods used to align large language models: Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
We propose a novel approach, Supervised Iterative Learning from Human Feedback (SuperHF), which seeks to leverage the strengths of both methods.
Our experimental results show that SuperHF exceeds PPO-based RLHF on the training objective, favorably trades off high reward against reward hacking, improves downstream calibration, and performs comparably on our GPT-4-based qualitative evaluation scheme, all while being significantly simpler to implement.
arXiv Detail & Related papers (2023-10-25T16:52:00Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods (a minimal sketch of the DPO loss appears after this list).
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
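For the WARM entry above, the mechanism named in the title is averaging the weights of several reward models fine-tuned for the same task. Below is a minimal PyTorch sketch of uniform weight averaging, assuming the models share one architecture; `RewardModel` and the model list are placeholders, not the paper's code.

```python
# Minimal sketch of uniform weight averaging across reward models (WARM-style).
# Assumes all models share the same architecture; RewardModel is a placeholder.
import torch


def average_reward_models(models):
    """Return a state_dict that is the element-wise mean of the models' weights."""
    state_dicts = [m.state_dict() for m in models]
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = torch.mean(
            torch.stack([sd[key].float() for sd in state_dicts]), dim=0
        )
    return averaged


# Usage (hypothetical): load the averaged weights into a fresh model instance.
# merged = RewardModel()
# merged.load_state_dict(average_reward_models([rm1, rm2, rm3]))
```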
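For the uncertainty-penalized entry, one common way to turn ensemble disagreement into a regularizer is to subtract a multiple of the ensemble's standard deviation from its mean reward. The sketch below assumes that form; the penalty coefficient and the reward-model call signature are placeholders, and the paper's LoRA-ensemble specifics are not reproduced here.

```python
# Rough sketch of an uncertainty-penalized reward computed from an ensemble of
# reward models; the coefficient and the model interface are assumptions.
import torch


def penalized_reward(reward_models, prompt, response, beta=1.0):
    """Score a (prompt, response) pair as ensemble mean minus a spread penalty."""
    with torch.no_grad():
        # Each reward model is assumed to return a scalar tensor.
        scores = torch.stack([rm(prompt, response) for rm in reward_models])
    return scores.mean() - beta * scores.std()
```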
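For the DPO entry, the closed-form reparameterization reduces preference learning to a classification-style loss on policy and reference log-probabilities. A minimal sketch of that pairwise loss follows; the sequence log-probabilities are assumed to be precomputed per example, and beta is the usual temperature hyperparameter.

```python
# Minimal sketch of the DPO objective: a logistic loss on the difference of
# policy-vs-reference log-ratios for chosen and rejected responses.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Each argument is a tensor of per-example sequence log-probabilities."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```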