ODIN: Disentangled Reward Mitigates Hacking in RLHF
- URL: http://arxiv.org/abs/2402.07319v1
- Date: Sun, 11 Feb 2024 22:40:12 GMT
- Title: ODIN: Disentangled Reward Mitigates Hacking in RLHF
- Authors: Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom
Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro
- Abstract summary: We study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback.
A well-formatted, verbose but less helpful response from the LLMs can often deceive LLMs or even human evaluators to achieve high scores.
Our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
- Score: 127.35607931337019
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we study the issue of reward hacking on the response length, a
challenge emerging in Reinforcement Learning from Human Feedback (RLHF) on
LLMs. A well-formatted, verbose but less helpful response from the LLMs can
often deceive LLMs or even human evaluators to achieve high scores. The same
issue also holds for some reward models in RL. To address the challenges in
both training and evaluation, we establish a more reliable evaluation protocol
for comparing different training configurations, which inspects the trade-off
between LLM evaluation score and response length obtained by varying training
hyperparameters. Based on this evaluation, we conduct large-scale studies,
where the results shed insights into the efficacy of hyperparameters and tricks
used in RL on mitigating length bias. We further propose to improve the reward
model by jointly training two linear heads on shared feature representations to
predict the rewards, one trained to correlate with length, and the other
trained to decorrelate with length and therefore focus more on the actual
content. We then discard the length head in RL to prevent reward hacking on
length. Experiments demonstrate that our approach almost eliminates the reward
correlation with length, and improves the obtained policy by a significant
margin.
Related papers
- ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback [13.154512864498912]
We propose a two-stage algorithm ARES that Alternates REinforcement Learning (RL) and Supervised Fine-Tuning (SFT)
First, we request the Teacher to score how much each sentence contributes to solving the problem in a Chain-of-Thought (CoT)
Second, we ask the Teacher to correct the wrong reasoning after the RL stage. With the correction feedback, we stabilize the RL fine-tuned model through SFT.
arXiv Detail & Related papers (2024-06-25T07:20:11Z) - PERL: Parameter Efficient Reinforcement Learning from Human Feedback [27.687265760622918]
Reinforcement Learning from Human Feedback (RLHF) has proven to be a strong method to align Large Language Models with human preferences.
We study RLHF where the underlying models are trained using the parameter efficient method of Low-Rank Adaptation (LoRA) introduced by Hu et al.
We find that PERL performs on par with the conventional RLHF setting, while training faster, and with less memory.
arXiv Detail & Related papers (2024-03-15T21:43:46Z) - Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards [26.40009657912622]
Reinforcement learning from human feedback (RLHF) is the mainstream paradigm used to align large language models (LLMs) with human preferences.
Yet existing RLHF heavily relies on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources.
In this work, we improve the effectiveness of the reward model by introducing a penalty term on the reward, named as textitcontrastive rewards
arXiv Detail & Related papers (2024-03-12T14:51:57Z) - Teaching Large Language Models to Reason with Reinforcement Learning [38.17625148525193]
Reinforcement Learning from Human Feedback (textbfRLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences.
Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback.
arXiv Detail & Related papers (2024-03-07T16:36:29Z) - RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models [62.72318564072706]
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences.
Despite its advantages, RLHF relies on human annotators to rank the text.
We propose RankPoison, a poisoning attack method on candidates' selection of preference rank flipping to reach certain malicious behaviors.
arXiv Detail & Related papers (2023-11-16T07:48:45Z) - A Long Way to Go: Investigating Length Correlations in RLHF [59.49656695716066]
This paper demonstrates, on three diverse settings, that optimizing for response length is a significant factor behind RLHF.
We find improvements in reward to largely be driven by increasing response length, instead of other features.
Even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models.
arXiv Detail & Related papers (2023-10-05T17:38:28Z) - The Trickle-down Impact of Reward (In-)consistency on RLHF [71.37987812944971]
We show that reward inconsistency exhibits a trickle-down effect on the downstream Reinforcement Learning from Human Feedback process.
We propose Contrast Instructions -- a benchmarking strategy for the consistency of RM.
We show that RLHF models trained with a more consistent RM yield more useful responses.
arXiv Detail & Related papers (2023-09-28T04:05:13Z) - Reward Uncertainty for Exploration in Preference-based Reinforcement
Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward.
Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z) - PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via
Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.