ODIN: Disentangled Reward Mitigates Hacking in RLHF
- URL: http://arxiv.org/abs/2402.07319v1
- Date: Sun, 11 Feb 2024 22:40:12 GMT
- Title: ODIN: Disentangled Reward Mitigates Hacking in RLHF
- Authors: Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom
Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro
- Abstract summary: We study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback.
A well-formatted, verbose but less helpful response from the LLMs can often deceive LLMs or even human evaluators to achieve high scores.
Our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
- Score: 127.35607931337019
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we study the issue of reward hacking on the response length, a
challenge emerging in Reinforcement Learning from Human Feedback (RLHF) on
LLMs. A well-formatted, verbose but less helpful response from the LLMs can
often deceive LLMs or even human evaluators to achieve high scores. The same
issue also holds for some reward models in RL. To address the challenges in
both training and evaluation, we establish a more reliable evaluation protocol
for comparing different training configurations, which inspects the trade-off
between LLM evaluation score and response length obtained by varying training
hyperparameters. Based on this evaluation, we conduct large-scale studies,
where the results shed insights into the efficacy of hyperparameters and tricks
used in RL on mitigating length bias. We further propose to improve the reward
model by jointly training two linear heads on shared feature representations to
predict the rewards, one trained to correlate with length, and the other
trained to decorrelate with length and therefore focus more on the actual
content. We then discard the length head in RL to prevent reward hacking on
length. Experiments demonstrate that our approach almost eliminates the reward
correlation with length, and improves the obtained policy by a significant
margin.
Related papers
- How to Evaluate Reward Models for RLHF [51.31240621943791]
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback)
We build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks.
We launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth.
arXiv Detail & Related papers (2024-10-18T21:38:21Z) - MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions [46.608747360764035]
Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences.
We propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process.
We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis.
arXiv Detail & Related papers (2024-10-03T17:55:13Z) - Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback [8.601283886845664]
Reinforcement learning from human feedback (RLHF) aligns Large language models (LLMs) with human intentions and values.
Despite its effectiveness and popularity, RLHF is prone to biased local optimization.
We propose a novel textitsequence-to-sequence (seq2seq) reward modeling method.
arXiv Detail & Related papers (2024-08-30T16:14:35Z) - Reward Difference Optimization For Sample Reweighting In Offline RLHF [18.62836654699957]
Current offline RLHF only captures the "ordinal relationship" between responses, overlooking the crucial aspect of how much one is preferred over the others.
We propose a simple yet effective solution called Reward Difference Optimization, shorted as RDO.
Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation.
arXiv Detail & Related papers (2024-08-18T07:04:16Z) - Teaching Large Language Models to Reason with Reinforcement Learning [38.17625148525193]
Reinforcement Learning from Human Feedback (textbfRLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences.
Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback.
arXiv Detail & Related papers (2024-03-07T16:36:29Z) - RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models [62.72318564072706]
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences.
Despite its advantages, RLHF relies on human annotators to rank the text.
We propose RankPoison, a poisoning attack method on candidates' selection of preference rank flipping to reach certain malicious behaviors.
arXiv Detail & Related papers (2023-11-16T07:48:45Z) - A Long Way to Go: Investigating Length Correlations in RLHF [59.49656695716066]
This paper demonstrates, on three diverse settings, that optimizing for response length is a significant factor behind RLHF.
We find improvements in reward to largely be driven by increasing response length, instead of other features.
Even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models.
arXiv Detail & Related papers (2023-10-05T17:38:28Z) - The Trickle-down Impact of Reward (In-)consistency on RLHF [71.37987812944971]
We show that reward inconsistency exhibits a trickle-down effect on the downstream Reinforcement Learning from Human Feedback process.
We propose Contrast Instructions -- a benchmarking strategy for the consistency of RM.
We show that RLHF models trained with a more consistent RM yield more useful responses.
arXiv Detail & Related papers (2023-09-28T04:05:13Z) - Reward Uncertainty for Exploration in Preference-based Reinforcement
Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward.
Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.