Related papers: ODIN: Disentangled Reward Mitigates Hacking in RLHF

ODIN: Disentangled Reward Mitigates Hacking in RLHF

URL: http://arxiv.org/abs/2402.07319v1
Date: Sun, 11 Feb 2024 22:40:12 GMT
Title: ODIN: Disentangled Reward Mitigates Hacking in RLHF
Authors: Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro
Abstract summary: We study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback. A well-formatted, verbose but less helpful response from the LLMs can often deceive LLMs or even human evaluators to achieve high scores. Our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
Score: 127.35607931337019
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback (RLHF) on LLMs. A well-formatted, verbose but less helpful response from the LLMs can often deceive LLMs or even human evaluators to achieve high scores. The same issue also holds for some reward models in RL. To address the challenges in both training and evaluation, we establish a more reliable evaluation protocol for comparing different training configurations, which inspects the trade-off between LLM evaluation score and response length obtained by varying training hyperparameters. Based on this evaluation, we conduct large-scale studies, where the results shed insights into the efficacy of hyperparameters and tricks used in RL on mitigating length bias. We further propose to improve the reward model by jointly training two linear heads on shared feature representations to predict the rewards, one trained to correlate with length, and the other trained to decorrelate with length and therefore focus more on the actual content. We then discard the length head in RL to prevent reward hacking on length. Experiments demonstrate that our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.

Related papers

CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models [29.95434387343843]
We propose a unified framework that mitigates length bias through three components.<n>CoLD consistently reduces reward-length correlation, improves accuracy in step selection, and encourages more concise, logically valid reasoning.
arXiv Detail & Related papers (2025-07-21T15:07:59Z)
Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback [52.1410307583181]
We useReinforcement Learning from Human Feedback to train language models (LMs) to follow complex human preferences.<n>As training progresses, the responses generated by the LM no longer resemble the responses seen by the reward model (RM)<n>We propose Off-Policy Corrected Reward Modeling to correct the RM using importance weighting, without requiring new labels or samples.
arXiv Detail & Related papers (2025-07-21T11:19:04Z)
Bias Fitting to Mitigate Length Bias of Reward Model in RLHF [81.44256822500257]
Reinforcement Learning from Human Feedback relies on reward models to align large language models with human preferences.<n>We propose FiMi-RM, a framework that autonomously learns and corrects underlying bias patterns.<n> Experimental results demonstrate that FiMi-RM achieves a more balanced length-reward distribution.
arXiv Detail & Related papers (2025-05-19T08:29:28Z)
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback [12.7099489697479]
We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness.
arXiv Detail & Related papers (2025-03-28T08:26:41Z)
Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling [87.17041933863041]
We introduce a Response-conditioned Bradley-Terry (Rc-BT) model that enhances the reward model's capability in length bias mitigating and length instruction following. We also propose the Rc-DPO algorithm to leverage the Rc-BT model for direct policy optimization (DPO) of large language models.
arXiv Detail & Related papers (2025-02-02T14:50:25Z)
How to Evaluate Reward Models for RLHF [51.31240621943791]
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback) We build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks. We launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth.
arXiv Detail & Related papers (2024-10-18T21:38:21Z)
MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions [46.608747360764035]
Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. We propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis.
arXiv Detail & Related papers (2024-10-03T17:55:13Z)
Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback [8.601283886845664]
Reinforcement learning from human feedback (RLHF) aligns Large language models (LLMs) with human intentions and values. Despite its effectiveness and popularity, RLHF is prone to biased local optimization. We propose a novel textitsequence-to-sequence (seq2seq) reward modeling method.
arXiv Detail & Related papers (2024-08-30T16:14:35Z)
Reward Difference Optimization For Sample Reweighting In Offline RLHF [18.62836654699957]
Current offline RLHF only captures the "ordinal relationship" between responses, overlooking the crucial aspect of how much one is preferred over the others. We propose a simple yet effective solution called Reward Difference Optimization, shorted as RDO. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation.
arXiv Detail & Related papers (2024-08-18T07:04:16Z)
Teaching Large Language Models to Reason with Reinforcement Learning [38.17625148525193]
Reinforcement Learning from Human Feedback (textbfRLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback.
arXiv Detail & Related papers (2024-03-07T16:36:29Z)
RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models [62.72318564072706]
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences. Despite its advantages, RLHF relies on human annotators to rank the text. We propose RankPoison, a poisoning attack method on candidates' selection of preference rank flipping to reach certain malicious behaviors.
arXiv Detail & Related papers (2023-11-16T07:48:45Z)
A Long Way to Go: Investigating Length Correlations in RLHF [59.49656695716066]
This paper demonstrates, on three diverse settings, that optimizing for response length is a significant factor behind RLHF. We find improvements in reward to largely be driven by increasing response length, instead of other features. Even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models.
arXiv Detail & Related papers (2023-10-05T17:38:28Z)
The Trickle-down Impact of Reward (In-)consistency on RLHF [71.37987812944971]
We show that reward inconsistency exhibits a trickle-down effect on the downstream Reinforcement Learning from Human Feedback process. We propose Contrast Instructions -- a benchmarking strategy for the consistency of RM. We show that RLHF models trained with a more consistent RM yield more useful responses.
arXiv Detail & Related papers (2023-09-28T04:05:13Z)
Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms. Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward. Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.