A Long Way to Go: Investigating Length Correlations in RLHF
- URL: http://arxiv.org/abs/2310.03716v2
- Date: Wed, 10 Jul 2024 23:15:49 GMT
- Title: A Long Way to Go: Investigating Length Correlations in RLHF
- Authors: Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett
- Abstract summary: This paper demonstrates, on three diverse settings, that optimizing for response length is a significant factor behind RLHF's gains.
We find that improvements in reward are largely driven by increasing response length rather than by other features.
Even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models.
- Score: 59.49656695716066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Great success has been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models, with open preference datasets enabling wider experimentation, particularly for "helpfulness" in tasks like dialogue and web question answering. Alongside these improvements, however, RLHF also often drives models to produce longer outputs. This paper demonstrates, on three diverse settings, that optimizing for response length is, much more than previously thought, a significant factor behind RLHF. Studying the strategies RL optimization uses to maximize reward, we find improvements in reward to largely be driven by increasing response length, instead of other features. Indeed, we find that even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models. Testing a comprehensive set of length-countering interventions, we identify the dominant source of these biases to be reward models, which, by studying training dynamics, we find are non-robust and easily influenced by length biases in preference data.
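The following is a minimal sketch (not the authors' released code) of the two length analyses the abstract describes: a purely length-based reward that ignores content, and the correlation between a reward model's scores and response length. The whitespace tokenization and the reward-model scores are placeholder assumptions.
```python
# Minimal sketch of the length analyses described above (not the authors' code).
# Assumptions: whitespace tokenization and placeholder reward-model scores.

import numpy as np


def length_reward(response: str, max_tokens: int = 512) -> float:
    """A purely length-based reward: the capped, normalized token count."""
    n_tokens = len(response.split())  # crude whitespace "tokenizer"
    return min(n_tokens, max_tokens) / max_tokens


def length_score_correlation(responses, scores) -> float:
    """Pearson correlation between response lengths and reward scores."""
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    scores = np.asarray(scores, dtype=float)
    return float(np.corrcoef(lengths, scores)[0, 1])


if __name__ == "__main__":
    demo_responses = [
        "Short answer.",
        "A somewhat longer and more detailed answer to the same question.",
        "A very long and verbose answer " * 8,
    ]
    demo_scores = [0.2, 0.5, 0.9]  # placeholder reward-model outputs
    print([round(length_reward(r), 3) for r in demo_responses])
    print(round(length_score_correlation(demo_responses, demo_scores), 3))
```
A high correlation between lengths and scores is the kind of signal the paper reports when arguing that reward gains are largely length-driven.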
Related papers
- Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling [87.17041933863041]
We introduce a Response-conditioned Bradley-Terry (Rc-BT) model that enhances the reward model's ability to mitigate length bias and to follow length instructions (a minimal Bradley-Terry loss sketch appears after this list).
We also propose the Rc-DPO algorithm, which leverages the Rc-BT model for direct preference optimization (DPO) of large language models.
arXiv Detail & Related papers (2025-02-02T14:50:25Z) - Does RLHF Scale? Exploring the Impacts From Data, Model, and Method [83.53178716807776]
This study explores the scaling properties of Reinforcement Learning from Human Feedback in Large Language Models.
We analyze key components in the RLHF framework--model size, data composition, and inference budget--and their impacts on performance.
arXiv Detail & Related papers (2024-12-08T17:19:48Z) - How to Evaluate Reward Models for RLHF [51.31240621943791]
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback).
We build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks.
We launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth.
arXiv Detail & Related papers (2024-10-18T21:38:21Z) - Measuring memorization in RLHF for code completion [18.3607188787591]
Reinforcement learning with human feedback (RLHF) has become the dominant method to align large models to user preferences.
We analyze how training data memorization can surface and propagate through each phase of RLHF and direct preference learning.
Our work suggests that RLHF, as opposed to direct preference learning, is a safer way to mitigate the risk of regurgitating sensitive preference data when aligning large language models.
arXiv Detail & Related papers (2024-06-17T16:33:35Z) - Disentangling Length from Quality in Direct Preference Optimization [93.74831404396174]
Reinforcement Learning from Human Feedback (RLHF) has been a crucial component in the recent success of Large Language Models.
RLHF is known to exploit biases in human preferences, such as verbosity.
We develop a principled but simple regularization strategy that prevents length exploitation, while still maintaining improvements in model quality.
arXiv Detail & Related papers (2024-03-28T06:03:47Z) - ODIN: Disentangled Reward Mitigates Hacking in RLHF [127.35607931337019]
We study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback.
A well-formatted, verbose but less helpful response from an LLM can often deceive LLM or even human evaluators into assigning high scores.
Our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
arXiv Detail & Related papers (2024-02-11T22:40:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.