A Long Way to Go: Investigating Length Correlations in RLHF
- URL: http://arxiv.org/abs/2310.03716v2
- Date: Wed, 10 Jul 2024 23:15:49 GMT
- Title: A Long Way to Go: Investigating Length Correlations in RLHF
- Authors: Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett
- Abstract summary: This paper demonstrates, on three diverse settings, that optimizing for response length is a significant factor behind RLHF's reported improvements.
We find that improvements in reward are largely driven by increasing response length rather than by other features.
Even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models.
- Score: 59.49656695716066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Great success has been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models, with open preference datasets enabling wider experimentation, particularly for "helpfulness" in tasks like dialogue and web question answering. Alongside these improvements, however, RLHF also often drives models to produce longer outputs. This paper demonstrates, on three diverse settings, that optimizing for response length is, much more than previously thought, a significant factor behind RLHF's reported improvements. Studying the strategies RL optimization uses to maximize reward, we find improvements in reward to be largely driven by increasing response length rather than by other features. Indeed, we find that even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models. Testing a comprehensive set of length-countering interventions, we identify the dominant source of these biases to be the reward models, which, when we study their training dynamics, we find to be non-robust and easily influenced by length biases in preference data.
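To make the abstract's central claims concrete, the sketch below shows (i) a purely length-based reward of the kind the paper finds can reproduce most downstream RLHF gains, and (ii) a quick check for length bias in preference data. The tokenizer choice, the normalization constant, and the function names are assumptions for illustration, not details taken from the paper.

```python
from transformers import AutoTokenizer

# Hypothetical settings for illustration; not values from the paper.
MAX_LEN = 256  # truncation point for the length signal
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def length_only_reward(prompt: str, response: str) -> float:
    """Score a response purely by its truncated, normalized token length.

    Dropping this in place of a learned reward model inside an RLHF loop
    isolates how much downstream improvement length alone can explain.
    """
    n_tokens = len(tokenizer(response)["input_ids"])
    return min(n_tokens, MAX_LEN) / MAX_LEN

def longer_preferred_rate(pairs) -> float:
    """Fraction of (chosen, rejected) pairs whose chosen response is longer.

    Values well above 0.5 indicate the kind of length bias in preference
    data that the paper identifies as a driver of non-robust reward models.
    """
    longer = sum(
        1
        for chosen, rejected in pairs
        if len(tokenizer(chosen)["input_ids"]) > len(tokenizer(rejected)["input_ids"])
    )
    return longer / max(len(pairs), 1)
```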
Related papers
- LongReward: Improving Long-context Large Language Models with AI Feedback [54.3321542678909]
LongReward is a novel method that provides rewards for long-context model responses along four human-valued dimensions.
Our experiments indicate that LongReward not only significantly improves models' long-context performance but also enhances their ability to follow short instructions.
arXiv Detail & Related papers (2024-10-28T17:50:42Z)
- How to Evaluate Reward Models for RLHF [51.31240621943791]
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback).
We build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks (a toy version of this fitting step is sketched after this entry).
We launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth.
arXiv Detail & Related papers (2024-10-18T21:38:21Z)
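The benchmark above treats a reward model's scores on proxy tasks as predictors of how well RLHF with that reward model turns out. The toy fit below only illustrates that idea; the arrays are hypothetical placeholders and the linear model is an assumption, not the benchmark's actual predictor.

```python
import numpy as np

# Hypothetical placeholder data: each row is one reward model's accuracy on
# three proxy tasks; y is the downstream win rate of the RLHF'd policy it produced.
proxy_scores = np.array([
    [0.71, 0.64, 0.80],
    [0.58, 0.66, 0.75],
    [0.77, 0.70, 0.83],
    [0.69, 0.61, 0.78],
])
downstream_win_rate = np.array([0.55, 0.48, 0.61, 0.52])

# Least-squares fit with a bias column: predict downstream performance from proxy scores.
X = np.hstack([proxy_scores, np.ones((len(proxy_scores), 1))])
coef, *_ = np.linalg.lstsq(X, downstream_win_rate, rcond=None)

def predict_downstream(scores: np.ndarray) -> float:
    """Predicted downstream win rate for a new reward model's proxy-task scores."""
    return float(np.append(scores, 1.0) @ coef)

print(round(predict_downstream(np.array([0.72, 0.65, 0.81])), 3))
```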
- Measuring memorization in RLHF for code completion [18.3607188787591]
Reinforcement learning with human feedback (RLHF) has become the dominant method to align large models to user preferences.
We analyze how training data memorization can surface and propagate through each phase of RLHF and direct preference learning.
Our work suggests that RLHF, as opposed to direct preference learning, is a safer way to mitigate the risk of regurgitating sensitive preference data when aligning large language models.
arXiv Detail & Related papers (2024-06-17T16:33:35Z)
- Disentangling Length from Quality in Direct Preference Optimization [93.74831404396174]
Reinforcement Learning from Human Feedback (RLHF) has been a crucial component in the recent success of Large Language Models.
RLHF is known to exploit biases in human preferences, such as verbosity.
We develop a principled but simple regularization strategy that prevents length exploitation while still maintaining improvements in model quality (one possible form of this length penalty is sketched after this entry).
arXiv Detail & Related papers (2024-03-28T06:03:47Z)
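The regularization mentioned above can be pictured as a DPO-style loss with an explicit penalty on the chosen-versus-rejected length gap. The sketch below is one plausible form under that reading; the coefficient `alpha` and the exact placement of the penalty are assumptions rather than the paper's stated formulation.

```python
import torch
import torch.nn.functional as F

def length_regularized_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi(y_w | x) under the policy
    policy_rejected_logps: torch.Tensor,  # log pi(y_l | x) under the policy
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    chosen_lengths: torch.Tensor,         # token lengths of chosen responses
    rejected_lengths: torch.Tensor,       # token lengths of rejected responses
    beta: float = 0.1,
    alpha: float = 0.01,  # assumed strength of the length penalty
) -> torch.Tensor:
    """DPO-style loss with a penalty on the chosen-minus-rejected length gap.

    With alpha = 0 this reduces to the standard DPO objective; alpha > 0
    removes the incentive to win comparisons simply by being longer.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    length_penalty = alpha * (chosen_lengths - rejected_lengths).float()
    return -F.logsigmoid(chosen_margin - rejected_margin - length_penalty).mean()
```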
- ODIN: Disentangled Reward Mitigates Hacking in RLHF [127.35607931337019]
We study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback.
A well-formatted, verbose but less helpful response from the LLM can often deceive LLM-based or even human evaluators into assigning high scores.
Our approach almost eliminates the reward correlation with length and improves the obtained policy by a significant margin (a structural sketch of such a two-head reward follows this entry).
arXiv Detail & Related papers (2024-02-11T22:40:12Z)
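ODIN's core idea, as summarized above, is to split the reward into a length-correlated part and a quality part and use only the latter as the RL signal. The sketch below shows that two-head structure with a simple batch-level correlation penalty; the actual ODIN training losses (e.g., how the length head is supervised and how the heads are kept orthogonal) are not reproduced here.

```python
import torch
import torch.nn as nn

class TwoHeadReward(nn.Module):
    """Reward head split into a quality score and a length score.

    Only `quality` would be used as the RL reward; the `length` head exists to
    absorb length-correlated signal during training and is then discarded.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.quality_head = nn.Linear(hidden_size, 1)
        self.length_head = nn.Linear(hidden_size, 1)

    def forward(self, features: torch.Tensor):
        # features: final hidden state of the response, shape (batch, hidden_size)
        return self.quality_head(features).squeeze(-1), self.length_head(features).squeeze(-1)

def length_correlation_penalty(quality: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Absolute Pearson-style correlation between quality scores and response lengths.

    Adding this to the reward-model loss pushes the quality head away from
    relying on length; it is a simplified stand-in for ODIN's objective.
    """
    q = quality - quality.mean()
    l = lengths.float() - lengths.float().mean()
    return (q * l).mean().abs() / (q.std() * l.std() + 1e-8)
```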
- Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions (a minimal aggregation sketch follows this entry).
arXiv Detail & Related papers (2024-01-30T00:17:37Z)
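One standard way to turn an ensemble of reward models into a single, more robust training signal is conservative aggregation. The sketch below uses mean-minus-standard-deviation purely as an illustration; the ensemble construction and aggregation rule used in the paper above may differ.

```python
import torch

def ensemble_reward(per_model_scores: torch.Tensor, penalty: float = 1.0) -> torch.Tensor:
    """Aggregate reward-model ensemble scores, shape (num_models, batch) -> (batch,).

    Mean-minus-std is a common conservative choice: responses the ensemble
    disagrees on receive a lower effective reward, limiting over-optimization
    against any single reward model's quirks. `penalty` is an assumed knob.
    """
    return per_model_scores.mean(dim=0) - penalty * per_model_scores.std(dim=0)
```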
- The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback [5.037876196534672]
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings.
In this paper, we illustrate the causes of this objective mismatch, review relevant literature from model-based reinforcement learning, and argue for solutions.
arXiv Detail & Related papers (2023-10-31T21:52:41Z)