The Trickle-down Impact of Reward (In-)consistency on RLHF
- URL: http://arxiv.org/abs/2309.16155v1
- Date: Thu, 28 Sep 2023 04:05:13 GMT
- Title: The Trickle-down Impact of Reward (In-)consistency on RLHF
- Authors: Lingfeng Shen and Sihao Chen and Linfeng Song and Lifeng Jin and
Baolin Peng and Haitao Mi and Daniel Khashabi and Dong Yu
- Abstract summary: We show that reward inconsistency exhibits a trickle-down effect on the downstream Reinforcement Learning from Human Feedback process.
We propose Contrast Instructions -- a benchmarking strategy for the consistency of RM.
We show that RLHF models trained with a more consistent RM yield more useful responses.
- Score: 71.37987812944971
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Standard practice within Reinforcement Learning from Human Feedback (RLHF)
involves optimizing against a Reward Model (RM), which itself is trained to
reflect human preferences for desirable generations. A notable subject that is
understudied is the (in-)consistency of RMs -- whether they can recognize the
semantic changes to different prompts and appropriately adapt their reward
assignments -- and their impact on the downstream RLHF model.
In this paper, we visit a series of research questions relevant to RM
inconsistency: (1) How can we measure the consistency of reward models? (2) How
consistent are the existing RMs and how can we improve them? (3) In what ways
does reward inconsistency influence the chatbots resulting from the RLHF model
training?
We propose Contrast Instructions -- a benchmarking strategy for the
consistency of RM. Each example in Contrast Instructions features a pair of
lexically similar instructions with different ground truth responses. A
consistent RM is expected to rank the corresponding instruction and response
higher than other combinations. We observe that current RMs trained with the
standard ranking objective fail miserably on Contrast Instructions compared to
average humans. To show that RM consistency can be improved efficiently without
using extra training budget, we propose two techniques ConvexDA and
RewardFusion, which enhance reward consistency through extrapolation during the
RM training and inference stage, respectively. We show that RLHF models trained
with a more consistent RM yield more useful responses, suggesting that reward
inconsistency exhibits a trickle-down effect on the downstream RLHF process.
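As a concrete illustration of the consistency criterion above, here is a minimal sketch (not the paper's reference code): assuming a trained RM exposed as a callable reward(instruction, response) -> float, a Contrast Instructions example counts as consistent only if each instruction's own ground-truth response outranks the swapped response. The function and type names below are illustrative assumptions.
```python
from typing import Callable, Iterable, Tuple

# One Contrast Instructions example: two lexically similar instructions,
# each paired with its own ground-truth response.
ContrastExample = Tuple[str, str, str, str]  # (inst_1, resp_1, inst_2, resp_2)

def contrast_consistency(
    reward: Callable[[str, str], float],
    examples: Iterable[ContrastExample],
) -> float:
    """Fraction of examples where the RM ranks the matched
    instruction-response pairing above both swapped pairings."""
    total = consistent = 0
    for inst_1, resp_1, inst_2, resp_2 in examples:
        total += 1
        if (reward(inst_1, resp_1) > reward(inst_1, resp_2)
                and reward(inst_2, resp_2) > reward(inst_2, resp_1)):
            consistent += 1
    return consistent / max(total, 1)
```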
Related papers
- RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences.
We introduce a causal framework that learns preferences independent of context-free artifacts such as response length and format.
Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z)
- Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback [8.601283886845664]
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) with human intentions and values.
Despite its effectiveness and popularity, RLHF is prone to biased local optimization.
We propose a novel sequence-to-sequence (seq2seq) reward modeling method.
arXiv Detail & Related papers (2024-08-30T16:14:35Z)
- Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [23.27203570485055]
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models with human preferences.
We propose a two-stage approach to train a reward model (RM) with multi-dimensional absolute-rating data.
We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM.
arXiv Detail & Related papers (2024-06-18T17:58:28Z)
- It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF [33.197077764166536]
Reinforcement Learning from Human Feedback involves training policy models (PMs) and reward models (RMs) to align language models with human preferences.
Instead of focusing solely on PMs and RMs independently, we propose to examine their interactions during fine-tuning.
Our study starts with observing the saturation phenomenon, where continual improvements in RM and PM do not translate into RLHF progress.
Our analysis shows that RMs fail to assign proper scores to PM responses, resulting in a 35% mismatch rate with human preferences, highlighting a significant discrepancy between PM and RM.
arXiv Detail & Related papers (2024-06-12T07:52:17Z)
- ODIN: Disentangled Reward Mitigates Hacking in RLHF [127.35607931337019]
We study the issue of reward hacking on response length, a challenge emerging in Reinforcement Learning from Human Feedback (RLHF).
A well-formatted, verbose but less helpful response from an LLM can often deceive LLM or even human evaluators into assigning high scores.
Our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
arXiv Detail & Related papers (2024-02-11T22:40:12Z)
- WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions.
arXiv Detail & Related papers (2024-01-22T18:27:08Z)
- Confronting Reward Model Overoptimization with Constrained RLHF [114.71591361764547]
We show that the correlation between component RMs has a significant effect on the locations of these overoptimization points.
Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers.
arXiv Detail & Related papers (2023-10-06T16:59:17Z)
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears [69.68672043223249]
InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO).
We propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities.
We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling.
arXiv Detail & Related papers (2023-04-11T15:53:40Z)
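To make the RRHF entry above more concrete, the following is a minimal PyTorch sketch of the kind of pairwise ranking objective it describes, where each sampled response is scored by a length-normalized log-probability under the policy. The tensor names and hinge formulation are assumptions rather than the paper's exact implementation, and the full RRHF loss additionally includes a supervised fine-tuning term on the best-ranked response.
```python
import torch

def ranking_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss over n sampled responses for one prompt.

    logprobs: (n,) length-normalized log-probabilities of each response
              under the policy model.
    rewards:  (n,) preference scores for the same responses (e.g. from an RM).
    Whenever response i is preferred to response j but the policy assigns
    j a higher score, the gap is penalized.
    """
    diff = logprobs.unsqueeze(0) - logprobs.unsqueeze(1)     # diff[i, j] = p_j - p_i
    preferred = rewards.unsqueeze(1) > rewards.unsqueeze(0)  # True where i is preferred over j
    return torch.clamp(diff, min=0.0)[preferred].sum()
```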
This list is automatically generated from the titles and abstracts of the papers on this site.