Confronting Reward Model Overoptimization with Constrained RLHF
- URL: http://arxiv.org/abs/2310.04373v2
- Date: Tue, 10 Oct 2023 15:01:11 GMT
- Title: Confronting Reward Model Overoptimization with Constrained RLHF
- Authors: Ted Moskovitz, Aaditya K. Singh, DJ Strouse, Tuomas Sandholm, Ruslan
Salakhutdinov, Anca D. Dragan, Stephen McAleer
- Abstract summary: We show that correlation between component RMs has a significant effect on the locations of their overoptimization points.
Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers.
- Score: 114.71591361764547
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models are typically aligned with human preferences by
optimizing $\textit{reward models}$ (RMs) fitted to human feedback. However,
human preferences are multi-faceted, and it is increasingly common to derive
reward from a composition of simpler reward models which each capture a
different aspect of language quality. This itself presents a challenge, as it
is difficult to appropriately weight these component RMs when combining them.
Compounding this difficulty, because any RM is only a proxy for human
evaluation, this process is vulnerable to $\textit{overoptimization}$, wherein
past a certain point, accumulating higher reward is associated with worse human
ratings. In this paper, we perform, to our knowledge, the first study on
overoptimization in composite RMs, showing that correlation between component
RMs has a significant effect on the locations of these points. We then
introduce an approach to solve this issue using constrained reinforcement
learning as a means of preventing the agent from exceeding each RM's threshold
of usefulness. Our method addresses the problem of weighting component RMs by
learning dynamic weights, naturally expressed by Lagrange multipliers. As a
result, each RM stays within the range at which it is an effective proxy,
improving evaluation performance. Finally, we introduce an adaptive method
using gradient-free optimization to identify and optimize towards these points
during a single run.
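As a rough illustration of the dynamic-weighting idea, the sketch below performs a projected dual update on per-RM Lagrange multipliers: each multiplier grows while its component RM is below an assumed usefulness threshold and shrinks toward zero once that threshold is reached, so the combined reward stops encouraging further over-optimization of that proxy. The thresholds, learning rate, and update rule here are illustrative assumptions, not the paper's exact algorithm, which embeds the multipliers in a constrained RL loop and, in the adaptive variant, finds the thresholds via gradient-free optimization during a single run.

```python
import numpy as np

def dual_update(lambdas, avg_rewards, thresholds, lr=0.05):
    """Projected dual step for illustrative constraints E[r_i] >= theta_i.

    While component RM i is below its assumed usefulness threshold theta_i,
    its multiplier lambda_i increases and that proxy is weighted more heavily;
    once the threshold is met, lambda_i decays toward zero and the policy is
    no longer rewarded for pushing the proxy further.
    """
    slack = np.asarray(avg_rewards, dtype=float) - np.asarray(thresholds, dtype=float)
    return np.maximum(0.0, np.asarray(lambdas, dtype=float) - lr * slack)

def combined_reward(component_rewards, lambdas):
    """Scalar reward handed to the policy update: component RMs weighted
    by their current Lagrange multipliers."""
    return float(np.dot(lambdas, component_rewards))

# Toy usage with two component RMs and hypothetical thresholds.
lambdas = np.array([1.0, 1.0])
thresholds = [0.6, 0.4]          # assumed points of proxy usefulness
batch_avg_rewards = [0.7, 0.2]   # batch-averaged component RM scores
lambdas = dual_update(lambdas, batch_avg_rewards, thresholds)
# RM 0 is past its threshold, so lambda_0 shrinks; RM 1 is below, so lambda_1 grows.
```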
Related papers
- R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback [25.27230140274847]
Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences.
This paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation.
arXiv Detail & Related papers (2024-11-13T02:45:21Z)
- RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences.
We introduce a causal framework that learns preferences independent of spurious artifacts such as response length and format.
Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z)
- Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [23.27203570485055]
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models with human preferences.
We propose a two-stage approach to train a reward model (RM) with multi-dimensional absolute-rating data.
We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM.
arXiv Detail & Related papers (2024-06-18T17:58:28Z)
- Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
- Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (PCRM) training method to mitigate uncontrolled scaling of reward scores.
PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins.
Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
arXiv Detail & Related papers (2024-04-01T07:49:11Z)
- WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions (a minimal weight-averaging sketch appears after this list).
arXiv Detail & Related papers (2024-01-22T18:27:08Z)
- The Trickle-down Impact of Reward (In-)consistency on RLHF [71.37987812944971]
We show that reward inconsistency exhibits a trickle-down effect on the downstream Reinforcement Learning from Human Feedback process.
We propose Contrast Instructions -- a benchmarking strategy for the consistency of RM.
We show that RLHF models trained with a more consistent RM yield more useful responses.
arXiv Detail & Related papers (2023-09-28T04:05:13Z)
- A Generalised Inverse Reinforcement Learning Framework [24.316047317028147]
The goal of inverse Reinforcement Learning (IRL) is to estimate the unknown cost function of some MDP based on observed trajectories.
We introduce an alternative training loss that puts more weight on future states, which yields a reformulation of the (maximum entropy) IRL problem.
The algorithms we devised exhibit better performance (and similar tractability) than off-the-shelf ones in multiple OpenAI Gym environments.
arXiv Detail & Related papers (2021-05-25T10:30:45Z)
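For the WARM entry above, the following is a minimal sketch of its core operation: weight averaging over several reward models fine-tuned from a shared initialization. The PyTorch module interface, uniform averaging, and parameter handling are assumptions for illustration; WARM's reported gains come from its full training and evaluation procedure, which this sketch does not reproduce.

```python
import copy
import torch

def average_reward_models(reward_models):
    """Return a reward model whose parameters are the element-wise mean of
    the given models' parameters (assumes identical architectures fine-tuned
    from the same initialization, as in weight averaging)."""
    averaged = copy.deepcopy(reward_models[0])
    avg_state = averaged.state_dict()
    for name, param in avg_state.items():
        # Stack the corresponding parameter from every model and average.
        stacked = torch.stack(
            [rm.state_dict()[name].float() for rm in reward_models]
        )
        avg_state[name] = stacked.mean(dim=0).to(param.dtype)
    averaged.load_state_dict(avg_state)
    return averaged
```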