Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback
- URL: http://arxiv.org/abs/2507.15507v1
- Date: Mon, 21 Jul 2025 11:19:04 GMT
- Title: Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback
- Authors: Johannes Ackermann, Takashi Ishida, Masashi Sugiyama
- Abstract summary: We use Reinforcement Learning from Human Feedback (RLHF) to train language models (LMs) to follow complex human preferences. As training progresses, the responses generated by the LM no longer resemble the responses seen by the reward model (RM) during training. We propose Off-Policy Corrected Reward Modeling (OCRM) to correct the RM using importance weighting, without requiring new labels or samples.
- Score: 52.1410307583181
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) allows us to train models, such as language models (LMs), to follow complex human preferences. In RLHF for LMs, we first train an LM using supervised fine-tuning, sample pairs of responses, obtain human feedback, and use the resulting data to train a reward model (RM). RL methods are then used to train the LM to maximize the reward given by the RM. As training progresses, the responses generated by the LM no longer resemble the responses seen by the RM during training, leading to the RM becoming inaccurate. The score given by the RM keeps increasing, but the learned behavior no longer matches the human preferences. This issue is known as overoptimization. We investigate overoptimization from the point of view of distribution shift and show that the shift results in an inconsistent estimate of the RM parameters, leading to an inconsistent estimate of the policy gradient. We propose Off-Policy Corrected Reward Modeling (OCRM), which iteratively off-policy corrects the RM using importance weighting, without requiring new labels or samples. This results in a more accurate RM, which empirically leads to an improved final policy. We validate our approach in experiments with summarization and chatbot datasets and show that it performs significantly better than standard RLHF methods and baselines. Our implementation is available at https://github.com/JohannesAck/OffPolicyCorrectedRewardModeling
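The core correction can be illustrated with a short PyTorch sketch. This is not the authors' released implementation: the function and argument names are illustrative, and it simplifies by weighting each preference pair with a clipped density ratio computed from the preferred response only.

```python
import torch
import torch.nn.functional as F

def importance_weighted_rm_loss(r_chosen, r_rejected,
                                logp_cur_chosen, logp_beh_chosen,
                                clip_max=10.0):
    # Illustrative sketch: Bradley-Terry reward-model loss, reweighted by an
    # importance ratio so the loss is estimated under the current policy's
    # response distribution rather than the behavior policy that generated
    # the labeled pairs. All arguments are 1-D tensors over a batch of pairs.
    # Density ratio w = pi_current(y|x) / pi_behavior(y|x), clipped for stability.
    w = torch.exp(logp_cur_chosen - logp_beh_chosen).clamp(max=clip_max)
    # Standard Bradley-Terry loss per pair: -log sigma(r_chosen - r_rejected).
    bt_loss = -F.logsigmoid(r_chosen - r_rejected)
    # Self-normalized importance weighting.
    return (w * bt_loss).sum() / w.sum()

# Stand-in values only, to show the call signature.
n = 4
loss = importance_weighted_rm_loss(torch.randn(n), torch.randn(n),
                                    torch.randn(n), torch.randn(n))
print(float(loss))
```

Because only the weights change as the policy is updated, each correction step reuses the existing preference labels, matching the abstract's claim of requiring no new labels or samples.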
Related papers
- RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z) - Free Process Rewards without Process Labels [55.14044050782222]
We show that an implicit PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. We show that our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction.
arXiv Detail & Related papers (2024-12-02T21:20:02Z) - Post-hoc Reward Calibration: A Case Study on Length Bias [28.266675778940133]
Reward models (RMs) can develop biases by exploiting spurious correlations in their training data.
These biases can lead to incorrect output rankings, sub-optimal model evaluations, and the amplification of undesirable behaviours.
This paper addresses the challenge of correcting such biases without additional data and training.
arXiv Detail & Related papers (2024-09-25T22:30:42Z) - RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences. We introduce a causal framework that learns preferences independent of these artifacts. Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z) - Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback [8.601283886845664]
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) with human intentions and values.
Despite its effectiveness and popularity, RLHF is prone to biased local optimization.
We propose a novel sequence-to-sequence (seq2seq) reward modeling method.
arXiv Detail & Related papers (2024-08-30T16:14:35Z) - Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [23.27203570485055]
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models with human preferences.
We propose a two-stage approach to train a reward model (RM) with multi-dimensional absolute-rating data.
We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM.
arXiv Detail & Related papers (2024-06-18T17:58:28Z) - WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions.
arXiv Detail & Related papers (2024-01-22T18:27:08Z) - The Trickle-down Impact of Reward (In-)consistency on RLHF [71.37987812944971]
We show that reward inconsistency exhibits a trickle-down effect on the downstream Reinforcement Learning from Human Feedback process.
We propose Contrast Instructions -- a benchmarking strategy for the consistency of RM.
We show that RLHF models trained with a more consistent RM yield more useful responses.
arXiv Detail & Related papers (2023-09-28T04:05:13Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
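For the DPO entry above, a minimal sketch of the published DPO objective may help (standard formulation; argument names are illustrative): the policy's implicit reward is the beta-scaled log-probability ratio against a frozen reference model, plugged into a Bradley-Terry preference loss.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    # Implicit rewards: beta-scaled log-ratios against the frozen reference model.
    chosen_logratio = logp_policy_chosen - logp_ref_chosen
    rejected_logratio = logp_policy_rejected - logp_ref_rejected
    # Bradley-Terry preference loss on the implicit rewards.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Stand-in values only, to show the call signature.
n = 4
print(float(dpo_loss(torch.randn(n), torch.randn(n),
                     torch.randn(n), torch.randn(n))))
```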
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information above and is not responsible for any consequences of its use.