It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF
- URL: http://arxiv.org/abs/2406.07971v2
- Date: Thu, 13 Jun 2024 05:13:50 GMT
- Title: It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF
- Authors: Taiming Lu, Lingfeng Shen, Xinyu Yang, Weiting Tan, Beidi Chen, Huaxiu Yao
- Abstract summary: Reinforcement Learning from Human Feedback involves training policy models (PMs) and reward models (RMs) to align language models with human preferences.
Instead of focusing solely on PMs and RMs independently, we propose to examine their interactions during fine-tuning.
Our study starts with observing the saturation phenomenon, where continual improvements in RM and PM do not translate into RLHF progress.
Our analysis shows that RMs fail to assign proper scores to PM responses, resulting in a 35% mismatch rate with human preferences, highlighting a significant discrepancy between PM and RM.
- Score: 33.197077764166536
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) involves training policy models (PMs) and reward models (RMs) to align language models with human preferences. Instead of focusing solely on PMs and RMs independently, we propose to examine their interactions during fine-tuning, introducing the concept of seamlessness. Our study starts with observing the saturation phenomenon, where continual improvements in RM and PM do not translate into RLHF progress. Our analysis shows that RMs fail to assign proper scores to PM responses, resulting in a 35% mismatch rate with human preferences, highlighting a significant discrepancy between PM and RM. To measure seamlessness between PM and RM without human effort, we propose an automatic metric, SEAM. SEAM quantifies the discrepancies between PM and RM judgments induced by data samples. We validate the effectiveness of SEAM in data selection and model augmentation. Our experiments demonstrate that (1) using SEAM-filtered data for RL training improves RLHF performance by 4.5%, and (2) SEAM-guided model augmentation results in a 4% performance improvement over standard augmentation methods.
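To make the data-selection use concrete, below is a minimal, hypothetical sketch of a SEAM-style filter. It assumes the PM-RM discrepancy for a prompt can be approximated by rank disagreement between the policy model's own sequence log-probabilities and the reward model's scores over sampled responses; `sample_responses`, `pm_logprob`, and `rm_score` are placeholder callables, and the threshold is illustrative rather than the paper's definition.

```python
# Hypothetical sketch of a SEAM-style data filter (not the paper's exact metric).
# Assumption: the PM-RM discrepancy for a prompt is approximated by the rank
# disagreement between the policy model's own sequence log-probabilities and the
# reward model's scores over a set of sampled responses.
from itertools import combinations
from typing import Callable, List

def pairwise_disagreement(pm_scores: List[float], rm_scores: List[float]) -> float:
    """Fraction of response pairs that PM and RM order differently."""
    pairs = list(combinations(range(len(pm_scores)), 2))
    if not pairs:
        return 0.0
    flips = sum(
        1 for i, j in pairs
        if (pm_scores[i] - pm_scores[j]) * (rm_scores[i] - rm_scores[j]) < 0
    )
    return flips / len(pairs)

def seam_like_filter(
    prompts: List[str],
    sample_responses: Callable[[str], List[str]],   # PM sampler (placeholder)
    pm_logprob: Callable[[str, str], float],        # log p_PM(response | prompt)
    rm_score: Callable[[str, str], float],          # RM scalar reward
    threshold: float = 0.35,                        # illustrative, echoing the 35% mismatch rate
) -> List[str]:
    """Keep prompts on which PM and RM judgments largely agree."""
    kept = []
    for prompt in prompts:
        responses = sample_responses(prompt)
        pm_scores = [pm_logprob(prompt, r) for r in responses]
        rm_scores = [rm_score(prompt, r) for r in responses]
        if pairwise_disagreement(pm_scores, rm_scores) <= threshold:
            kept.append(prompt)
    return kept
```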
Related papers
- Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback [52.1410307583181]
We use Reinforcement Learning from Human Feedback (RLHF) to train language models (LMs) to follow complex human preferences. As training progresses, the responses generated by the LM no longer resemble the responses seen by the reward model (RM). We propose Off-Policy Corrected Reward Modeling to correct the RM using importance weighting, without requiring new labels or samples.
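The summary above names importance weighting as the correction; the following is a hedged sketch of an importance-weighted Bradley-Terry loss, assuming each preference pair carries response log-probabilities under the current policy and under the policy that produced the RM's training data (the clipping value is an assumption, not the paper's setting).

```python
# Hypothetical sketch of importance-weighted reward-model training
# (the paper's exact correction may differ).
import torch
import torch.nn.functional as F

def off_policy_corrected_rm_loss(
    r_chosen: torch.Tensor,        # RM scores for preferred responses, shape (B,)
    r_rejected: torch.Tensor,      # RM scores for rejected responses, shape (B,)
    logp_current: torch.Tensor,    # log pi_current(y|x) for the scored responses, shape (B,)
    logp_behavior: torch.Tensor,   # log pi_data(y|x) under the data-collection policy, shape (B,)
    clip: float = 10.0,            # clip importance weights for stability (assumed)
) -> torch.Tensor:
    # Bradley-Terry loss per preference pair
    bt_loss = -F.logsigmoid(r_chosen - r_rejected)
    # importance weights correct the mismatch between data and current policy
    weights = torch.exp(logp_current - logp_behavior).clamp(max=clip)
    return (weights * bt_loss).mean()
```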
arXiv Detail & Related papers (2025-07-21T11:19:04Z)
- ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [56.32212611983997]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM to evaluate trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
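A minimal sketch of how step-level and trajectory-level supervision might be combined into a single reward; the weighted-average aggregation and the `alpha` hyperparameter are assumptions, not the paper's formulation.

```python
# Hypothetical sketch of combining step-level and trajectory-level supervision
# into one reward signal (the weighting scheme here is an assumption).
from typing import List

def combined_prm_reward(
    step_rewards: List[float],   # one score per reasoning step
    trajectory_reward: float,    # one score for the whole trajectory-response pair
    alpha: float = 0.5,          # mixing weight (assumed hyperparameter)
) -> float:
    step_term = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    return alpha * step_term + (1.0 - alpha) * trajectory_reward
```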
arXiv Detail & Related papers (2025-06-23T17:59:02Z)
- Mutual-Taught for Co-adapting Policy and Reward Models [43.11214888109746]
We propose Mutual-Taught, a self-training method that iteratively improves both the policy model and the reward model. Experimental results demonstrate that this iterative approach leads to consistent improvements in both models.
arXiv Detail & Related papers (2025-05-17T04:34:23Z)
- On the Robustness of Reward Models for Language Model Alignment [9.804782604188656]
We study the cause of over-optimization in reward models trained with the Bradley-Terry (BT) model. We show that the excessive dispersion of hidden state norms is the main source of over-optimization. Applying BSR to high-quality data and models surpasses state-of-the-art RMs at the 8B scale.
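For reference, below is the standard Bradley-Terry pairwise loss used in such reward models, plus a hypothetical batch-level regularizer that keeps reward outputs centered (a stand-in for BSR; the paper's exact regularizer and coefficient are not reproduced here).

```python
# Standard Bradley-Terry pairwise loss for reward models, plus a hypothetical
# batch-level regularizer penalizing uncentered reward outputs (assumed form).
import torch
import torch.nn.functional as F

def bt_loss_with_batch_reg(
    r_chosen: torch.Tensor,    # shape (B,)
    r_rejected: torch.Tensor,  # shape (B,)
    reg_coef: float = 0.01,    # assumed coefficient
) -> torch.Tensor:
    bt = -F.logsigmoid(r_chosen - r_rejected).mean()
    all_rewards = torch.cat([r_chosen, r_rejected])
    # keep the batch of rewards centered, limiting uncontrolled score drift
    reg = all_rewards.mean().pow(2)
    return bt + reg_coef * reg
```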
arXiv Detail & Related papers (2025-05-12T06:48:26Z)
- RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z)
- The Lessons of Developing Process Reward Models in Mathematical Reasoning [62.165534879284735]
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in the reasoning processes.
We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with Large Language Models (LLMs).
We release a new state-of-the-art PRM that outperforms existing open-source alternatives.
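A hedged sketch of consensus filtering in this spirit: a step label is kept only when the Monte Carlo estimate and an LLM judge agree. `mc_correct` and `llm_judge_correct` are placeholder callables, not the paper's implementation.

```python
# Hypothetical sketch of consensus filtering for process-reward data:
# keep a step label only when the MC estimate and an LLM judge agree.
from typing import Callable, List, Tuple

def consensus_filter(
    steps: List[str],
    mc_correct: Callable[[str], bool],         # MC-estimated step correctness (placeholder)
    llm_judge_correct: Callable[[str], bool],  # LLM-as-judge verdict (placeholder)
) -> List[Tuple[str, bool]]:
    filtered = []
    for step in steps:
        mc, judge = mc_correct(step), llm_judge_correct(step)
        if mc == judge:                 # consensus reached: keep the label
            filtered.append((step, mc))
    return filtered
```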
arXiv Detail & Related papers (2025-01-13T13:10:16Z)
- Self-Evolved Reward Learning for LLMs [45.6910747154447]
Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences.
We propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself.
Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance.
arXiv Detail & Related papers (2024-11-01T07:29:03Z)
- RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences.
We introduce a causal framework that learns preferences independent of these artifacts.
Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z)
- Semi-Supervised Reward Modeling via Iterative Self-Training [52.48668920483908]
We propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data.
We demonstrate that SSRM significantly improves reward models without incurring additional labeling costs.
Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
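A hypothetical sketch of the self-training step: the current RM pseudo-labels unlabeled response pairs and only confident pairs are added back to the training set. `rm_score` is a placeholder callable and the confidence threshold is illustrative.

```python
# Hypothetical sketch of semi-supervised reward modeling by self-training:
# the current RM pseudo-labels unlabeled response pairs and confident pairs
# become new preference data for retraining.
import math
from typing import Callable, List, Tuple

def pseudo_label_pairs(
    unlabeled_pairs: List[Tuple[str, str, str]],   # (prompt, response_a, response_b)
    rm_score: Callable[[str, str], float],         # placeholder RM scoring call
    confidence_threshold: float = 0.9,             # illustrative threshold
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples the current RM is confident about."""
    labeled = []
    for prompt, a, b in unlabeled_pairs:
        ra, rb = rm_score(prompt, a), rm_score(prompt, b)
        p_a_preferred = 1.0 / (1.0 + math.exp(-(ra - rb)))   # Bradley-Terry probability
        if p_a_preferred >= confidence_threshold:
            labeled.append((prompt, a, b))
        elif p_a_preferred <= 1.0 - confidence_threshold:
            labeled.append((prompt, b, a))
    return labeled
```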
arXiv Detail & Related papers (2024-09-10T22:57:58Z)
- Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [23.27203570485055]
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models with human preferences.
We propose a two-stage approach to train a reward model (RM) with multi-dimensional absolute-rating data.
We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM.
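A hedged sketch of this architecture: per-objective reward heads plus a shallow MLP gating network that mixes them into one scalar. Layer sizes and the number of objectives are assumptions, not the released model's configuration.

```python
# Hypothetical sketch of a multi-objective RM with a shallow MLP gating network
# that mixes per-objective rewards into one scalar (sizes are assumptions).
import torch
import torch.nn as nn

class GatedMultiObjectiveRM(nn.Module):
    def __init__(self, hidden_size: int = 4096, num_objectives: int = 19):
        super().__init__()
        # per-objective absolute-rating heads on top of frozen LM embeddings
        self.reward_heads = nn.Linear(hidden_size, num_objectives)
        # shallow gating MLP producing mixture weights conditioned on the prompt
        self.gate = nn.Sequential(
            nn.Linear(hidden_size, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_objectives),
        )

    def forward(self, prompt_emb: torch.Tensor, response_emb: torch.Tensor) -> torch.Tensor:
        objective_rewards = self.reward_heads(response_emb)          # (B, K)
        weights = torch.softmax(self.gate(prompt_emb), dim=-1)       # (B, K)
        return (weights * objective_rewards).sum(dim=-1)             # (B,)
```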
arXiv Detail & Related papers (2024-06-18T17:58:28Z)
- Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (PCRM) training method to mitigate this problem.
PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins.
Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
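A hypothetical sketch of how such prior constraints might enter training: the target score margin of each pair is modulated by the length ratio and cosine similarity of its outputs. The exact mapping from constraints to margins is an assumption.

```python
# Hypothetical sketch of prior-constrained reward training: the score margin of
# each comparison pair is modulated by the length ratio and cosine similarity of
# its two outputs (the mapping from constraints to margins is an assumption).
import torch
import torch.nn.functional as F

def pcrm_like_loss(
    r_chosen: torch.Tensor,       # (B,)
    r_rejected: torch.Tensor,     # (B,)
    len_chosen: torch.Tensor,     # token counts, (B,)
    len_rejected: torch.Tensor,   # token counts, (B,)
    cos_sim: torch.Tensor,        # cosine similarity between the two outputs, (B,)
) -> torch.Tensor:
    lc, lr = len_chosen.float(), len_rejected.float()
    length_ratio = torch.minimum(lc, lr) / torch.maximum(lc, lr)
    # heuristic: more similar outputs (high cosine similarity) shrink the target margin
    target_margin = (1.0 - cos_sim) * length_ratio
    return -F.logsigmoid(r_chosen - r_rejected - target_margin).mean()
```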
arXiv Detail & Related papers (2024-04-01T07:49:11Z)
- WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions.
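A minimal sketch of the core operation, weight-averaging several reward models fine-tuned from the same initialization; it assumes the checkpoints share identical state-dict keys.

```python
# Minimal sketch of weight-averaging several reward models fine-tuned from the
# same initialization (WARM-style); assumes identical state-dict keys.
from typing import Dict, List
import torch

def average_reward_models(state_dicts: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return averaged
```

Averaging in weight space, rather than ensembling predictions, keeps inference cost equal to that of a single reward model.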
arXiv Detail & Related papers (2024-01-22T18:27:08Z)
- The Trickle-down Impact of Reward (In-)consistency on RLHF [71.37987812944971]
We show that reward inconsistency exhibits a trickle-down effect on the downstream Reinforcement Learning from Human Feedback process.
We propose Contrast Instructions -- a benchmarking strategy for the consistency of RMs.
We show that RLHF models trained with a more consistent RM yield more useful responses.
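A hedged sketch of a contrastive consistency check in this spirit: given two instructions with their own matched responses, a consistent RM should prefer each instruction's matched response over the other's. `rm_score` is a placeholder callable; the construction of contrast pairs follows the benchmark's idea only loosely.

```python
# Hypothetical sketch of a contrastive consistency check for reward models:
# a consistent RM prefers each instruction's matched response over the other's.
from typing import Callable, List, Tuple

def rm_consistency_rate(
    contrast_pairs: List[Tuple[str, str, str, str]],   # (instr_a, resp_a, instr_b, resp_b)
    rm_score: Callable[[str, str], float],             # placeholder RM scoring call
) -> float:
    consistent = 0
    for instr_a, resp_a, instr_b, resp_b in contrast_pairs:
        ok_a = rm_score(instr_a, resp_a) > rm_score(instr_a, resp_b)
        ok_b = rm_score(instr_b, resp_b) > rm_score(instr_b, resp_a)
        consistent += int(ok_a and ok_b)
    return consistent / len(contrast_pairs) if contrast_pairs else 0.0
```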
arXiv Detail & Related papers (2023-09-28T04:05:13Z)