On the Robustness of Reward Models for Language Model Alignment
- URL: http://arxiv.org/abs/2505.07271v1
- Date: Mon, 12 May 2025 06:48:26 GMT
- Title: On the Robustness of Reward Models for Language Model Alignment
- Authors: Jiwoo Hong, Noah Lee, Eunki Kim, Guijin Son, Woojin Chung, Aman Gupta, Shao Tang, James Thorne
- Abstract summary: We study the cause of over-optimization in reward models trained with the Bradley-Terry (BT) model. We show that the excessive dispersion of hidden state norms is the main source of over-optimization. We apply BSR to high-quality data and models, which surpasses state-of-the-art RMs at the 8B scale.
- Score: 9.804782604188656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Bradley-Terry (BT) model is widely used in reward modeling for reinforcement learning with human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with the BT model loss are prone to over-optimization, losing generalizability to unseen input distributions. In this paper, we study the cause of over-optimization in RM training and its downstream effects on the RLHF procedure, underscoring the importance of distributional robustness of RMs on unseen data. First, we show that the excessive dispersion of hidden state norms is the main source of over-optimization. Then, we propose batch-wise sum-to-zero regularization (BSR) to enforce a zero-centered reward sum per batch, constraining rewards with extreme magnitudes. We assess the impact of BSR on RM robustness through four scenarios of over-optimization, where BSR consistently shows better robustness. Subsequently, we compare the plain BT model and BSR on RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR to high-quality data and models, surpassing state-of-the-art RMs at the 8B scale by more than 5% on complex preference prediction tasks. By conducting RLOO training with 8B RMs, we reduce generation length on AlpacaEval 2.0 by 40% while achieving a 7% increase in win rate, further highlighting that robustness in RMs induces robustness in RLHF training. We release the code, data, and models: https://github.com/LinkedIn-XFACT/RM-Robustness.
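The abstract describes BSR only at a high level. As a rough sketch (not the paper's implementation), the snippet below adds a penalty on the squared per-batch reward sum to the standard Bradley-Terry loss; the function name, the exact form of the penalty, and the `bsr_coef` coefficient are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def bt_loss_with_bsr(reward_chosen: torch.Tensor,
                     reward_rejected: torch.Tensor,
                     bsr_coef: float = 0.01) -> torch.Tensor:
    """Bradley-Terry loss with a batch-wise sum-to-zero penalty (illustrative only).

    reward_chosen, reward_rejected: shape (batch,), scalar rewards from the RM.
    bsr_coef: regularization strength (hypothetical name and value).
    """
    # Standard BT objective: maximize the log-likelihood that chosen > rejected.
    bt = -F.logsigmoid(reward_chosen - reward_rejected).mean()

    # Batch-wise sum-to-zero regularization (assumed form): penalize the squared
    # sum of all rewards in the batch, pushing the per-batch reward sum toward
    # zero and discouraging rewards with extreme magnitudes.
    batch_sum = torch.cat([reward_chosen, reward_rejected]).sum()
    bsr_penalty = bsr_coef * batch_sum.pow(2)

    return bt + bsr_penalty
```

In practice the penalty might be normalized by batch size or applied per response group; consult the paper and released code for the exact formulation.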
Related papers
- Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback [52.1410307583181]
We use Reinforcement Learning from Human Feedback (RLHF) to train language models (LMs) to follow complex human preferences. As training progresses, the responses generated by the LM no longer resemble the responses seen by the reward model (RM) during training. We propose Off-Policy Corrected Reward Modeling to correct the RM using importance weighting, without requiring new labels or samples.
arXiv Detail & Related papers (2025-07-21T11:19:04Z)
- ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [56.32212611983997]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM for evaluating trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
arXiv Detail & Related papers (2025-06-23T17:59:02Z)
- Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models [50.4652276723694]
Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities. Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%.
arXiv Detail & Related papers (2025-05-22T05:56:11Z)
- R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning [22.167272219418845]
Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). We propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks.
arXiv Detail & Related papers (2025-05-05T17:59:50Z)
- RM-R1: Reward Modeling as Reasoning [81.50471199906738]
We introduce a new class of generative reward models -- Reward Reasoning Models (ReasRMs). We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art or near state-of-the-art performance among generative RMs across multiple benchmarks.
arXiv Detail & Related papers (2025-05-05T06:11:12Z)
- Energy-Based Reward Models for Robust Language Model Alignment [9.843359827321194]
We introduce the Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework for Reward Models (RMs). EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations. Empirical evaluations demonstrate significant improvements in robustness and generalization, achieving up to a 5.97% improvement on safety-critical alignment tasks.
arXiv Detail & Related papers (2025-04-17T17:47:15Z)
- Adversarial Training of Reward Models [74.17196154247964]
We introduce Adv-RM, a novel adversarial training framework that automatically identifies adversarial examples. By leveraging reinforcement learning, Adv-RM trains a policy to expose vulnerabilities in large state-of-the-art reward models. We demonstrate that Adv-RM significantly outperforms conventional reward training.
arXiv Detail & Related papers (2025-04-08T15:38:25Z)
- RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences. We introduce a causal framework that learns preferences independent of these artifacts. Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z)
- WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions.
arXiv Detail & Related papers (2024-01-22T18:27:08Z)
- The Trickle-down Impact of Reward (In-)consistency on RLHF [71.37987812944971]
We show that reward inconsistency exhibits a trickle-down effect on the downstream Reinforcement Learning from Human Feedback process.
We propose Contrast Instructions -- a benchmarking strategy for measuring the consistency of RMs.
We show that RLHF models trained with a more consistent RM yield more useful responses.
arXiv Detail & Related papers (2023-09-28T04:05:13Z)