Probabilistic Uncertain Reward Model: A Natural Generalization of Bradley-Terry Reward Model
- URL: http://arxiv.org/abs/2503.22480v4
- Date: Tue, 29 Apr 2025 08:41:59 GMT
- Title: Probabilistic Uncertain Reward Model: A Natural Generalization of Bradley-Terry Reward Model
- Authors: Wangtao Sun, Xiang Cheng, Xing Yu, Haotian Xu, Zhao Yang, Shizhu He, Jun Zhao, Kang Liu
- Abstract summary: We propose a Probabilistic Uncertain Reward Model (PURM) to address reward hacking. We show that PURM effectively models the rewards and uncertainties, and significantly delays the onset of reward hacking.
- Score: 27.40414952747553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical technique for training large language models. However, reward hacking, a phenomenon in which models exploit flaws in the reward model, remains a significant barrier to achieving robust and scalable intelligence through long-term training. Existing studies have proposed uncertain reward models to address reward hacking; however, they often lack systematic or theoretical foundations, fail to model the uncertainty intrinsically emerging from preference data, and thus cannot sufficiently mitigate reward hacking to sustain prolonged RLHF training and exploration. In this paper, we propose a Probabilistic Uncertain Reward Model (PURM), a natural generalization of the classical Bradley-Terry reward model that directly learns the reward distribution emerging from preference data. We theoretically derive PURM's loss function and the calculation of reward-distribution uncertainty based on the Bhattacharyya coefficient. To mitigate reward hacking with PURM, we further introduce an uncertainty-aware penalty into Proximal Policy Optimization (PPO), which leverages the learned uncertainty to dynamically balance reward optimization and exploration. We propose a lightweight and easy-to-use implementation of PURM. Experiments demonstrate that PURM effectively models the rewards and uncertainties, and significantly delays the onset of reward hacking while improving final reward performance compared with existing methods.
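To make the recipe in the abstract concrete, here is a minimal PyTorch-style sketch, written under stated assumptions rather than from the authors' code: each reward is modeled as a 1-D Gaussian N(mu, sigma^2) predicted by a two-headed reward model, the Bradley-Terry preference likelihood is estimated by Monte-Carlo sampling of rewards, and uncertainty is taken as the Bhattacharyya coefficient (distribution overlap) between two reward distributions. The function names, the sampling estimator, and the penalty form are illustrative assumptions, not the paper's implementation.

```python
# Illustrative PURM-style sketch (assumptions noted above; not the authors' code).
import torch
import torch.nn.functional as F

def purm_loss(mu_w, logvar_w, mu_l, logvar_l, n_samples=8):
    """Monte-Carlo Bradley-Terry loss when rewards are Gaussian.

    mu_*, logvar_*: (batch,) tensors for the chosen (w) and rejected (l) responses,
    produced by a reward model with a (mean, log-variance) output head.
    """
    std_w, std_l = (0.5 * logvar_w).exp(), (0.5 * logvar_l).exp()
    eps_w = torch.randn(n_samples, *mu_w.shape, device=mu_w.device)
    eps_l = torch.randn(n_samples, *mu_l.shape, device=mu_l.device)
    r_w = mu_w + std_w * eps_w              # reparameterized reward samples
    r_l = mu_l + std_l * eps_l
    return -F.logsigmoid(r_w - r_l).mean()  # BT log-likelihood averaged over samples

def bhattacharyya_coefficient(mu1, var1, mu2, var2):
    """Overlap of two 1-D Gaussians: 1 = identical, ~0 = well separated.
    BC = exp(-D_B), with
    D_B = (mu1-mu2)^2 / (4*(var1+var2)) + 0.5*ln((var1+var2) / (2*sqrt(var1*var2)))."""
    d_b = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2) \
        + 0.5 * torch.log((var1 + var2) / (2.0 * torch.sqrt(var1 * var2)))
    return torch.exp(-d_b)

def uncertainty_penalized_reward(mu, uncertainty, beta=1.0):
    """One plausible uncertainty-aware penalty for PPO: subtract a scaled uncertainty
    term from the mean reward before it enters the policy update."""
    return mu - beta * uncertainty
```

As the standard deviations shrink toward zero, the sampled loss collapses to the classical Bradley-Terry objective, which is the sense in which a probabilistic reward head generalizes it; the value of `beta` and how per-sample uncertainty is aggregated across comparisons are details the paper specifies and this sketch does not.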
Related papers
- Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning [25.817231106021552]
Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks.
However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning.
In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning.
arXiv Detail & Related papers (2025-04-21T17:59:02Z)
- Adversarial Training of Reward Models [74.17196154247964]
We introduce Adv-RM, a novel adversarial training framework that automatically identifies adversarial examples.
By leveraging reinforcement learning, Adv-RM trains a policy to expose vulnerabilities in large state-of-the-art reward models.
We demonstrate that Adv-RM significantly outperforms conventional reward training.
arXiv Detail & Related papers (2025-04-08T15:38:25Z)
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [54.4392552373835]
Reward models (RMs) are crucial for the training and inference-time scaling of large language models (LLMs). We propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals to provide reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference-time best-of-n searches on real-world downstream tasks.
arXiv Detail & Related papers (2025-02-26T17:19:12Z)
- Towards Reliable Alignment: Uncertainty-aware RLHF [14.20181662644689]
We show that the fluctuation of reward models can be detrimental to the alignment problem.
We show that such policies are more risk-averse in the sense that they are more cautious of uncertain rewards.
We use this ensemble of reward models to align a language model using our methodology and observe that our empirical findings match our theoretical predictions.
arXiv Detail & Related papers (2024-10-31T08:26:51Z)
- Evaluating Robustness of Reward Models for Mathematical Reasoning [14.97819343313859]
We introduce a new design for the reliable evaluation of reward models and, to validate it, construct RewardMATH.
We demonstrate that the scores on RewardMATH strongly correlate with the results of optimized policy and effectively estimate reward overoptimization.
arXiv Detail & Related papers (2024-10-02T16:39:58Z)
- The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret [64.04721528586747]
We show that a sufficiently low expected test error of the reward model guarantees low worst-case regret.
We then show that similar problems persist even when using policy regularization techniques.
arXiv Detail & Related papers (2024-06-22T06:43:51Z)
- Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs [25.011675414622392]
This study introduces a novel approach to enhance the reward model's generalization ability against distribution shifts.
We retain the base model's language model head and incorporate a suite of text-generation losses to preserve the hidden states' text-generation capabilities.
Our experimental results demonstrate that the introduced regularization technique markedly improves the accuracy of learned reward models.
arXiv Detail & Related papers (2024-06-14T17:49:59Z)
- Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data. We analyze this phenomenon and use distillation to get a better proxy for the true preference distribution over generation pairs. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling [66.3072381478251]
Reward hacking, also termed reward overoptimization, remains a critical challenge.
We propose a framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective.
We show that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets.
arXiv Detail & Related papers (2024-02-14T17:49:07Z)
- Stabilizing RLHF through Advantage Model and Selective Rehearsal [57.504894664689]
Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences remains a significant challenge.
This challenge is characterized by various instabilities, such as reward hacking and catastrophic forgetting.
We propose two innovations to stabilize RLHF training: 1) Advantage Model, which directly models advantage score and regulates score distributions across tasks to prevent reward hacking; and 2) Selective Rehearsal, which mitigates catastrophic forgetting by strategically selecting data for PPO training and knowledge rehearsing.
arXiv Detail & Related papers (2023-09-18T23:06:32Z)
- Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of DRE-MARL is demonstrated on benchmark multi-agent scenarios against SOTA baselines in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)