Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown
- URL: http://arxiv.org/abs/2410.00847v2
- Date: Wed, 12 Feb 2025 03:34:29 GMT
- Title: Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown
- Authors: Xingzhou Lou, Dong Yan, Wei Shen, Yuzi Yan, Jian Xie, Junge Zhang
- Abstract summary: We introduce the Uncertainty-aware Reward Model (URM) and its ensemble variant, URME.
URM employs a probabilistic value head to capture aleatoric uncertainty by modeling the distribution of disentangled human preference attributes.
URME further quantifies epistemic uncertainty by examining discrepancies among individual URMs within the ensemble, enabling identification of unreliable evaluations.
- Score: 20.753374166695494
- License:
- Abstract: Reward models (RMs) are essential for aligning large language models (LLMs) with human expectations. However, existing RMs struggle to capture the stochastic and uncertain nature of human preferences and fail to assess the reliability of reward predictions. To address these challenges, we introduce the Uncertainty-aware Reward Model (URM) and its ensemble variant, URME. URM employs a probabilistic value head to capture aleatoric uncertainty by modeling the distribution of disentangled human preference attributes. URME further quantifies epistemic uncertainty by examining discrepancies among individual URMs within the ensemble, enabling identification of unreliable evaluations. Our empirical evaluations demonstrate that URM achieves strong performance on RewardBench, outperforming competitive large-scale models. Additionally, extensive experiments, including best-of-n sampling (BoN), iterative direct preference optimization (iterative DPO), and proximal policy optimization (PPO), demonstrate that URM and URME significantly enhance LLMs' generation quality. Notably, reward predictions with lower uncertainty are far more reliable, demonstrate significantly higher quality, and result in substantially improved alignment.
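For a concrete picture of the two uncertainty signals described in the abstract, the following is a minimal PyTorch-style sketch, not the authors' released code: a probabilistic value head predicts a mean and a variance for each preference attribute, the predicted variance is read as aleatoric uncertainty, and disagreement across an ensemble of such models is read as epistemic uncertainty (URME-style). The attribute weighting, last-token pooling, and all names (`ProbabilisticValueHead`, `URMSketch`, `urme_epistemic_uncertainty`) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class ProbabilisticValueHead(nn.Module):
    """Predicts a Gaussian (mean, log-variance) per preference attribute
    instead of a single scalar score; the variance is the aleatoric part."""

    def __init__(self, hidden_size: int, num_attributes: int):
        super().__init__()
        self.mean = nn.Linear(hidden_size, num_attributes)
        self.log_var = nn.Linear(hidden_size, num_attributes)

    def forward(self, features: torch.Tensor):
        mu = self.mean(features)            # per-attribute means
        var = self.log_var(features).exp()  # per-attribute variances
        return mu, var


class URMSketch(nn.Module):
    """Illustrative uncertainty-aware reward model: an LLM backbone plus a
    probabilistic value head over disentangled preference attributes."""

    def __init__(self, backbone: nn.Module, hidden_size: int, num_attributes: int):
        super().__init__()
        self.backbone = backbone  # assumed to return an output with .last_hidden_state
        self.head = ProbabilisticValueHead(hidden_size, num_attributes)
        # Illustrative fixed attribute weights; the paper's combination scheme may differ.
        self.register_buffer("weights", torch.full((num_attributes,), 1.0 / num_attributes))

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, -1, :]  # last-token pooling (assumes no right padding)
        mu, var = self.head(pooled)
        reward = (mu * self.weights).sum(-1)           # scalar reward
        aleatoric = (self.weights**2 * var).sum(-1)    # variance of the weighted sum (independence assumed)
        return reward, aleatoric


def urme_epistemic_uncertainty(models, input_ids, attention_mask):
    """URME-style check: disagreement among ensemble members' rewards is read
    as epistemic uncertainty; a large spread flags an unreliable evaluation."""
    with torch.no_grad():
        rewards = torch.stack(
            [m(input_ids, attention_mask)[0] for m in models], dim=0
        )  # shape: (num_models, batch)
    mean_reward = rewards.mean(dim=0)
    disagreement = rewards.max(dim=0).values - rewards.min(dim=0).values
    return mean_reward, disagreement
```

In a best-of-n or filtering loop, one would keep only candidates whose ensemble disagreement falls below a chosen threshold, which matches the abstract's observation that low-uncertainty reward predictions are the more reliable ones.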
Related papers
- Uncertainty-Aware Step-wise Verification with Generative Reward Models [42.17917357636397]
We propose leveraging uncertainty quantification (UQ) to enhance the reliability of step-wise verification with generative reward models.
We introduce CoT Entropy, a novel UQ method that outperforms existing approaches in quantifying a PRM's uncertainty in step-wise verification.
arXiv Detail & Related papers (2025-02-16T20:00:56Z)
- The Lessons of Developing Process Reward Models in Mathematical Reasoning [62.165534879284735]
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in the reasoning processes.
We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with Large Language Models (LLMs).
We release a new state-of-the-art PRM that outperforms existing open-source alternatives.
arXiv Detail & Related papers (2025-01-13T13:10:16Z)
- RMB: Comprehensively Benchmarking Reward Models in LLM Alignment [44.84304822376291]
Reward models (RMs) guide the alignment of large language models (LLMs).
We propose RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios.
Based on our benchmark, we conduct extensive analysis on the state-of-the-art RMs.
arXiv Detail & Related papers (2024-10-13T16:06:54Z)
- Reward-Robust RLHF in LLMs [25.31456438114974]
Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence.
The reliance on reward-model-based (RM-based) alignment methods introduces significant challenges.
We introduce a reward-robust RLHF framework aimed at addressing these fundamental challenges.
arXiv Detail & Related papers (2024-09-18T02:35:41Z)
- Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization [9.618391485742968]
Iterative preference optimization has recently become one of the de facto training paradigms for large language models (LLMs).
We present an uncertainty-enhanced Preference Optimization framework to make the LLM self-evolve with reliable feedback.
Our framework substantially alleviates the problem of noisy preference data and improves the performance of iterative preference optimization.
arXiv Detail & Related papers (2024-09-17T14:05:58Z)
- Semi-Supervised Reward Modeling via Iterative Self-Training [52.48668920483908]
We propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data.
We demonstrate that SSRM significantly improves reward models without incurring additional labeling costs.
Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
arXiv Detail & Related papers (2024-09-10T22:57:58Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied to either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Measuring and Modeling Uncertainty Degree for Monocular Depth Estimation [50.920911532133154]
The intrinsic ill-posedness and ordinal-sensitive nature of monocular depth estimation (MDE) models pose major challenges to the estimation of uncertainty degree.
We propose to model the uncertainty of MDE models from the perspective of the inherent probability distributions.
By simply introducing additional training regularization terms, our model, with a surprisingly simple formulation and without requiring extra modules or multiple inferences, can provide uncertainty estimates with state-of-the-art reliability.
arXiv Detail & Related papers (2023-07-19T12:11:15Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Providing Reliability in Recommender Systems through Bernoulli Matrix Factorization [63.732639864601914]
This paper proposes Bernoulli Matrix Factorization (BeMF) to provide both prediction values and reliability values.
BeMF acts on model-based collaborative filtering rather than on memory-based filtering.
The more reliable a prediction is, the less liable it is to be wrong.
arXiv Detail & Related papers (2020-06-05T14:24:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.