Establishing Reliability Metrics for Reward Models in Large Language Models
- URL: http://arxiv.org/abs/2504.14838v1
- Date: Mon, 21 Apr 2025 03:39:33 GMT
- Title: Establishing Reliability Metrics for Reward Models in Large Language Models
- Authors: Yizhou Chen, Yawen Liu, Xuesi Wang, Qingtao Yu, Guangda Huzhang, Anxiang Zeng, Han Yu, Zhiming Zhou
- Abstract summary: The reward model (RM) that represents human preferences plays a crucial role in optimizing the outputs of large language models (LLMs). We propose the Reliable at $\eta$ (RETA) metric to measure the reliability of RMs. On top of RETA, we present an integrated benchmarking pipeline that allows anyone to evaluate their own RM without incurring additional oracle labeling costs.
- Score: 17.26528659228218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reward model (RM) that represents human preferences plays a crucial role in optimizing the outputs of large language models (LLMs), e.g., through reinforcement learning from human feedback (RLHF) or rejection sampling. However, a long-standing challenge for RMs is their uncertain reliability, i.e., LLM outputs with higher rewards may not align with actual human preferences. Currently, there is no convincing metric for quantifying the reliability of RMs. To bridge this gap, we propose the \textit{\underline{R}eliable at \underline{$\eta$}} (RETA) metric, which directly measures the reliability of an RM by evaluating the average quality (scored by an oracle) of the top $\eta$ quantile of responses as ranked by the RM. On top of RETA, we present an integrated benchmarking pipeline that allows anyone to evaluate their own RM without incurring additional oracle labeling costs. Extensive experimental studies demonstrate the superior stability of the RETA metric, providing solid evaluations of the reliability of various publicly available and proprietary RMs. When dealing with an unreliable RM, the RETA metric can also identify the optimal quantile from which to select responses.
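Based only on the definition in the abstract, here is a minimal sketch of how a RETA score might be computed; the function names, the per-prompt aggregation, and the quantile rounding are illustrative assumptions rather than the paper's reference pipeline (which additionally avoids extra oracle labeling costs):

```python
import numpy as np

def reta_at_eta(rm_scores, oracle_scores, eta):
    """Illustrative RETA for one prompt: average oracle quality of the
    top-eta quantile of candidate responses as ranked by the reward model.

    rm_scores, oracle_scores: 1-D arrays aligned over the same responses.
    eta: quantile in (0, 1], e.g. 0.2 keeps the RM's top 20% of responses.
    """
    rm_scores = np.asarray(rm_scores, dtype=float)
    oracle_scores = np.asarray(oracle_scores, dtype=float)
    k = max(1, int(np.ceil(eta * len(rm_scores))))   # size of the top-eta slice
    top_idx = np.argsort(rm_scores)[::-1][:k]        # responses the RM ranks highest
    return float(oracle_scores[top_idx].mean())      # oracle quality of that slice

def reta_curve(per_prompt_scores, etas=(0.1, 0.2, 0.5, 1.0)):
    """Average RETA over a list of (rm_scores, oracle_scores) pairs, one per prompt."""
    return {eta: float(np.mean([reta_at_eta(rm, oracle, eta)
                                for rm, oracle in per_prompt_scores]))
            for eta in etas}
```

Sweeping $\eta$ over a grid in this way mirrors the abstract's suggested use of RETA to pick the quantile from which to select responses when the RM is unreliable.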
Related papers
- Energy-Based Reward Models for Robust Language Model Alignment [9.843359827321194]
We introduce the Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework for Reward Models (RMs). EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations. Empirical evaluations demonstrate significant improvements in robustness and generalization, achieving up to a 5.97% improvement in safety-critical alignment tasks.
arXiv Detail & Related papers (2025-04-17T17:47:15Z)
- Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness [61.87055159919641]
Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities.
Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality.
We introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM).
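A hypothetical sketch of how such corruptions might be simulated for a dictionary of modality arrays; the exact EMM/RMM/NM protocols (which modalities are dropped, the noise model, severity levels) are assumptions not specified in this summary:

```python
import numpy as np

def corrupt(sample, scenario, rng, missing="depth", drop_prob=0.5, noise_std=0.1):
    """Hypothetical corruption of a dict {modality_name: array}; the scenario
    semantics below are illustrative assumptions, not the benchmark's spec."""
    out = {name: np.array(x, dtype=float, copy=True) for name, x in sample.items()}
    if scenario == "EMM":            # Entire-Missing Modality: one modality absent everywhere
        out[missing] = np.zeros_like(out[missing])
    elif scenario == "RMM":          # Random-Missing Modality: modalities dropped at random
        for name in out:
            if rng.random() < drop_prob:
                out[name] = np.zeros_like(out[name])
    elif scenario == "NM":           # Noisy Modality: additive Gaussian noise
        for name in out:
            out[name] = out[name] + rng.normal(0.0, noise_std, out[name].shape)
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    return out
```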
arXiv Detail & Related papers (2025-03-24T08:46:52Z)
- The Lessons of Developing Process Reward Models in Mathematical Reasoning [62.165534879284735]
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in the reasoning processes.
We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with LLM-as-a-judge evaluation.
We release a new state-of-the-art PRM that outperforms existing open-source alternatives.
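A toy sketch of the consensus idea described above; how each source produces its label is an assumption here, the point is that a step-level label is retained only when the Monte Carlo estimate and the LLM judge agree:

```python
def consensus_filter(steps, mc_labels, judge_labels):
    """Keep (step, label) pairs only where the two labeling sources agree.

    steps:        list of reasoning steps (strings)
    mc_labels:    step correctness from Monte Carlo estimation (bool per step)
    judge_labels: step correctness from an LLM judge (bool per step)
    """
    return [(step, mc) for step, mc, judge in zip(steps, mc_labels, judge_labels)
            if mc == judge]   # disagreement -> discard as unreliable supervision
```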
arXiv Detail & Related papers (2025-01-13T13:10:16Z)
- RMB: Comprehensively Benchmarking Reward Models in LLM Alignment [44.84304822376291]
Reward models (RMs) guide the alignment of large language models (LLMs). We propose RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios. Based on our benchmark, we conduct extensive analysis on the state-of-the-art RMs.
arXiv Detail & Related papers (2024-10-13T16:06:54Z)
- Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown [20.753374166695494]
We introduce the Uncertainty-aware Reward Model (URM) and its ensemble variant, URME. URM employs a probabilistic value head to capture aleatoric uncertainty by modeling the distribution of disentangled human preference attributes. URME further quantifies uncertainty by examining discrepancies among individual URMs within the ensemble, enabling identification of unreliable evaluations.
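A minimal sketch of the ensemble-discrepancy idea; the choice of disagreement statistic and threshold is an assumption: score a response with each URM in the ensemble and flag the evaluation as unreliable when the members disagree strongly.

```python
import numpy as np

def urme_style_check(rewards_per_member, max_spread=1.0):
    """Hypothetical URME-style reliability check.

    rewards_per_member: rewards assigned to one response by each ensemble member.
    Returns the mean reward plus a flag marking the evaluation unreliable when
    the members' disagreement (here, max-min spread) exceeds a chosen threshold.
    """
    r = np.asarray(rewards_per_member, dtype=float)
    spread = float(r.max() - r.min())        # discrepancy across ensemble members
    return {"reward": float(r.mean()), "spread": spread,
            "unreliable": spread > max_spread}
```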
arXiv Detail & Related papers (2024-10-01T16:29:59Z)
- RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences. We introduce a causal framework that learns preferences independent of irrelevant artifacts, such as response length and format. Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z)
- Reward-Robust RLHF in LLMs [25.31456438114974]
Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence.
The reliance on reward-model-based (RM-based) alignment methods introduces significant challenges.
We introduce a reward-robust RLHF framework aimed at addressing these fundamental challenges.
arXiv Detail & Related papers (2024-09-18T02:35:41Z)
- Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [23.27203570485055]
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models with human preferences.
We propose a two-stage approach to train a reward model (RM) with multi-dimensional absolute-rating data.
We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM.
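A minimal sketch of the mixture-of-experts scoring described above, with layer sizes and parameter shapes as assumptions (the summary omits them): per-objective reward heads produce absolute ratings, and a shallow gating MLP conditioned on the prompt embedding mixes them into one scalar reward.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gated_reward(prompt_emb, objective_rewards, W1, b1, W2, b2):
    """Illustrative gating: a shallow MLP maps the prompt embedding to
    non-negative weights over reward objectives (helpfulness, safety, ...),
    and the final reward is the weighted sum of the per-objective ratings."""
    hidden = np.maximum(0.0, W1 @ prompt_emb + b1)   # single ReLU hidden layer
    gates = softmax(W2 @ hidden + b2)                # one gate per objective
    return float(gates @ objective_rewards)

# Assumed shapes: prompt_emb (d,), W1 (h, d), b1 (h,), W2 (k, h), b2 (k,),
# objective_rewards (k,) for k reward objectives.
```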
arXiv Detail & Related papers (2024-06-18T17:58:28Z)
- Confronting Reward Model Overoptimization with Constrained RLHF [114.71591361764547]
We show that the correlation between component RMs has a significant effect on the locations of the points at which overoptimization sets in.
Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers.
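A sketch of that Lagrangian view of the weighting, under the assumption of one primary reward plus thresholded constraints on the remaining component RMs (the paper's exact constrained objective may differ):

```python
import numpy as np

def lagrangian_reward(rewards, lambdas, thresholds):
    """Combined training signal: maximize the first component reward while
    each remaining component is constrained to stay above its threshold,
    with the Lagrange multipliers acting as dynamic weights."""
    r, lam, tau = map(np.asarray, (rewards, lambdas, thresholds))
    return float(r[0] + np.sum(lam * (r[1:] - tau)))

def dual_ascent_step(lambdas, rewards, thresholds, lr=1e-2):
    """Raise a multiplier when its constraint is violated, shrink it toward
    zero when the constraint is satisfied (projected onto lam >= 0)."""
    lam = np.asarray(lambdas, dtype=float)
    violation = np.asarray(thresholds) - np.asarray(rewards)[1:]
    return np.maximum(0.0, lam + lr * violation)
```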
arXiv Detail & Related papers (2023-10-06T16:59:17Z)
- On Continual Model Refinement in Out-of-Distribution Data Streams [64.62569873799096]
Real-world natural language processing (NLP) models need to be continually updated to fix the prediction errors in out-of-distribution (OOD) data streams.
Existing continual learning (CL) problem setups cannot cover such a realistic and complex scenario.
We propose a new CL problem formulation dubbed continual model refinement (CMR).
arXiv Detail & Related papers (2022-05-04T11:54:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.