Uncertainty Quantification for Large Language Model Reward Learning under Heterogeneous Human Feedback
- URL: http://arxiv.org/abs/2512.03208v1
- Date: Tue, 02 Dec 2025 20:22:25 GMT
- Title: Uncertainty Quantification for Large Language Model Reward Learning under Heterogeneous Human Feedback
- Authors: Pangpang Liu, Junwei Lu, Will Wei Sun
- Abstract summary: We study estimation and statistical inference for reward models used in aligning large language models (LLMs). A key component of LLM alignment is reinforcement learning from human feedback (RLHF).
- Score: 8.538830579425147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study estimation and statistical inference for reward models used in aligning large language models (LLMs). A key component of LLM alignment is reinforcement learning from human feedback (RLHF), where humans compare pairs of model-generated answers and their preferences are used to train a reward model. However, human feedback is inherently heterogeneous, creating significant challenges for reliable reward learning. To address this, we adopt a heterogeneous preference framework that jointly models the latent reward of answers and human rationality. This leads to a challenging biconvex optimization problem, which we solve via an alternating gradient descent algorithm. We establish theoretical guarantees for the resulting estimator, including its convergence and asymptotic distribution. These results enable the construction of confidence intervals for reward estimates. Leveraging these uncertainty quantification results, we conduct valid statistical comparisons between rewards and incorporate uncertainty into the best-of-$N$ (BoN) policy framework. Extensive simulations demonstrate the effectiveness of our method, and applications to real LLM data highlight the practical value of accounting for uncertainty in reward modeling for LLM alignment.
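To make the pipeline concrete, below is a minimal sketch of the estimation step, assuming the natural parameterization in which annotator $a$ prefers answer $i$ over answer $j$ with probability $\sigma(\beta_a (r_i - r_j))$, where $r$ is the latent reward vector and $\beta_a$ the annotator's rationality. The step sizes, iteration count, and normalization below are illustrative choices, not the paper's exact algorithm.

```python
# Sketch: alternating gradient descent on the biconvex negative
# log-likelihood of a heterogeneous Bradley-Terry-style model.
# Assumed model: P(i beats j | annotator a) = sigmoid(beta_a * (r_i - r_j)).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def alternating_gd(wins, losses, annots, n_items, n_annots,
                   lr=0.1, n_rounds=200):
    """wins[k]/losses[k]: preferred/rejected answer in comparison k;
    annots[k]: the annotator who made comparison k."""
    r = np.zeros(n_items)       # latent rewards
    beta = np.ones(n_annots)    # annotator rationality (inverse temperature)
    for _ in range(n_rounds):
        # r-step: gradient of the negative log-likelihood with beta fixed.
        margin = r[wins] - r[losses]
        g = beta[annots] * (sigmoid(beta[annots] * margin) - 1.0)
        grad_r = np.zeros(n_items)
        np.add.at(grad_r, wins, g)
        np.add.at(grad_r, losses, -g)
        r -= lr * grad_r
        r -= r.mean()  # pin location: rewards are identified only up to a shift
        # beta-step: gradient with r fixed (a full treatment would also fix
        # the shared scale between r and beta, and might constrain beta >= 0).
        margin = r[wins] - r[losses]
        g = sigmoid(beta[annots] * margin) - 1.0
        grad_beta = np.zeros(n_annots)
        np.add.at(grad_beta, annots, g * margin)
        beta -= lr * grad_beta
    return r, beta

# Toy usage: comparisons as (winner, loser, annotator) triples.
wins = np.array([0, 0, 1, 2, 0])
losses = np.array([1, 2, 0, 1, 1])
annots = np.array([0, 0, 1, 1, 0])
r_hat, beta_hat = alternating_gd(wins, losses, annots, n_items=3, n_annots=2)
```

Each subproblem is convex with the other block held fixed, which is what makes the alternating scheme natural for the biconvex objective; the paper's asymptotic-normality results are what would then justify plug-in confidence intervals around $\hat r$ (the Fisher-information computation is omitted here).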
Related papers
- Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling [49.41422138354821]
We propose a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process. We show that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
arXiv Detail & Related papers (2026-02-11T08:14:11Z)
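As a rough illustration of the BNRM summary above, the reward can be read as a non-negative combination of latent factors plugged into the Bradley-Terry likelihood; the factor shapes and any sparsity penalty below are assumptions, not details from the paper.

```python
# Illustrative only: a non-negative latent-factor reward inside a
# Bradley-Terry comparison. H and w would be learned under non-negativity
# (and, per the summary, sparsity) constraints; shapes are assumed here.
import numpy as np

def preference_prob(H, w, i, j):
    """H: (n_answers, n_factors) non-negative factor scores;
    w: (n_factors,) non-negative factor weights.
    Bradley-Terry probability that answer i beats answer j."""
    r = H @ w   # reward decomposes over interpretable factors
    return 1.0 / (1.0 + np.exp(-(r[i] - r[j])))
```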
- Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models [63.00458229517523]
This work addresses the evaluation challenge of reward models by probing preference representations. We construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks for different preference dimensions. We introduce an analysis method, inference-time probing, which identifies the dimensions used during reward prediction and enhances its interpretability.
arXiv Detail & Related papers (2025-11-16T05:29:29Z)
- Confidence as a Reward: Transforming LLMs into Reward Models [54.98336080630691]
Confidence-as-a-Reward (CRew) is a training-free method that utilizes token-level confidence in the model's final answers as a proxy for reward. We show that CRew outperforms existing training-free reward approaches on the MATH500 and RewardMATH benchmarks. We propose CRew-DPO, a training strategy that constructs preference data from confidence scores combined with correctness signals.
arXiv Detail & Related papers (2025-10-15T12:51:47Z)
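A minimal sketch of the confidence-as-a-reward idea above, assuming access to per-token log-probabilities over the final-answer tokens; averaging token probabilities is one natural aggregation, not necessarily CRew's exact scoring rule.

```python
# Hypothetical helper: the model's own token-level confidence in its final
# answer, averaged, serves as a training-free proxy reward.
import math

def confidence_reward(answer_token_logprobs: list[float]) -> float:
    if not answer_token_logprobs:
        return 0.0
    probs = [math.exp(lp) for lp in answer_token_logprobs]
    return sum(probs) / len(probs)

# Usage: rank candidate answers by the model's confidence in each.
candidates = {"a1": [-0.1, -0.2], "a2": [-1.5, -2.0]}
best = max(candidates, key=lambda k: confidence_reward(candidates[k]))  # "a1"
```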
- Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment [1.8552770604791606]
We propose a hybrid reward modeling framework that integrates complementary reward paradigms. We show consistent improvements across different multimodal benchmarks when applying hybrid and multi-aspect reward modeling. Our best-performing model in the 3B family achieves an overall average improvement of 9.5% across general and math reasoning tasks.
arXiv Detail & Related papers (2025-10-06T18:53:23Z)
- Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning [11.31665596884142]
Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models with human preferences. Most existing RLHF algorithms use the Bradley-Terry model, which relies on assumptions about human preferences that may not reflect the complexity and variability of real-world judgments. We propose a robust algorithm to enhance the performance of existing approaches under such reward model misspecifications.
arXiv Detail & Related papers (2025-04-03T16:16:35Z)
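Several entries here (the robust-RLHF paper above, the Rc-BT paper below, and the main paper's preference framework) build on the Bradley-Terry model, whose standard RLHF form assigns each pairwise comparison the probability

```latex
P(y_1 \succ y_2 \mid x)
  = \frac{\exp\{r(x, y_1)\}}{\exp\{r(x, y_1)\} + \exp\{r(x, y_2)\}}
  = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr)
```

where $r$ is the reward model and $\sigma$ the logistic function. The misspecification concern above is precisely that real annotators need not follow this likelihood, which is also what motivates the heterogeneous rationality term in the main paper.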
- Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling [87.17041933863041]
Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs). We introduce a Response-conditioned Bradley-Terry (Rc-BT) model that enhances the model's ability to mitigate length bias and follow length instructions. We also propose the Rc-RM and Rc-DPO algorithms, which leverage the Rc-BT model for reward modeling and direct policy optimization, respectively.
arXiv Detail & Related papers (2025-02-02T14:50:25Z)
- Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment [30.605500809158986]
We propose a novel causal reward modeling approach that integrates causality to mitigate spurious correlations. Our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences.
arXiv Detail & Related papers (2025-01-16T16:00:37Z)
- Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization [9.618391485742968]
Iterative preference optimization has recently become one of the de facto training paradigms for large language models (LLMs).
We present an uncertainty-enhanced Preference Optimization framework that enables the LLM to self-evolve with reliable feedback.
Our framework substantially alleviates the noisy-feedback problem and improves the performance of iterative preference optimization.
arXiv Detail & Related papers (2024-09-17T14:05:58Z)
- A Unified Linear Programming Framework for Offline Reward Learning from Human Demonstrations and Feedback [6.578074497549894]
Inverse Reinforcement Learning (IRL) and Reinforcement Learning from Human Feedback (RLHF) are pivotal methodologies in reward learning.
This paper introduces a novel linear programming (LP) framework tailored for offline reward learning.
arXiv Detail & Related papers (2024-05-20T23:59:26Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons [79.98542868281473]
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF).
We show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions.
arXiv Detail & Related papers (2023-01-26T18:07:21Z)
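The pessimistic-MLE result above and the main paper's uncertainty-aware best-of-$N$ share one mechanic: act on a lower confidence bound of the estimated reward rather than its point estimate. A minimal sketch under assumed inputs (per-candidate reward estimates with standard errors, and a pessimism weight kappa that neither paper specifies in this form):

```python
# Illustrative only: choose among N candidate answers by the lower
# confidence bound r_hat - kappa * se instead of r_hat alone.
import numpy as np

def pick_pessimistic(reward_hat: np.ndarray, std_err: np.ndarray,
                     kappa: float = 1.0) -> int:
    return int(np.argmax(reward_hat - kappa * std_err))

# The point-estimate winner (index 2) loses to a candidate whose reward is
# nearly as high but far better resolved.
print(pick_pessimistic(np.array([0.9, 1.0, 1.1]),
                       np.array([0.05, 0.05, 0.50])))  # -> 1
```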
This list is automatically generated from the titles and abstracts of the papers in this site.