Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
- URL: http://arxiv.org/abs/2510.13694v1
- Date: Wed, 15 Oct 2025 15:51:59 GMT
- Title: Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
- Authors: Yuchun Miao, Liang Ding, Sen Zhang, Rong Bao, Lefei Zhang, Dacheng Tao,
- Abstract summary: We propose an information-theoretic reward modeling framework based on the Information Bottleneck principle.<n>We show that InfoRM filters out preference-irrelevant information to alleviate reward misgeneralization.<n>We also introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape.
- Score: 78.69179041551014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the success of Reinforcement Learning from Human Feedback (RLHF) in aligning language models with human values, reward hacking-or reward over-optimization-remains a major challenge. We identify two key obstacles to its mitigation: (1) reward misgeneralization in reward modeling, where reward models overfit to spurious, preference-irrelevant features; and (2) the lack of suitable regularization during RL optimization, as existing token-level constraints often over-restrict the policy space. To address these issues, we propose InfoRM, an information-theoretic reward modeling framework based on the Information Bottleneck (IB) principle, which filters out preference-irrelevant information to alleviate reward misgeneralization. We further observe that reward-hacked responses manifest as pronounced outliers in InfoRM's IB latent space, measured by Mahalanobis distance from the SFT-induced distribution. Motivated by this, we introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape while maintaining alignment. We prove that IBL is theoretically equivalent to the pessimistic RL objective within the IB latent space. Finally, we present Mahalanobis Outlier Probability (MOP), a statistical metric for quantifying reward hacking severity, enabling principled hyperparameter tuning and online mitigation such as early stopping. Extensive experiments across diverse LLMs and datasets confirm the generality of our findings, the effectiveness of InfoRM and IBL, and the reliability of MOP as a diagnostic tool-collectively advancing the state of RLHF.
Related papers
- Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling [49.41422138354821]
We propose a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry preference model.<n>BNRM represents rewards through a sparse, non-negative latent factor generative process.<n>We show that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
arXiv Detail & Related papers (2026-02-11T08:14:11Z) - Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs)<n>We argue that Long CoT is inherently ill-suited for the sequential recommendation domain.<n>We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z) - Data-regularized Reinforcement Learning for Diffusion Models at Scale [99.01056178660538]
We introduce Data-regularized Diffusion Reinforcement Learning ( DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution.<n>With over a million GPU hours of experiments and ten thousand double-blind evaluations, we demonstrate that DDRL significantly improves rewards while alleviating the reward hacking seen in RLs.
arXiv Detail & Related papers (2025-12-03T23:45:07Z) - Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization [23.817251267022847]
We propose the Behavior-Supported Policy Optimization (BSPO) method to mitigate the reward over-optimization issue.<n>BSPO reduces the generation of OOD responses during the reinforcement learning process.<n> Empirical results show that BSPO outperforms baselines in preventing reward over-optimization.
arXiv Detail & Related papers (2025-03-23T16:20:59Z) - Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization [9.618391485742968]
Iterative preference optimization has recently become one of the de-facto training paradigms for large language models (LLMs)
We present an uncertainty-enhanced textbfPreference textbfOptimization framework to make the LLM self-evolve with reliable feedback.
Our framework substantially alleviates the noisy problem and improves the performance of iterative preference optimization.
arXiv Detail & Related papers (2024-09-17T14:05:58Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling [66.3072381478251]
Reward hacking, also termed reward overoptimization, remains a critical challenge.
We propose a framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective.
We show that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets.
arXiv Detail & Related papers (2024-02-14T17:49:07Z) - Uncertainty-Penalized Reinforcement Learning from Human Feedback with
Diverse Reward LoRA Ensembles [26.955375398765085]
Reinforcement learning from human feedback (RLHF) emerges as a promising paradigm for aligning large language models (LLMs)
In this paper, we observe the weakness of KL regularization which is commonly employed in existing RLHF methods to address overoptimization.
We propose uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty regularization during RL-finetuning.
arXiv Detail & Related papers (2023-12-30T14:14:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.