General Exploratory Bonus for Optimistic Exploration in RLHF
- URL: http://arxiv.org/abs/2510.03269v2
- Date: Tue, 14 Oct 2025 14:34:23 GMT
- Title: General Exploratory Bonus for Optimistic Exploration in RLHF
- Authors: Wendi Li, Changdae Oh, Sharon Li
- Abstract summary: Current formulations unintentionally bias exploration toward high-probability regions of the reference model. We introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle.
- Score: 14.355066862800747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $\alpha$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $\alpha$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.
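The abstract names the failure mode (bonuses that concentrate on responses the reference model already favors) and the fix (reference-dependent reward regulation), but does not give GEB's closed form. The toy below is a minimal, hypothetical sketch of that contrast only: a naive bonus proportional to $\pi_{\text{ref}}$ versus a reference-regulated bonus that grows where $\pi_{\text{ref}}$ is small. The arrays, both bonus forms, and the scale beta are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

# Toy reference distribution over 5 candidate responses and their proxy rewards.
pi_ref = np.array([0.60, 0.20, 0.10, 0.07, 0.03])  # reference-model probabilities
reward = np.array([0.50, 0.55, 0.40, 0.70, 0.65])  # proxy reward per response

# Naive bonus (hypothetical): proportional to pi_ref, so exploration is pulled
# toward regions the reference model already favors -- the bias the paper identifies.
naive_bonus = pi_ref

# Reference-regulated bonus (hypothetical stand-in for GEB's idea): larger where
# pi_ref is small, so optimism targets under-explored, uncertain responses.
optimistic_bonus = -np.log(pi_ref)

beta = 0.5  # bonus scale (illustrative)
print("naive pick:     ", int(np.argmax(reward + beta * naive_bonus)))       # -> 0, the reference mode
print("optimistic pick:", int(np.argmax(reward + beta * optimistic_bonus)))  # -> 4, the rarest response
```

Under the naive bonus the selected response is simply the reference model's mode; under the reference-regulated bonus, low-probability responses become competitive, which is the optimism property the paper formalizes.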
Related papers
- GARDO: Reinforcing Diffusion Models without Reward Hacking [54.841464430913476]
Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, the mismatch between proxy rewards and true image quality often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. We propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO) to address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking (a toy gating sketch follows this entry).
arXiv Detail & Related papers (2025-12-30T10:55:45Z)
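GARDO's actual gating and regularization rules are not given in the snippet above; purely to illustrate what "diversity-aware, adaptive regularization" could look like, this sketch raises a KL-style penalty weight when a batch of generations collapses. The function, the variance-based diversity proxy, and every constant are assumptions, not GARDO's design.

```python
import numpy as np

def adaptive_reg_weight(samples: np.ndarray, base_weight: float,
                        diversity_floor: float, max_scale: float = 10.0) -> float:
    """Hypothetical diversity-aware gate: strengthen the regularization weight
    when generations collapse (low variance), leave it unchanged otherwise.
    An illustrative guess at the shape of such a rule, not GARDO's."""
    diversity = float(np.mean(np.var(samples, axis=0)))  # crude diversity proxy
    if diversity >= diversity_floor:
        return base_weight
    return base_weight * min(diversity_floor / max(diversity, 1e-8), max_scale)

# Toy usage on 8 generation embeddings (16-dim each).
rng = np.random.default_rng(0)
collapsed = 0.01 * rng.normal(size=(8, 16))  # near-duplicate generations
diverse = rng.normal(size=(8, 16))           # healthy, varied generations
print(adaptive_reg_weight(collapsed, base_weight=0.05, diversity_floor=0.5))  # scaled up
print(adaptive_reg_weight(diverse, base_weight=0.05, diversity_floor=0.5))    # stays 0.05
```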
- Greedy Sampling Is Provably Efficient for RLHF [19.590316589389577]
This work considers the general preference model and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates. This insight is rooted in a unique structural property of the optimal policy class under the KL-regularized target (sketched after this entry), and we further specialize it to the BT model.
arXiv Detail & Related papers (2025-10-28T17:52:08Z)
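The "structural property" mentioned above plausibly refers to the standard closed form of the KL-regularized optimum over a finite response set, $\pi^*(y) \propto \pi_{\text{ref}}(y)\,\exp(r(y)/\beta)$: plugging empirical reward estimates straight into it yields a greedy policy with no explicit bonus. The sketch below evaluates that closed form on toy numbers; the paper's algorithms and guarantees are not reproduced here.

```python
import numpy as np

def kl_regularized_policy(r_hat: np.ndarray, pi_ref: np.ndarray,
                          beta: float) -> np.ndarray:
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_ref) on a finite set:
    pi*(y) proportional to pi_ref(y) * exp(r(y) / beta). Using empirical estimates
    r_hat directly is the 'greedy' construction the paper studies."""
    logits = np.log(pi_ref) + r_hat / beta
    z = np.exp(logits - logits.max())  # shift for numerical stability
    return z / z.sum()

pi_ref = np.array([0.5, 0.3, 0.2])  # reference policy over 3 responses
r_hat = np.array([0.1, 0.9, 0.4])   # empirical reward estimates
print(kl_regularized_policy(r_hat, pi_ref, beta=0.5).round(3))
```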
- $\text{G}^2$RPO: Granular GRPO for Precise Reward in Flow Models [74.21206048155669]
We propose a novel Granular GRPO ($\text{G}^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions. We also introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions.
arXiv Detail & Related papers (2025-10-02T12:57:12Z)
- Diversity-Incentivized Exploration for Versatile Reasoning [63.653348177250756]
We propose DIVER (Diversity-Incentivized Exploration for Versatile Reasoning), an innovative framework that highlights the pivotal role of global sequence-level diversity to incentivize deep exploration for versatile reasoning.
arXiv Detail & Related papers (2025-09-30T13:11:46Z)
- Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes [55.2480439325792]
Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics. Here, we examine whether current RL methods are also effective at optimizing language models in verifiable domains with stochastic outcomes, like scientific experiments.
arXiv Detail & Related papers (2025-08-15T20:50:53Z)
- On Efficient Bayesian Exploration in Model-Based Reinforcement Learning [0.24578723416255752]
We address the challenge of data-efficient exploration in reinforcement learning by examining existing principled, information-theoretic approaches to intrinsic motivation. We prove that exploration bonuses naturally signal epistemic information gains and converge to zero once the agent becomes sufficiently certain about the environment's dynamics and rewards. We then outline a general framework, Predictive Trajectory Sampling with Bayesian Exploration (PTS-BE), which integrates model-based planning with information-theoretic bonuses to achieve sample-efficient deep exploration (a toy bonus-decay sketch follows this entry).
arXiv Detail & Related papers (2025-07-03T14:03:47Z)
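The snippet's central claim is that principled bonuses track epistemic information gain and vanish as the dynamics become certain. As a minimal illustration of that decay (not PTS-BE itself), the sketch below uses disagreement across a bootstrap ensemble of toy dynamics models as the bonus; the model class, ensemble size, and noise level are arbitrary choices.

```python
import numpy as np

def epistemic_bonus(models, state):
    """Ensemble disagreement as a stand-in for an information-gain bonus;
    it shrinks toward zero as the members converge on the true dynamics."""
    preds = np.stack([m(state) for m in models])
    return float(preds.std(axis=0).mean())

rng = np.random.default_rng(1)
true_A = np.array([[0.9, 0.1], [0.0, 0.8]])  # toy linear dynamics
for n in (5, 50, 500):                       # growing amounts of experience
    X = rng.normal(size=(n, 2))
    Y = X @ true_A.T + 0.1 * rng.normal(size=(n, 2))
    models = []
    for _ in range(8):                       # bootstrap ensemble of fits
        idx = rng.integers(0, n, size=n)
        A_hat, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
        models.append(lambda s, A=A_hat: s @ A)
    print(n, round(epistemic_bonus(models, rng.normal(size=2)), 4))  # bonus decays
```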
- Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z)
- On the Importance of Exploration for Generalization in Reinforcement Learning [89.63074327328765]
We propose EDE: Exploration via Distributional Ensemble, a method that encourages exploration of states with high uncertainty.
Our algorithm is the first value-based approach to achieve state-of-the-art results on both Procgen and Crafter (an ensemble-uncertainty sketch follows this entry).
arXiv Detail & Related papers (2023-06-08T18:07:02Z)
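EDE's snippet says only that exploration should target states with high uncertainty. The sketch below shows the generic ensemble version of that idea: act greedily with respect to the mean value plus a disagreement bonus. The action-selection rule, the weight phi, and the numbers are illustrative; EDE's actual estimator is a distributional ensemble with its own rule.

```python
import numpy as np

def uncertainty_seeking_action(q_ensemble: np.ndarray, phi: float) -> int:
    """Mean Q plus an epistemic bonus measured as across-member disagreement;
    a minimal sketch of 'explore where the ensemble is uncertain', not EDE's rule."""
    mean_q = q_ensemble.mean(axis=0)
    epistemic = q_ensemble.std(axis=0)
    return int(np.argmax(mean_q + phi * epistemic))

# Toy: 5 ensemble members x 3 actions; members disagree sharply on action 2.
q = np.array([[1.00, 0.80, 0.2],
              [1.10, 0.90, 1.5],
              [0.90, 0.85, 0.1],
              [1.00, 0.80, 1.2],
              [1.05, 0.90, 0.3]])
print(uncertainty_seeking_action(q, phi=0.0))  # greedy -> action 0
print(uncertainty_seeking_action(q, phi=1.0))  # uncertainty-seeking -> action 2
```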
- DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards [2.09711130126031]
Exploration is a fundamental aspect of reinforcement learning (RL), and its effectiveness is a deciding factor in the performance of RL algorithms.
Recent studies have shown the effectiveness of encouraging exploration with intrinsic rewards estimated from novelty in observations.
We propose DEIR, a novel method in which we theoretically derive an intrinsic reward with a conditional mutual information term (a simplified episodic-novelty sketch follows this entry).
arXiv Detail & Related papers (2023-04-21T06:39:38Z)
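DEIR's intrinsic reward comes from a conditional mutual information term estimated with a trained discriminative model; that machinery cannot be reconstructed from the snippet. As a simplified stand-in for the episodic-novelty intuition only, the sketch below scores each observation by its distance to the nearest observation already seen this episode.

```python
import numpy as np

def episodic_novelty(obs: np.ndarray, episode_memory: list) -> float:
    """Bounded novelty score from nearest-neighbor distance within the episode;
    repeated observations earn ~0. A distance-based simplification, not DEIR."""
    if not episode_memory:
        return 1.0
    d = min(np.linalg.norm(obs - m) for m in episode_memory)
    return float(d / (1.0 + d))  # bounded in [0, 1)

memory = []
for step, obs in enumerate([np.zeros(3), np.ones(3), np.ones(3), 2 * np.ones(3)]):
    r_int = episodic_novelty(obs, memory)
    memory.append(obs)
    print(step, round(r_int, 3))  # the repeated observation at step 2 gets 0.0
```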
- Exploration in Model-based Reinforcement Learning with Randomized Reward [40.87376174638752]
We show that under the kernelized nonlinear regulator (KNR) model, reward randomization guarantees a partial optimism.
We further extend our theory to generalized function approximation and identify conditions under which reward randomization attains provably efficient exploration (a generic perturbation sketch follows this entry).
arXiv Detail & Related papers (2023-01-09T01:50:55Z)
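The paper's construction is tied to the KNR model, but the generic shape of "reward randomization" is easy to show: plan each round against the estimated reward plus fresh noise, so that with constant probability the perturbed reward upper-bounds the truth (the "partial optimism" above). The Gaussian form and scale below are assumptions, not the paper's construction.

```python
import numpy as np

def randomized_reward(r_hat: np.ndarray, sigma: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Per-round perturbation of the estimated reward; a generic sketch of
    reward randomization, not the paper's KNR-specific construction."""
    return r_hat + sigma * rng.normal(size=r_hat.shape)

rng = np.random.default_rng(2)
r_hat = np.array([0.3, 0.5, 0.4])  # estimated rewards for 3 state-action pairs
for episode in range(3):
    r_tilde = randomized_reward(r_hat, sigma=0.2, rng=rng)
    print(episode, int(np.argmax(r_tilde)))  # the planner's greedy pick varies
```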
- Global convergence of optimized adaptive importance samplers [0.0]
We analyze the optimized adaptive importance sampler (OAIS) for performing Monte Carlo integration with general proposals.
We derive nonasymptotic bounds for the global gradient of the $\chi^2$-divergence of the proposals.
arXiv Detail & Related papers (2022-01-02T19:56:36Z)
- The Benefits of Being Categorical Distributional: Uncertainty-aware Regularized Exploration in Reinforcement Learning [17.64056793687686]
We find that the potential superiority of distributional RL can be attributed to a derived distribution-matching entropy regularization. Our study offers a new perspective, based on exploration, to explain the intrinsic benefits of adopting distributional learning in RL.
arXiv Detail & Related papers (2021-10-07T03:14:46Z)
- Reinforced Imitation Learning by Free Energy Principle [2.9327503320877457]
Reinforcement Learning (RL) requires a large amount of exploration, especially in sparse-reward settings.
Imitation Learning (IL) can learn from expert demonstrations without exploration.
We radically unify RL and IL based on the Free Energy Principle (FEP).
arXiv Detail & Related papers (2021-07-25T14:19:29Z)
- Principled Exploration via Optimistic Bootstrapping and Backward Induction [84.78836146128238]
We propose a principled exploration method for Deep Reinforcement Learning (DRL) through Optimistic Bootstrapping and Backward Induction (OB2I).
OB2I constructs a general-purpose UCB-bonus through non-parametric bootstrap in DRL.
We build theoretical connections between the proposed UCB-bonus and the LSVI-UCB in a linear setting (a toy bootstrap-bonus sketch follows this entry).
arXiv Detail & Related papers (2021-05-13T01:15:44Z)
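OB2I's bonus comes from a non-parametric bootstrap over value estimates and is propagated by backward induction. The sketch below shows that combination in the smallest possible setting: bootstrapped Q-heads over a short horizon, with the max-minus-mean spread serving as a UCB-style bonus at each backup step. The shapes and the spread formula are illustrative assumptions, not the paper's exact bonus.

```python
import numpy as np

def ucb_bonus(q_heads: np.ndarray) -> np.ndarray:
    """Spread of bootstrapped Q estimates as a UCB-style bonus per action;
    an illustrative reading of a bootstrap bonus, not OB2I's definition."""
    return q_heads.max(axis=0) - q_heads.mean(axis=0)

rng = np.random.default_rng(3)
H, A, K = 3, 2, 10                    # horizon, actions, bootstrap heads
q_heads = rng.normal(size=(K, H, A))  # pretend bootstrapped one-step Q estimates
v_next = np.zeros(K)                  # each head's value of the next step
for h in reversed(range(H)):          # backward induction over the horizon
    q = q_heads[:, h, :] + v_next[:, None]  # Bellman backup per head
    print("step", h, "bonus per action:", ucb_bonus(q).round(3))
    v_next = q.max(axis=1)            # heads act greedily; uncertainty propagates
```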