General Exploratory Bonus for Optimistic Exploration in RLHF
- URL: http://arxiv.org/abs/2510.03269v2
- Date: Tue, 14 Oct 2025 14:34:23 GMT
- Title: General Exploratory Bonus for Optimistic Exploration in RLHF
- Authors: Wendi Li, Changdae Oh, Sharon Li
- Abstract summary: Current formulations unintentionally bias exploration toward high-probability regions of the reference model. We introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle.
- Score: 14.355066862800747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $\alpha$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $\alpha$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.
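The abstract names the failure mode (bonuses that concentrate on responses the reference model already favors) and the fix (reference-dependent reward regulation), but does not give GEB's closed form. The toy below is a minimal, hypothetical sketch of that contrast only: a naive bonus proportional to $\pi_{\text{ref}}$ versus a reference-regulated bonus that grows where $\pi_{\text{ref}}$ is small. The arrays, both bonus forms, and the scale beta are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

# Toy reference distribution over 5 candidate responses and their proxy rewards.
pi_ref = np.array([0.60, 0.20, 0.10, 0.07, 0.03])  # reference-model probabilities
reward = np.array([0.50, 0.55, 0.40, 0.70, 0.65])  # proxy reward per response

# Naive bonus (hypothetical): proportional to pi_ref, so exploration is pulled
# toward regions the reference model already favors -- the bias the paper identifies.
naive_bonus = pi_ref

# Reference-regulated bonus (hypothetical stand-in for GEB's idea): larger where
# pi_ref is small, so optimism targets under-explored, uncertain responses.
optimistic_bonus = -np.log(pi_ref)

beta = 0.5  # bonus scale (illustrative)
print("naive pick:     ", int(np.argmax(reward + beta * naive_bonus)))       # -> 0, the reference mode
print("optimistic pick:", int(np.argmax(reward + beta * optimistic_bonus)))  # -> 4, the rarest response
```

Under the naive bonus the selected response is simply the reference model's mode; under the reference-regulated bonus, low-probability responses become competitive, which is the optimism property the paper formalizes.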
Related papers
- GARDO: Reinforcing Diffusion Models without Reward Hacking [54.841464430913476]
Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, the mismatch between proxy rewards and true image quality often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. We propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO) to address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking (a toy gating sketch follows this entry).
arXiv Detail & Related papers (2025-12-30T10:55:45Z)
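GARDO's actual gating and regularization rules are not given in the snippet above; purely to illustrate what "diversity-aware, adaptive regularization" could look like, this sketch raises a KL-style penalty weight when a batch of generations collapses. The function, the variance-based diversity proxy, and every constant are assumptions, not GARDO's design.

```python
import numpy as np

def adaptive_reg_weight(samples: np.ndarray, base_weight: float,
                        diversity_floor: float, max_scale: float = 10.0) -> float:
    """Hypothetical diversity-aware gate: strengthen the regularization weight
    when generations collapse (low variance), leave it unchanged otherwise.
    An illustrative guess at the shape of such a rule, not GARDO's."""
    diversity = float(np.mean(np.var(samples, axis=0)))  # crude diversity proxy
    if diversity >= diversity_floor:
        return base_weight
    return base_weight * min(diversity_floor / max(diversity, 1e-8), max_scale)

# Toy usage on 8 generation embeddings (16-dim each).
rng = np.random.default_rng(0)
collapsed = 0.01 * rng.normal(size=(8, 16))  # near-duplicate generations
diverse = rng.normal(size=(8, 16))           # healthy, varied generations
print(adaptive_reg_weight(collapsed, base_weight=0.05, diversity_floor=0.5))  # scaled up
print(adaptive_reg_weight(diverse, base_weight=0.05, diversity_floor=0.5))    # stays 0.05
```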
- Greedy Sampling Is Provably Efficient for RLHF [19.590316589389577]
This work considers the general preference model and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates. This insight is rooted in a unique structural property of the optimal policy class under the KL-regularized target (sketched after this entry), and we further specialize it to the BT model.
arXiv Detail & Related papers (2025-10-28T17:52:08Z)
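The "structural property" mentioned above plausibly refers to the standard closed form of the KL-regularized optimum over a finite response set, $\pi^*(y) \propto \pi_{\text{ref}}(y)\,\exp(r(y)/\beta)$: plugging empirical reward estimates straight into it yields a greedy policy with no explicit bonus. The sketch below evaluates that closed form on toy numbers; the paper's algorithms and guarantees are not reproduced here.

```python
import numpy as np

def kl_regularized_policy(r_hat: np.ndarray, pi_ref: np.ndarray,
                          beta: float) -> np.ndarray:
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_ref) on a finite set:
    pi*(y) proportional to pi_ref(y) * exp(r(y) / beta). Using empirical estimates
    r_hat directly is the 'greedy' construction the paper studies."""
    logits = np.log(pi_ref) + r_hat / beta
    z = np.exp(logits - logits.max())  # shift for numerical stability
    return z / z.sum()

pi_ref = np.array([0.5, 0.3, 0.2])  # reference policy over 3 responses
r_hat = np.array([0.1, 0.9, 0.4])   # empirical reward estimates
print(kl_regularized_policy(r_hat, pi_ref, beta=0.5).round(3))
```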
- $\text{G}^2$RPO: Granular GRPO for Precise Reward in Flow Models [74.21206048155669]
We propose a novel Granular GRPO ($\text{G}^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions. We also introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions.
arXiv Detail & Related papers (2025-10-02T12:57:12Z)
- Diversity-Incentivized Exploration for Versatile Reasoning [63.653348177250756]
We propose DIVER (Diversity-Incentivized Exploration for Versatile Reasoning), an innovative framework that highlights the pivotal role of global sequence-level diversity to incentivize deep exploration for versatile reasoning.
arXiv Detail & Related papers (2025-09-30T13:11:46Z)
- Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes [55.2480439325792]
Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics. Here, we examine whether current RL methods are also effective at optimizing language models in verifiable domains with stochastic outcomes, like scientific experiments.
arXiv Detail & Related papers (2025-08-15T20:50:53Z)
- On Efficient Bayesian Exploration in Model-Based Reinforcement Learning [0.24578723416255752]
We address the challenge of data-efficient exploration in reinforcement learning by examining existing principled, information-theoretic approaches to intrinsic motivation. We prove that exploration bonuses naturally signal epistemic information gains and converge to zero once the agent becomes sufficiently certain about the environment's dynamics and rewards. We then outline a general framework, Predictive Trajectory Sampling with Bayesian Exploration (PTS-BE), which integrates model-based planning with information-theoretic bonuses to achieve sample-efficient deep exploration (a toy bonus-decay sketch follows this entry).
arXiv Detail & Related papers (2025-07-03T14:03:47Z)
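The snippet's central claim is that principled bonuses track epistemic information gain and vanish as the dynamics become certain. As a minimal illustration of that decay (not PTS-BE itself), the sketch below uses disagreement across a bootstrap ensemble of toy dynamics models as the bonus; the model class, ensemble size, and noise level are arbitrary choices.

```python
import numpy as np

def epistemic_bonus(models, state):
    """Ensemble disagreement as a stand-in for an information-gain bonus;
    it shrinks toward zero as the members converge on the true dynamics."""
    preds = np.stack([m(state) for m in models])
    return float(preds.std(axis=0).mean())

rng = np.random.default_rng(1)
true_A = np.array([[0.9, 0.1], [0.0, 0.8]])  # toy linear dynamics
for n in (5, 50, 500):                       # growing amounts of experience
    X = rng.normal(size=(n, 2))
    Y = X @ true_A.T + 0.1 * rng.normal(size=(n, 2))
    models = []
    for _ in range(8):                       # bootstrap ensemble of fits
        idx = rng.integers(0, n, size=n)
        A_hat, *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
        models.append(lambda s, A=A_hat: s @ A)
    print(n, round(epistemic_bonus(models, rng.normal(size=2)), 4))  # bonus decays
```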
- Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z)
- On the Importance of Exploration for Generalization in Reinforcement Learning [89.63074327328765]
We propose EDE: Exploration via Distributional Ensemble, a method that encourages exploration of states with high uncertainty.
Our algorithm is the first value-based approach to achieve state-of-the-art results on both Procgen and Crafter (an ensemble-uncertainty sketch follows this entry).
arXiv Detail & Related papers (2023-06-08T18:07:02Z)
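EDE's snippet says only that exploration should target states with high uncertainty. The sketch below shows the generic ensemble version of that idea: act greedily with respect to the mean value plus a disagreement bonus. The action-selection rule, the weight phi, and the numbers are illustrative; EDE's actual estimator is a distributional ensemble with its own rule.

```python
import numpy as np

def uncertainty_seeking_action(q_ensemble: np.ndarray, phi: float) -> int:
    """Mean Q plus an epistemic bonus measured as across-member disagreement;
    a minimal sketch of 'explore where the ensemble is uncertain', not EDE's rule."""
    mean_q = q_ensemble.mean(axis=0)
    epistemic = q_ensemble.std(axis=0)
    return int(np.argmax(mean_q + phi * epistemic))

# Toy: 5 ensemble members x 3 actions; members disagree sharply on action 2.
q = np.array([[1.00, 0.80, 0.2],
              [1.10, 0.90, 1.5],
              [0.90, 0.85, 0.1],
              [1.00, 0.80, 1.2],
              [1.05, 0.90, 0.3]])
print(uncertainty_seeking_action(q, phi=0.0))  # greedy -> action 0
print(uncertainty_seeking_action(q, phi=1.0))  # uncertainty-seeking -> action 2
```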
- DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards [2.09711130126031]
Exploration is a fundamental aspect of reinforcement learning (RL), and its effectiveness is a deciding factor in the performance of RL algorithms.
Recent studies have shown the effectiveness of encouraging exploration with intrinsic rewards estimated from novelty in observations.
We propose DEIR, a novel method in which we theoretically derive an intrinsic reward with a conditional mutual information term (a simplified episodic-novelty sketch follows this entry).
arXiv Detail & Related papers (2023-04-21T06:39:38Z)
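DEIR's intrinsic reward comes from a conditional mutual information term estimated with a trained discriminative model; that machinery cannot be reconstructed from the snippet. As a simplified stand-in for the episodic-novelty intuition only, the sketch below scores each observation by its distance to the nearest observation already seen this episode.

```python
import numpy as np

def episodic_novelty(obs: np.ndarray, episode_memory: list) -> float:
    """Bounded novelty score from nearest-neighbor distance within the episode;
    repeated observations earn ~0. A distance-based simplification, not DEIR."""
    if not episode_memory:
        return 1.0
    d = min(np.linalg.norm(obs - m) for m in episode_memory)
    return float(d / (1.0 + d))  # bounded in [0, 1)

memory = []
for step, obs in enumerate([np.zeros(3), np.ones(3), np.ones(3), 2 * np.ones(3)]):
    r_int = episodic_novelty(obs, memory)
    memory.append(obs)
    print(step, round(r_int, 3))  # the repeated observation at step 2 gets 0.0
```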
- Exploration in Model-based Reinforcement Learning with Randomized Reward [40.87376174638752]
We show that under the kernelized nonlinear regulator (KNR) model, reward randomization guarantees a partial optimism.
We further extend our theory to generalized function approximation and identify conditions under which reward randomization attains provably efficient exploration (a generic perturbation sketch follows this entry).
arXiv Detail & Related papers (2023-01-09T01:50:55Z)
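The paper's construction is tied to the KNR model, but the generic shape of "reward randomization" is easy to show: plan each round against the estimated reward plus fresh noise, so that with constant probability the perturbed reward upper-bounds the truth (the "partial optimism" above). The Gaussian form and scale below are assumptions, not the paper's construction.

```python
import numpy as np

def randomized_reward(r_hat: np.ndarray, sigma: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Per-round perturbation of the estimated reward; a generic sketch of
    reward randomization, not the paper's KNR-specific construction."""
    return r_hat + sigma * rng.normal(size=r_hat.shape)

rng = np.random.default_rng(2)
r_hat = np.array([0.3, 0.5, 0.4])  # estimated rewards for 3 state-action pairs
for episode in range(3):
    r_tilde = randomized_reward(r_hat, sigma=0.2, rng=rng)
    print(episode, int(np.argmax(r_tilde)))  # the planner's greedy pick varies
```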
- Global convergence of optimized adaptive importance samplers [0.0]
We analyze the optimized adaptive importance sampler (OAIS) for performing Monte Carlo integration with general proposals.
We derive nonasymptotic bounds for the global gradient of the $\chi^2$-divergence of the proposals.
arXiv Detail & Related papers (2022-01-02T19:56:36Z)
- The Benefits of Being Categorical Distributional: Uncertainty-aware Regularized Exploration in Reinforcement Learning [17.64056793687686]
We find that the potential superiority of distributional RL can be attributed to a derived distribution-matching entropy regularization. Our study offers a new perspective, based on exploration, to explain the intrinsic benefits of adopting distributional learning in RL.
arXiv Detail & Related papers (2021-10-07T03:14:46Z)
- Reinforced Imitation Learning by Free Energy Principle [2.9327503320877457]
Reinforcement Learning (RL) requires a large amount of exploration, especially in sparse-reward settings.
Imitation Learning (IL) can learn from expert demonstrations without exploration.
We radically unify RL and IL based on the Free Energy Principle (FEP).
arXiv Detail & Related papers (2021-07-25T14:19:29Z)
- Principled Exploration via Optimistic Bootstrapping and Backward Induction [84.78836146128238]
We propose a principled exploration method for Deep Reinforcement Learning (DRL) through Optimistic Bootstrapping and Backward Induction (OB2I).
OB2I constructs a general-purpose UCB-bonus through non-parametric bootstrap in DRL.
We build theoretical connections between the proposed UCB-bonus and the LSVI-UCB in a linear setting (a toy bootstrap-bonus sketch follows this entry).
arXiv Detail & Related papers (2021-05-13T01:15:44Z)
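OB2I's bonus comes from a non-parametric bootstrap over value estimates and is propagated by backward induction. The sketch below shows that combination in the smallest possible setting: bootstrapped Q-heads over a short horizon, with the max-minus-mean spread serving as a UCB-style bonus at each backup step. The shapes and the spread formula are illustrative assumptions, not the paper's exact bonus.

```python
import numpy as np

def ucb_bonus(q_heads: np.ndarray) -> np.ndarray:
    """Spread of bootstrapped Q estimates as a UCB-style bonus per action;
    an illustrative reading of a bootstrap bonus, not OB2I's definition."""
    return q_heads.max(axis=0) - q_heads.mean(axis=0)

rng = np.random.default_rng(3)
H, A, K = 3, 2, 10                    # horizon, actions, bootstrap heads
q_heads = rng.normal(size=(K, H, A))  # pretend bootstrapped one-step Q estimates
v_next = np.zeros(K)                  # each head's value of the next step
for h in reversed(range(H)):          # backward induction over the horizon
    q = q_heads[:, h, :] + v_next[:, None]  # Bellman backup per head
    print("step", h, "bonus per action:", ucb_bonus(q).round(3))
    v_next = q.max(axis=1)            # heads act greedily; uncertainty propagates
```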