Value Bonuses using Ensemble Errors for Exploration in Reinforcement Learning
- URL: http://arxiv.org/abs/2602.12375v1
- Date: Thu, 12 Feb 2026 20:12:17 GMT
- Title: Value Bonuses using Ensemble Errors for Exploration in Reinforcement Learning
- Authors: Abdul Wahab, Raksha Kumaraswamy, Martha White
- Abstract summary: We introduce an algorithm for exploration called Value Bonuses with Ensemble errors (VBE), which maintains an ensemble of random action-value functions (RQFs). VBE uses the errors in the estimation of these RQFs to design value bonuses that provide first-visit optimism and deep exploration. We show that VBE outperforms Bootstrap DQN and two reward bonus approaches (RND and ACB) on several classic environments used to test exploration.
- Score: 15.766581379297193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimistic value estimates provide one mechanism for directed exploration in reinforcement learning (RL). The agent acts greedily with respect to an estimate of the value plus what can be seen as a value bonus. The value bonus can be learned by estimating a value function on reward bonuses, propagating local uncertainties around rewards. However, this approach only increases the value bonus for an action retroactively, after seeing a higher reward bonus from that state and action. Such an approach does not encourage the agent to visit a state and action for the first time. In this work, we introduce an algorithm for exploration called Value Bonuses with Ensemble errors (VBE), which maintains an ensemble of random action-value functions (RQFs). VBE uses the errors in the estimation of these RQFs to design value bonuses that provide first-visit optimism and deep exploration. The key idea is to design the rewards for these RQFs in such a way that the value bonus can decrease to zero. We show that VBE outperforms Bootstrap DQN and two reward bonus approaches (RND and ACB) on several classic environments used to test exploration, and provide demonstrative experiments showing that it can scale easily to more complex environments like Atari.
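As a rough illustration of the mechanism, the sketch below implements only the first-visit-optimism ingredient in a tabular setting: fixed random action-value targets stand in for the RQFs, the value bonus is the ensemble's estimation error, and the bonus decays to zero exactly where the agent accumulates experience. The tabular setting, ensemble size, and step sizes are illustrative assumptions; the full method additionally learns value functions over designed RQF rewards to obtain deep exploration.

```python
import numpy as np

# Minimal tabular sketch of the VBE idea, not the authors' implementation.
# Names (K, beta, learning rates) and the tabular setting are assumptions.

n_states, n_actions, K = 10, 4, 8
rng = np.random.default_rng(0)

# Fixed random action-value targets (the "RQFs") and learned estimates of them.
rqf_targets = rng.normal(size=(K, n_states, n_actions))
rqf_estimates = np.zeros((K, n_states, n_actions))
q = np.zeros((n_states, n_actions))          # ordinary value estimate

def value_bonus(s, a):
    # Ensemble estimation error: large for unvisited (s, a), shrinks to zero
    # as each RQF estimate converges to its fixed random target.
    err = rqf_targets[:, s, a] - rqf_estimates[:, s, a]
    return np.sqrt(np.mean(err ** 2))

def act(s, beta=1.0):
    # Act greedily w.r.t. value estimate plus value bonus (optimism).
    scores = [q[s, a] + beta * value_bonus(s, a) for a in range(n_actions)]
    return int(np.argmax(scores))

def update_rqfs(s, a, lr=0.5):
    # Regress each estimate toward its random target on visited pairs only,
    # so the bonus decays exactly where the agent has experience.
    rqf_estimates[:, s, a] += lr * (rqf_targets[:, s, a] - rqf_estimates[:, s, a])
```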
Related papers
- Residual Reward Models for Preference-based Reinforcement Learning [11.797520525358564]
Preference-based Reinforcement Learning (PbRL) provides a way to learn high-performance policies in environments where the reward signal is hard to specify. PbRL can suffer from slow convergence since it requires training a reward model. We propose a method to effectively leverage prior knowledge with a Residual Reward Model (RRM).
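A minimal sketch of the residual idea, assuming a linear residual and a hand-specified prior term (both hypothetical here): the total reward is a fixed prior plus a small learned correction, so only the residual has to be fit from feedback.

```python
import numpy as np

def prior_reward(obs):
    # Stand-in for prior knowledge, e.g. a hand-specified shaping term.
    return -np.linalg.norm(obs)

class ResidualRewardModel:
    def __init__(self, dim, lr=1e-2):
        self.w = np.zeros(dim)   # weights of the learned linear residual
        self.lr = lr

    def reward(self, obs):
        return prior_reward(obs) + self.w @ obs   # prior + learned residual

    def update(self, obs, target):
        # Fit only the residual: the prior never has to be unlearned.
        resid_target = target - prior_reward(obs)
        self.w += self.lr * (resid_target - self.w @ obs) * obs

m = ResidualRewardModel(dim=3)
m.update(np.array([1.0, 0.0, 0.0]), target=0.5)
```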
arXiv Detail & Related papers (2025-07-01T09:43:57Z) - Information-Theoretic Reward Decomposition for Generalizable RLHF [51.550547285296794]
We decompose the reward value into two independent components: prompt-free reward and prompt-related reward. We propose a new reward learning algorithm by prioritizing data samples based on their prompt-free reward values.
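Illustratively, the prioritization step might look like the following; the scorer `prompt_free_reward` and the keep-fraction rule are hypothetical stand-ins, since the abstract does not specify the exact ranking rule.

```python
def prioritize(samples, prompt_free_reward, keep_frac=0.5):
    # Rank preference samples by how much reward the response earns without
    # its prompt; which end of the ranking to keep is an assumption here.
    scored = sorted(samples, key=lambda s: prompt_free_reward(s["response"]))
    return scored[: int(len(scored) * keep_frac)]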
arXiv Detail & Related papers (2025-04-08T13:26:07Z) - RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution [50.171320156632866]
Reinforcement learning from human feedback offers a promising approach to aligning large language models with human preferences. Current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence. We propose a more fine-grained, token-level guidance approach for RL training.
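One generic way to turn a single sequence-level reward into dense token-level guidance is to attribute it with weights that sum to one, preserving the total return; the softmax attribution below is an assumption, not the paper's redistribution rule.

```python
import numpy as np

def redistribute(sequence_reward, token_scores):
    # Spread one holistic reward over tokens; weights sum to 1, so the
    # per-token rewards sum back to the original sequence reward.
    token_scores = np.asarray(token_scores, dtype=float)
    weights = np.exp(token_scores - token_scores.max())
    weights /= weights.sum()
    return sequence_reward * weights

print(redistribute(2.0, [0.1, 1.5, 0.3]).sum())  # 2.0 (reward preserved)
```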
arXiv Detail & Related papers (2024-11-13T02:45:21Z) - Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood.
We propose ValueWalk - a new Markov chain Monte Carlo method based on this insight.
arXiv Detail & Related papers (2024-07-15T17:59:52Z) - Bayesian Inverse Reinforcement Learning for Non-Markovian Rewards [7.2933135237680595]
Inverse reinforcement learning (IRL) is the problem of inferring a reward function from expert behavior.
A reward function might be non-Markovian, depending on more than just the current state, such as a reward machine (RM).
We propose a Bayesian IRL framework for inferring RMs directly from expert behavior, requiring significant changes to the standard framework.
arXiv Detail & Related papers (2024-06-20T04:41:54Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world. Recent methods aim to mitigate misalignment by learning reward functions from human preferences. We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - A Study of Global and Episodic Bonuses for Exploration in Contextual MDPs [21.31346761487944]
We show that episodic bonuses are most effective when there is little shared structure across episodes.
We also find that combining the two bonuses can lead to more robust performance across different degrees of shared structure.
This results in an algorithm which sets a new state of the art across 16 tasks from the MiniHack suite used in prior work.
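A count-based toy version of combining the two signals: the global bonus decays with lifetime visitation while the episodic bonus rewards first visits within an episode; multiplying them is one natural combination, though the paper's exact rule may differ.

```python
import numpy as np

global_counts, episodic_counts = {}, {}

def bonus(state_key):
    g = global_counts.get(state_key, 0)      # across all episodes
    e = episodic_counts.get(state_key, 0)    # reset at episode start
    global_bonus = 1.0 / np.sqrt(g + 1)      # decays over the whole run
    episodic_bonus = 1.0 if e == 0 else 0.0  # first visit this episode
    return global_bonus * episodic_bonus     # assumed combination rule

def visit(state_key):
    global_counts[state_key] = global_counts.get(state_key, 0) + 1
    episodic_counts[state_key] = episodic_counts.get(state_key, 0) + 1

def reset_episode():
    episodic_counts.clear()
```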
arXiv Detail & Related papers (2023-06-05T20:45:30Z) - Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of DRE-MARL is demonstrated on benchmark multi-agent scenarios against SOTA baselines, in terms of both effectiveness and robustness.
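A toy rendering of the policy-weighted aggregation step: per-action-branch reward estimates are averaged under the current policy's probabilities to yield a smoother training signal. Shapes and names are assumptions.

```python
import numpy as np

def aggregate_reward(reward_branches, policy_probs):
    # reward_branches[a]: estimated reward for taking action a in this state.
    # policy_probs[a]:    probability the current policy assigns to a.
    return float(np.dot(policy_probs, reward_branches))

r_hat = np.array([0.0, 1.0, 0.2])   # multi-action-branch estimates
pi = np.array([0.1, 0.7, 0.2])      # current policy
print(aggregate_reward(r_hat, pi))  # 0.74
```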
arXiv Detail & Related papers (2022-10-14T08:31:45Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
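For reference, the exact elliptical bonus that the anti-concentrated bounds approximate is sqrt(phi^T A^{-1} phi) for a feature vector phi and regularized feature covariance A. The sketch below computes this reference quantity directly; the approximation scheme itself is omitted.

```python
import numpy as np

def elliptical_bonus(phi, past_features, lam=1.0):
    # A = lam*I + sum_i phi_i phi_i^T over visited state-action features;
    # the bonus is small in directions the data already covers.
    d = phi.shape[0]
    A = lam * np.eye(d) + past_features.T @ past_features
    return float(np.sqrt(phi @ np.linalg.solve(A, phi)))

rng = np.random.default_rng(0)
hist = rng.normal(size=(100, 5))                    # visited features
print(elliptical_bonus(rng.normal(size=5), hist))   # exploration bonus
```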
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - Fast active learning for pure exploration in reinforcement learning [48.98199700043158]
We show that bonuses that scale with $1/n$ bring faster learning rates, improving the known upper bounds with respect to the dependence on the horizon.
We also show that with an improved analysis of the stopping time, we can improve by a factor $H$ the sample complexity in the best-policy identification setting.
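The scaling difference is easy to see numerically: a 1/n bonus discharges uncertainty much faster than the classical 1/sqrt(n) schedule as the visit count n grows (constants omitted).

```python
import numpy as np

n = np.arange(1, 11)                 # visit counts
print(np.round(1 / np.sqrt(n), 3))   # slower-decaying UCB-style bonus
print(np.round(1 / n, 3))            # faster-decaying 1/n bonus
```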
arXiv Detail & Related papers (2020-07-27T11:28:32Z) - Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
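The underlying identity is easy to verify numerically: for a log-probability prediction reward, the expected reward under belief b equals -H(b) - KL(b || p), so the gap to negative entropy is exactly the KL term. The distributions below are arbitrary examples.

```python
import numpy as np

b = np.array([0.6, 0.3, 0.1])   # belief over latent states
p = np.array([0.5, 0.3, 0.2])   # agent's prediction

expected_reward = np.sum(b * np.log(p))   # E_{s~b}[log p(s)]
neg_entropy = np.sum(b * np.log(b))       # -H(b)
kl = np.sum(b * np.log(b / p))            # KL(b || p)

print(np.isclose(expected_reward, neg_entropy - kl))   # True
```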
arXiv Detail & Related papers (2020-05-11T08:13:49Z)