Exploration by Random Reward Perturbation
- URL: http://arxiv.org/abs/2506.08737v1
- Date: Tue, 10 Jun 2025 12:34:00 GMT
- Title: Exploration by Random Reward Perturbation
- Authors: Haozhe Ma, Guoji Fu, Zhengding Luo, Jiele Wu, Tze-Yun Leong
- Abstract summary: We introduce Random Reward Perturbation (RRP), a novel exploration strategy for reinforcement learning (RL). Our theoretical analyses demonstrate that adding zero-mean noise to environmental rewards effectively enhances policy diversity during training. RRP is fully compatible with action-perturbation-based exploration strategies, such as $\epsilon$-greedy, stochastic policies, and entropy regularization.
- Score: 6.293868056239738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Random Reward Perturbation (RRP), a novel exploration strategy for reinforcement learning (RL). Our theoretical analyses demonstrate that adding zero-mean noise to environmental rewards effectively enhances policy diversity during training, thereby expanding the range of exploration. RRP is fully compatible with the action-perturbation-based exploration strategies, such as $\epsilon$-greedy, stochastic policies, and entropy regularization, providing additive improvements to exploration effects. It is general, lightweight, and can be integrated into existing RL algorithms with minimal implementation effort and negligible computational overhead. RRP establishes a theoretical connection between reward shaping and noise-driven exploration, highlighting their complementary potential. Experiments show that RRP significantly boosts the performance of Proximal Policy Optimization and Soft Actor-Critic, achieving higher sample efficiency and escaping local optima across various tasks, under both sparse and dense reward scenarios.
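As a concrete illustration of the idea, below is a minimal sketch of reward perturbation implemented as a Gymnasium reward wrapper. The Gaussian noise model, the `sigma` and `decay` hyper-parameters, and the annealing schedule are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np
import gymnasium as gym


class RandomRewardPerturbationWrapper(gym.RewardWrapper):
    """Adds zero-mean Gaussian noise to every reward (illustrative sketch)."""

    def __init__(self, env, sigma=0.1, decay=0.999):
        super().__init__(env)
        self.sigma = sigma  # initial noise scale (assumed hyper-parameter)
        self.decay = decay  # per-step annealing factor (assumed)

    def reward(self, reward):
        # Zero-mean perturbation: the expected reward is unchanged, but the
        # perturbed signal diversifies the policies encountered in training.
        noisy = reward + np.random.normal(0.0, self.sigma)
        self.sigma *= self.decay  # gradually recover the true reward signal
        return noisy


# Any standard agent (e.g. PPO or SAC) can then be trained on the wrapped
# environment without modifying the algorithm itself.
env = RandomRewardPerturbationWrapper(gym.make("MountainCarContinuous-v0"))
```

Because the injected noise has zero mean, the expected return is unchanged, which is consistent with the abstract's claim that RRP composes additively with action-perturbation strategies such as $\epsilon$-greedy and entropy regularization.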
Related papers
- On Efficient Bayesian Exploration in Model-Based Reinforcement Learning [0.24578723416255752]
We address the challenge of data-efficient exploration in reinforcement learning by examining existing principled, information-theoretic approaches to intrinsic motivation. We prove that exploration bonuses naturally signal epistemic information gains and converge to zero once the agent becomes sufficiently certain about the environment's dynamics and rewards. We then outline a general framework - Predictive Trajectory Sampling with Bayesian Exploration (PTS-BE) - which integrates model-based planning with information-theoretic bonuses to achieve sample-efficient deep exploration.
arXiv Detail & Related papers (2025-07-03T14:03:47Z)
- Leveraging Genetic Algorithms for Efficient Demonstration Generation in Real-World Reinforcement Learning Environments [0.8602553195689513]
Reinforcement Learning (RL) has demonstrated significant potential in certain real-world industrial applications. This study investigates the utilization of Genetic Algorithms (GAs) as a mechanism for improving RL performance. We propose a novel approach in which GA-generated expert demonstrations are used to enhance policy learning.
arXiv Detail & Related papers (2025-07-01T14:04:17Z)
- Random Latent Exploration for Deep Reinforcement Learning [71.88709402926415]
We introduce Random Latent Exploration (RLE), a simple yet effective exploration strategy in reinforcement learning (RL). On average, RLE outperforms noise-based methods, which perturb the agent's actions, and bonus-based exploration, which rewards the agent for attempting novel behaviors. RLE is as simple as noise-based methods, as it avoids complex bonus calculations but retains the deep exploration benefits of bonus-based methods.
arXiv Detail & Related papers (2024-07-18T17:55:22Z)
- Reinforcement Learning from Bagged Reward [46.16904382582698]
In Reinforcement Learning (RL), it is commonly assumed that an immediate reward signal is generated for each action taken by the agent.
In many real-world scenarios, designing immediate reward signals is difficult.
We propose a novel reward redistribution method equipped with a bidirectional attention mechanism.
arXiv Detail & Related papers (2024-02-06T07:26:44Z)
- Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
arXiv Detail & Related papers (2023-05-07T19:41:57Z)
- Deep RL with Hierarchical Action Exploration for Dialogue Generation [0.0]
This paper presents theoretical analysis and experiments that reveal the performance of the dialogue policy is positively correlated with the sampling size.
We introduce a novel dual-granularity Q-function that explores the most promising response category to intervene in the sampling process.
Our algorithm exhibits both explainability and controllability and generates responses with higher expected rewards.
arXiv Detail & Related papers (2023-03-22T09:29:22Z)
- Rewarding Episodic Visitation Discrepancy for Exploration in Reinforcement Learning [64.8463574294237]
We propose Rewarding Episodic Visitation Discrepancy (REVD) as an efficient and quantified exploration method.
REVD provides intrinsic rewards by evaluating the Rényi divergence-based visitation discrepancy between episodes.
It is tested on PyBullet Robotics Environments and Atari games.
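The abstract gives no implementation details for the discrepancy estimator. As a loose illustration only, a nearest-neighbour proxy for an episodic visitation bonus could look like the sketch below; `k`, `alpha`, and the distance-based bonus are assumptions, and this does not reproduce REVD's actual Rényi-divergence estimator.

```python
import numpy as np


def visitation_discrepancy_bonus(current_states, previous_states, k=3, alpha=0.5):
    """Nearest-neighbour proxy for an episodic visitation bonus (illustrative).

    States of the current episode that lie far from everything visited in the
    previous episode receive larger intrinsic rewards.
    """
    previous_states = np.asarray(previous_states, dtype=np.float64)
    bonuses = []
    for s in np.asarray(current_states, dtype=np.float64):
        dists = np.linalg.norm(previous_states - s, axis=1)
        kth_nearest = np.sort(dists)[min(k, len(dists)) - 1]
        bonuses.append(kth_nearest ** alpha)  # farther from past visits => larger bonus
    return np.asarray(bonuses)
```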
arXiv Detail & Related papers (2022-09-19T08:42:46Z)
- Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization [9.456388509414046]
Reward-Switching Policy Optimization (RSPO) is a paradigm for discovering diverse strategies in complex RL environments by iteratively finding novel policies that are both locally optimal and sufficiently different from existing ones.
Experiments show that RSPO is able to discover a wide spectrum of strategies in a variety of domains, ranging from single-agent particle-world tasks and MuJoCo continuous control to multi-agent stag-hunt games and StarCraft II challenges.
arXiv Detail & Related papers (2022-04-04T12:38:58Z)
- On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem in the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
- Exploration by Maximizing Rényi Entropy for Reward-Free RL Framework [28.430845498323745]
We consider a reward-free reinforcement learning framework that separates exploration from exploitation.
In the exploration phase, the agent learns an exploratory policy by interacting with a reward-free environment.
In the planning phase, the agent computes a good policy for any reward function based on the dataset.
arXiv Detail & Related papers (2020-06-11T05:05:31Z)
- Deep Reinforcement Learning with Robust and Smooth Policy [90.78795857181727]
We propose to learn a smooth policy that behaves smoothly with respect to states.
We develop a new framework, Smooth Regularized Reinforcement Learning ($\textbf{SR}^2\textbf{L}$), where the policy is trained with smoothness-inducing regularization.
Such regularization effectively constrains the search space, and enforces smoothness in the learned policy.
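The abstract does not spell out the regularizer. A minimal sketch of one common form of smoothness-inducing regularization, penalizing how much the policy output changes under a small state perturbation, is given below; the random (rather than adversarial) perturbation, the `epsilon` radius, and the squared-error surrogate are simplifying assumptions.

```python
import torch


def smoothness_penalty(policy, states, epsilon=0.01):
    """Illustrative smoothness-inducing regularizer for a torch policy network."""
    perturbed = states + epsilon * torch.randn_like(states)  # nearby states
    out = policy(states)            # action logits / means at the true states
    out_perturbed = policy(perturbed)
    return ((out - out_perturbed) ** 2).mean()


# Added to the usual RL objective, e.g.:
#   loss = policy_loss + lam * smoothness_penalty(policy, batch_states)
```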
arXiv Detail & Related papers (2020-03-21T00:10:29Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it to correlated actions, and combine the critic-estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)