Redeeming Intrinsic Rewards via Constrained Optimization
- URL: http://arxiv.org/abs/2211.07627v1
- Date: Mon, 14 Nov 2022 18:49:26 GMT
- Title: Redeeming Intrinsic Rewards via Constrained Optimization
- Authors: Eric Chen, Zhang-Wei Hong, Joni Pajarinen, Pulkit Agrawal
- Abstract summary: State-of-the-art reinforcement learning (RL) algorithms typically use random sampling (e.g., $\epsilon$-greedy) for exploration, but this method fails in hard exploration tasks like Montezuma's Revenge.
Prior works incentivize the agent to visit novel states using an exploration bonus (also called an intrinsic reward or curiosity).
Such methods can lead to excellent results on hard exploration tasks but can suffer from intrinsic reward bias and underperform when compared to an agent trained using only task rewards.
We propose a principled constrained policy optimization procedure that automatically tunes the importance of the intrinsic reward.
- Score: 17.203887958936168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art reinforcement learning (RL) algorithms typically use random sampling (e.g., $\epsilon$-greedy) for exploration, but this method fails in hard exploration tasks like Montezuma's Revenge. To address the challenge of exploration, prior works incentivize the agent to visit novel states using an exploration bonus (also called an intrinsic reward or curiosity). Such methods can lead to excellent results on hard exploration tasks but can suffer from intrinsic reward bias and underperform when compared to an agent trained using only task rewards. This performance decrease occurs when an agent seeks out intrinsic rewards and performs unnecessary exploration even when sufficient task reward is available. This inconsistency in performance across tasks prevents the widespread use of intrinsic rewards with RL algorithms. We propose a principled constrained policy optimization procedure that automatically tunes the importance of the intrinsic reward: it suppresses the intrinsic reward when exploration is unnecessary and increases it when exploration is required. This results in superior exploration that does not require manual tuning to balance the intrinsic reward against the task reward. Consistent performance gains across sixty-one ATARI games validate our claim. The code is available at
https://github.com/Improbable-AI/eipo.
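The abstract describes the tuning mechanism only at a high level. As a rough illustration, here is a minimal sketch of one standard way such automatic tuning can be realized, a Lagrangian-style dual update on the intrinsic-reward coefficient; the function names and the specific update rule are illustrative assumptions, not the exact EIPO procedure from the paper.
```python
# Minimal sketch (assumed, not the exact EIPO algorithm): treat the intrinsic-reward
# weight lmbda as a Lagrange multiplier and adapt it with dual gradient steps, so the
# bonus is suppressed once a task-reward constraint is met and raised otherwise.

def update_intrinsic_weight(lmbda, extrinsic_return, reference_return, lr=0.05):
    """Dual update: raise lmbda when extrinsic performance lags the reference
    (more exploration is needed), lower it once the constraint is satisfied."""
    violation = reference_return - extrinsic_return  # > 0 means under-performing
    return max(0.0, lmbda + lr * violation)

def mixed_reward(r_task, r_intrinsic, lmbda):
    # Reward actually optimized by the policy learner.
    return r_task + lmbda * r_intrinsic

# Example: the weight rises while the task return lags the reference and falls once it exceeds it.
lmbda = 1.0
for extrinsic_return in [0.0, 4.0, 9.0, 12.0]:
    lmbda = update_intrinsic_weight(lmbda, extrinsic_return, reference_return=10.0)
    print(f"extrinsic={extrinsic_return:5.1f}  lambda={lmbda:.3f}")
```
In this sketch the weight grows while the extrinsic return lags a reference value and decays toward zero once the task reward alone is sufficient, mirroring the behaviour described in the abstract.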
Related papers
- MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization [91.80034860399677]
Reinforcement learning algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards.
We introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration.
We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits.
arXiv Detail & Related papers (2024-12-16T18:59:53Z)
- Accelerating Exploration with Unlabeled Prior Data [66.43995032226466]
We study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task.
We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses the relabeled prior data concurrently alongside the online data for downstream policy and critic optimization.
arXiv Detail & Related papers (2023-11-09T00:05:17Z)
- Go Beyond Imagination: Maximizing Episodic Reachability with World Models [68.91647544080097]
In this paper, we introduce a new intrinsic reward design called GoBI - Go Beyond Imagination.
We apply learned world models to generate predicted future states with random actions.
Our method greatly outperforms previous state-of-the-art methods on 12 of the most challenging Minigrid navigation tasks.
arXiv Detail & Related papers (2023-08-25T20:30:20Z)
- Successor-Predecessor Intrinsic Exploration [18.440869985362998]
We focus on exploration with intrinsic rewards, where the agent transiently augments the external rewards with self-generated intrinsic rewards.
We propose Successor-Predecessor Intrinsic Exploration (SPIE), an exploration algorithm based on a novel intrinsic reward combining prospective and retrospective information.
We show that SPIE yields more efficient and ethologically plausible exploratory behaviour in environments with sparse rewards and bottleneck states than competing methods.
arXiv Detail & Related papers (2023-05-24T16:02:51Z)
- Sparse Reward Exploration via Novelty Search and Emitters [55.41644538483948]
We introduce the SparsE Reward Exploration via Novelty and Emitters (SERENE) algorithm.
SERENE separates the search space exploration and reward exploitation into two alternating processes.
A meta-scheduler allocates a global computational budget by alternating between the two processes.
arXiv Detail & Related papers (2021-02-05T12:34:54Z)
- Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-time Strategy Games [0.0]
Training agents using Reinforcement Learning in games with sparse rewards is a challenging problem.
We present a novel technique that successfully trains agents to eventually optimize the true objective in games with sparse rewards.
arXiv Detail & Related papers (2020-10-05T03:43:06Z)
- Fast active learning for pure exploration in reinforcement learning [48.98199700043158]
We show that bonuses that scale with $1/n$ bring faster learning rates, improving the known upper bounds with respect to the dependence on the horizon.
We also show that with an improved analysis of the stopping time, we can improve by a factor $H$ the sample complexity in the best-policy identification setting.
arXiv Detail & Related papers (2020-07-27T11:28:32Z)
- Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
- Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning [34.38011902445557]
Reinforcement learning with sparse rewards is still an open challenge.
We present a novel approach that plans exploration actions far into the future by using a long-term visitation count.
Contrary to existing methods which use models of reward and dynamics, our approach is off-policy and model-free (a generic count-based bonus of the kind such methods build on is sketched after this list).
arXiv Detail & Related papers (2020-01-01T01:01:15Z)
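Two of the entries above, the $1/n$-bonus analysis and the long-term visitation approach, rely on count-based exploration bonuses. Below is a minimal, generic sketch of such a bonus; the helper name and parameters are illustrative assumptions and the code is not taken from either paper.
```python
# Generic count-based exploration bonus (an illustration only; not the long-term
# visitation value proposed in the paper above): rarely visited states receive a
# larger bonus, which decays as their visit count grows.
from collections import defaultdict

visit_counts = defaultdict(int)

def count_bonus(state, scale=1.0, exponent=0.5):
    """Return scale / n**exponent for the n-th visit to `state`.
    exponent=0.5 gives the classic 1/sqrt(n) bonus; exponent=1.0 gives the
    faster-decaying 1/n bonus discussed in the pure-exploration entry above."""
    visit_counts[state] += 1
    return scale / (visit_counts[state] ** exponent)

# Example: the bonus for a repeatedly visited state shrinks toward zero.
for _ in range(4):
    print(round(count_bonus("room_1"), 3))  # 1.0, 0.707, 0.577, 0.5
```
Approaches such as the long-term visitation value extend this short-horizon signal by propagating it far into the future; that extension is beyond this simple sketch.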