Redeeming Intrinsic Rewards via Constrained Optimization
- URL: http://arxiv.org/abs/2211.07627v1
- Date: Mon, 14 Nov 2022 18:49:26 GMT
- Title: Redeeming Intrinsic Rewards via Constrained Optimization
- Authors: Eric Chen, Zhang-Wei Hong, Joni Pajarinen, Pulkit Agrawal
- Abstract summary: State-of-the-art reinforcement learning (RL) algorithms typically use random sampling (e.g., $\epsilon$-greedy) for exploration, but this method fails in hard exploration tasks like Montezuma's Revenge.
Prior works incentivize the agent to visit novel states using an exploration bonus (also called an intrinsic reward or curiosity).
Such methods can lead to excellent results on hard exploration tasks but can suffer from intrinsic reward bias and underperform when compared to an agent trained using only task rewards.
We propose a principled constrained policy optimization procedure that automatically tunes the importance of the intrinsic reward.
- Score: 17.203887958936168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art reinforcement learning (RL) algorithms typically use random sampling (e.g., $\epsilon$-greedy) for exploration, but this method fails in hard exploration tasks like Montezuma's Revenge. To address the challenge of exploration, prior works incentivize the agent to visit novel states using an exploration bonus (also called an intrinsic reward or curiosity). Such methods can lead to excellent results on hard exploration tasks but can suffer from intrinsic reward bias and underperform when compared to an agent trained using only task rewards. This performance decrease occurs when an agent seeks out intrinsic rewards and performs unnecessary exploration even when sufficient task reward is available. This inconsistency in performance across tasks prevents the widespread use of intrinsic rewards with RL algorithms. We propose a principled constrained policy optimization procedure that automatically tunes the importance of the intrinsic reward: it suppresses the intrinsic reward when exploration is unnecessary and increases it when exploration is required. This results in superior exploration that does not require manual tuning to balance the intrinsic reward against the task reward. Consistent performance gains across sixty-one ATARI games validate our claim. The code is available at
https://github.com/Improbable-AI/eipo.
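The abstract describes the tuning mechanism only at a high level. As a rough illustration, here is a minimal sketch of one standard way such automatic tuning can be realized, a Lagrangian-style dual update on the intrinsic-reward coefficient; the function names and the specific update rule are illustrative assumptions, not the exact EIPO procedure from the paper.
```python
# Minimal sketch (assumed, not the exact EIPO algorithm): treat the intrinsic-reward
# weight lmbda as a Lagrange multiplier and adapt it with dual gradient steps, so the
# bonus is suppressed once a task-reward constraint is met and raised otherwise.

def update_intrinsic_weight(lmbda, extrinsic_return, reference_return, lr=0.05):
    """Dual update: raise lmbda when extrinsic performance lags the reference
    (more exploration is needed), lower it once the constraint is satisfied."""
    violation = reference_return - extrinsic_return  # > 0 means under-performing
    return max(0.0, lmbda + lr * violation)

def mixed_reward(r_task, r_intrinsic, lmbda):
    # Reward actually optimized by the policy learner.
    return r_task + lmbda * r_intrinsic

# Example: the weight rises while the task return lags the reference and falls once it exceeds it.
lmbda = 1.0
for extrinsic_return in [0.0, 4.0, 9.0, 12.0]:
    lmbda = update_intrinsic_weight(lmbda, extrinsic_return, reference_return=10.0)
    print(f"extrinsic={extrinsic_return:5.1f}  lambda={lmbda:.3f}")
```
In this sketch the weight grows while the extrinsic return lags a reference value and decays toward zero once the task reward alone is sufficient, mirroring the behaviour described in the abstract.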
Related papers
- MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization [91.80034860399677]
Reinforcement learning algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards.
We introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration.
We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits.
arXiv Detail & Related papers (2024-12-16T18:59:53Z)
- Accelerating Exploration with Unlabeled Prior Data [66.43995032226466]
We study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task.
We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses the relabeled prior data concurrently alongside the online data for downstream policy and critic optimization.
arXiv Detail & Related papers (2023-11-09T00:05:17Z)
- Go Beyond Imagination: Maximizing Episodic Reachability with World Models [68.91647544080097]
In this paper, we introduce a new intrinsic reward design called GoBI - Go Beyond Imagination.
We apply learned world models to generate predicted future states with random actions.
Our method greatly outperforms previous state-of-the-art methods on 12 of the most challenging Minigrid navigation tasks.
arXiv Detail & Related papers (2023-08-25T20:30:20Z)
- Successor-Predecessor Intrinsic Exploration [18.440869985362998]
We focus on exploration with intrinsic rewards, where the agent transiently augments the external rewards with self-generated intrinsic rewards.
We propose Successor-Predecessor Intrinsic Exploration (SPIE), an exploration algorithm based on a novel intrinsic reward combining prospective and retrospective information.
We show that SPIE yields more efficient and ethologically plausible exploratory behaviour in environments with sparse rewards and bottleneck states than competing methods.
arXiv Detail & Related papers (2023-05-24T16:02:51Z)
- Sparse Reward Exploration via Novelty Search and Emitters [55.41644538483948]
We introduce the SparsE Reward Exploration via Novelty and Emitters (SERENE) algorithm.
SERENE separates the search space exploration and reward exploitation into two alternating processes.
A meta-scheduler allocates a global computational budget by alternating between the two processes.
arXiv Detail & Related papers (2021-02-05T12:34:54Z)
- Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-time Strategy Games [0.0]
Training agents using Reinforcement Learning in games with sparse rewards is a challenging problem.
We present a novel technique that successfully trains agents to eventually optimize the true objective in games with sparse rewards.
arXiv Detail & Related papers (2020-10-05T03:43:06Z)
- Fast active learning for pure exploration in reinforcement learning [48.98199700043158]
We show that bonuses that scale with $1/n$ bring faster learning rates, improving the known upper bounds with respect to the dependence on the horizon.
We also show that with an improved analysis of the stopping time, we can improve by a factor $H$ the sample complexity in the best-policy identification setting.
arXiv Detail & Related papers (2020-07-27T11:28:32Z)
- Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
- Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning [34.38011902445557]
Reinforcement learning with sparse rewards is still an open challenge.
We present a novel approach that plans exploration actions far into the future by using a long-term visitation count.
Contrary to existing methods which use models of reward and dynamics, our approach is off-policy and model-free (a generic count-based bonus of the kind such methods build on is sketched after this list).
arXiv Detail & Related papers (2020-01-01T01:01:15Z)
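Two of the entries above, the $1/n$-bonus analysis and the long-term visitation approach, rely on count-based exploration bonuses. Below is a minimal, generic sketch of such a bonus; the helper name and parameters are illustrative assumptions and the code is not taken from either paper.
```python
# Generic count-based exploration bonus (an illustration only; not the long-term
# visitation value proposed in the paper above): rarely visited states receive a
# larger bonus, which decays as their visit count grows.
from collections import defaultdict

visit_counts = defaultdict(int)

def count_bonus(state, scale=1.0, exponent=0.5):
    """Return scale / n**exponent for the n-th visit to `state`.
    exponent=0.5 gives the classic 1/sqrt(n) bonus; exponent=1.0 gives the
    faster-decaying 1/n bonus discussed in the pure-exploration entry above."""
    visit_counts[state] += 1
    return scale / (visit_counts[state] ** exponent)

# Example: the bonus for a repeatedly visited state shrinks toward zero.
for _ in range(4):
    print(round(count_bonus("room_1"), 3))  # 1.0, 0.707, 0.577, 0.5
```
Approaches such as the long-term visitation value extend this short-horizon signal by propagating it far into the future; that extension is beyond this simple sketch.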