Rank the Episodes: A Simple Approach for Exploration in
Procedurally-Generated Environments
- URL: http://arxiv.org/abs/2101.08152v2
- Date: Thu, 4 Feb 2021 15:48:12 GMT
- Title: Rank the Episodes: A Simple Approach for Exploration in
Procedurally-Generated Environments
- Authors: Daochen Zha, Wenye Ma, Lei Yuan, Xia Hu, Ji Liu
- Abstract summary: Methods based on intrinsic rewards often fall short in procedurally-generated environments.
We introduce RAPID, a simple yet effective episode-level exploration method for procedurally-generated environments.
We demonstrate our method on several procedurally-generated MiniGrid environments, a first-person-view 3D Maze navigation task from MiniWorld, and several sparse MuJoCo tasks.
- Score: 66.80667987347151
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploration under sparse reward is a long-standing challenge of model-free
reinforcement learning. The state-of-the-art methods address this challenge by
introducing intrinsic rewards to encourage exploration in novel states or
uncertain environment dynamics. Unfortunately, methods based on intrinsic
rewards often fall short in procedurally-generated environments, where a
different environment is generated in each episode so that the agent is not
likely to visit the same state more than once. Motivated by how humans
distinguish good exploration behaviors by looking into the entire episode, we
introduce RAPID, a simple yet effective episode-level exploration method for
procedurally-generated environments. RAPID regards each episode as a whole and
gives an episodic exploration score from both per-episode and long-term views.
Those highly scored episodes are treated as good exploration behaviors and are
stored in a small ranking buffer. The agent then imitates the episodes in the
buffer to reproduce the past good exploration behaviors. We demonstrate our
method on several procedurally-generated MiniGrid environments, a
first-person-view 3D Maze navigation task from MiniWorld, and several sparse
MuJoCo tasks. The results show that RAPID significantly outperforms the
state-of-the-art intrinsic reward strategies in terms of sample efficiency and
final performance. The code is available at https://github.com/daochenzha/rapid
Related papers
- Neuro-Inspired Fragmentation and Recall to Overcome Catastrophic
Forgetting in Curiosity [31.396929282048916]
Deep reinforcement learning methods exhibit impressive performance on a range of tasks but struggle on hard exploration tasks in large environments with sparse rewards.
Prediction-based intrinsic rewards can help agents solve hard exploration tasks, but they can suffer from catastrophic forgetting.
We propose a new method FARCuriosity, inspired by how humans and animals learn.
arXiv Detail & Related papers (2023-10-26T16:28:17Z) - Exploration via Elliptical Episodic Bonuses [22.404871878551354]
We introduce Exploration via Elliptical Episodic Bonuses (E3B), a new method which extends count-based episodic bonuses to continuous state spaces.
Our method sets a new state-of-the-art across 16 challenging tasks from the MiniHack suite, without requiring task-specific inductive biases.
E3B also matches existing methods on sparse reward, pixel-based VizDoom environments, and outperforms existing methods in reward-free exploration on Habitat.
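As a rough illustration of how an episodic bonus can be extended beyond tabular counts, the sketch below computes an elliptical bonus from the inverse covariance of the embeddings seen so far in the current episode; the embedding and the regularizer `lambda_reg` are assumptions of this sketch (E3B learns its embedding with an inverse-dynamics objective).
```python
# Hedged sketch of an elliptical episodic bonus: a state's bonus is its
# embedding's quadratic form under the inverse covariance of embeddings
# observed so far in the episode. Not the reference implementation.
import numpy as np

class EllipticalEpisodicBonus:
    def __init__(self, dim, lambda_reg=0.1):
        self.dim = dim
        self.lambda_reg = lambda_reg
        self.reset()

    def reset(self):
        # Reset at the start of every episode (the bonus is episodic).
        self.cov_inv = np.eye(self.dim) / self.lambda_reg

    def bonus(self, phi):
        # b(s) = phi(s)^T C^{-1} phi(s), then a rank-one update of C^{-1}
        # via the Sherman-Morrison formula.
        b = float(phi @ self.cov_inv @ phi)
        u = self.cov_inv @ phi
        self.cov_inv -= np.outer(u, u) / (1.0 + b)
        return b
```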
arXiv Detail & Related papers (2022-10-11T22:10:23Z) - Long-Term Exploration in Persistent MDPs [68.8204255655161]
In this paper, we propose an exploration method called Rollback-Explore (RbExplore), which utilizes the concept of the persistent Markov decision process.
We test our algorithm in the hard-exploration Prince of Persia game, without rewards and domain knowledge.
arXiv Detail & Related papers (2021-09-21T13:47:04Z) - MADE: Exploration via Maximizing Deviation from Explored Regions [48.49228309729319]
In online reinforcement learning (RL), efficient exploration remains challenging in high-dimensional environments with sparse rewards.
We propose a new exploration approach via maximizing the deviation of the occupancy of the next policy from the explored regions.
Our approach significantly improves sample efficiency over state-of-the-art methods.
arXiv Detail & Related papers (2021-06-18T17:57:00Z) - BeBold: Exploration Beyond the Boundary of Explored Regions [66.88415950549556]
In this paper, we propose the regulated difference of inverse visitation counts as a simple but effective criterion for intrinsic reward (IR).
The criterion helps the agent explore Beyond the Boundary of explored regions and mitigates common issues in count-based methods, such as short-sightedness and detachment.
The resulting method, BeBold, solves the 12 most challenging procedurally-generated tasks in MiniGrid with just 120M environment steps, without any curriculum learning.
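A minimal sketch of the regulated-difference criterion, assuming tabular visitation counts (pixel observations would need pseudo-counts or hashing, which this sketch omits):
```python
# Regulated (clipped) difference of inverse visitation counts, with an
# episodic first-visit restriction; counts are tabular for clarity.
from collections import Counter

lifelong_counts = Counter()

def regulated_difference_bonus(s, s_next, visited_this_episode):
    lifelong_counts[s_next] += 1
    # Positive only when moving beyond the boundary of better-explored regions.
    diff = 1.0 / lifelong_counts[s_next] - 1.0 / max(lifelong_counts[s], 1)
    # Reward the crossing only on the first visit within the episode.
    first_visit = s_next not in visited_this_episode
    visited_this_episode.add(s_next)
    return max(diff, 0.0) * float(first_visit)
```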
arXiv Detail & Related papers (2020-12-15T21:26:54Z) - Fast active learning for pure exploration in reinforcement learning [48.98199700043158]
We show that bonuses that scale with $1/n$ bring faster learning rates, improving the known upper bounds with respect to the dependence on the horizon.
We also show that with an improved analysis of the stopping time, we can improve by a factor $H$ the sample complexity in the best-policy identification setting.
arXiv Detail & Related papers (2020-07-27T11:28:32Z) - RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated
Environments [15.736899098702972]
We propose a novel type of intrinsic reward which encourages the agent to take actions that lead to significant changes in its learned state representation.
We evaluate our method on multiple challenging procedurally-generated tasks in MiniGrid.
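A hedged sketch of an impact-driven bonus of this kind is shown below; `encoder` stands in for the learned state representation, which RIDE trains with forward- and inverse-dynamics losses, and the discretised state key used for episodic counting is an assumption of this sketch.
```python
# Impact-driven bonus: reward the change in a learned state embedding,
# discounted by an episodic visitation count so that bouncing between the
# same states stops paying off. Illustrative sketch only.
import numpy as np
from collections import Counter

def impact_driven_bonus(encoder, obs, next_obs, episodic_counts: Counter, state_key):
    phi, phi_next = encoder(obs), encoder(next_obs)
    episodic_counts[state_key] += 1          # count of the (discretised) next state
    change = np.linalg.norm(phi_next - phi)  # L2 change in the representation
    return change / np.sqrt(episodic_counts[state_key])
```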
arXiv Detail & Related papers (2020-02-27T18:03:16Z) - Never Give Up: Learning Directed Exploration Strategies [63.19616370038824]
We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies.
We construct an episodic memory-based intrinsic reward using k-nearest neighbors over the agent's recent experience to train the directed exploratory policies.
A self-supervised inverse dynamics model is used to train the embeddings of the nearest neighbour lookup, biasing the novelty signal towards what the agent can control.
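A rough sketch of such an episodic k-nearest-neighbour bonus follows; the kernel constants and the `embedding` input are illustrative, and in NGU the embedding comes from the self-supervised inverse-dynamics model mentioned above.
```python
# Episodic kNN bonus sketch: similarity to the k closest embeddings already
# stored in the episodic memory determines how "new" the current state is.
import numpy as np

def episodic_knn_bonus(embedding, memory, k=10, eps=1e-3, c=1e-3):
    memory.append(embedding)
    if len(memory) <= 1:
        return 1.0
    dists = np.sort([np.sum((embedding - m) ** 2) for m in memory[:-1]])[:k]
    dists = dists / (np.mean(dists) + c)          # normalise by a running scale
    kernel = eps / (dists + eps)                  # inverse-kernel similarity
    return 1.0 / np.sqrt(np.sum(kernel) + c)      # larger when neighbours are far
```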
arXiv Detail & Related papers (2020-02-14T13:57:22Z) - Long-Term Visitation Value for Deep Exploration in Sparse Reward
Reinforcement Learning [34.38011902445557]
Reinforcement learning with sparse rewards is still an open challenge.
We present a novel approach that plans exploration actions far into the future by using a long-term visitation count.
Contrary to existing methods which use models of reward and dynamics, our approach is off-policy and model-free.
arXiv Detail & Related papers (2020-01-01T01:01:15Z)