MADE: Exploration via Maximizing Deviation from Explored Regions
- URL: http://arxiv.org/abs/2106.10268v1
- Date: Fri, 18 Jun 2021 17:57:00 GMT
- Title: MADE: Exploration via Maximizing Deviation from Explored Regions
- Authors: Tianjun Zhang, Paria Rashidinejad, Jiantao Jiao, Yuandong Tian, Joseph
Gonzalez, Stuart Russell
- Abstract summary: In online reinforcement learning (RL), efficient exploration remains challenging in high-dimensional environments with sparse rewards.
We propose a new exploration approach via maximizing the deviation of the occupancy of the next policy from the explored regions.
Our approach significantly improves sample efficiency over state-of-the-art methods.
- Score: 48.49228309729319
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In online reinforcement learning (RL), efficient exploration remains
particularly challenging in high-dimensional environments with sparse rewards.
In low-dimensional environments, where tabular parameterization is possible,
count-based upper confidence bound (UCB) exploration methods achieve minimax
near-optimal rates. However, it remains unclear how to efficiently implement
UCB in realistic RL tasks that involve non-linear function approximation. To
address this, we propose a new exploration approach via \textit{maximizing} the
deviation of the occupancy of the next policy from the explored regions. We add
this term as an adaptive regularizer to the standard RL objective to balance
exploration vs. exploitation. We pair the new objective with a provably
convergent algorithm, giving rise to a new intrinsic reward that adjusts
existing bonuses. The proposed intrinsic reward is easy to implement and
combine with other existing RL algorithms to conduct exploration. As a proof of
concept, we evaluate the new intrinsic reward on tabular examples across a
variety of model-based and model-free algorithms, showing improvements over
count-only exploration strategies. When tested on navigation and locomotion
tasks from MiniGrid and DeepMind Control Suite benchmarks, our approach
significantly improves sample efficiency over state-of-the-art methods. Our
code is available at https://github.com/tianjunz/MADE.
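Below is a minimal sketch of the general recipe the abstract describes: augmenting the extrinsic reward with an intrinsic exploration bonus whose weight is adapted over training. It is an illustration only, not the exact MADE bonus; the count-based bonus form, the tabular (state, action) counts, and the annealing schedule are all assumptions made for the example.

```python
# Illustrative sketch (not the exact MADE bonus): combine the environment
# reward with a count-based intrinsic bonus whose weight decays over training.
from collections import defaultdict
import math


class IntrinsicRewardWrapper:
    """Adds an adaptive exploration bonus to the extrinsic reward (tabular case)."""

    def __init__(self, beta0: float = 1.0, anneal: float = 1e-4):
        self.counts = defaultdict(int)  # N(s, a) visitation counts
        self.beta0 = beta0              # initial bonus weight (assumed)
        self.anneal = anneal            # annealing rate for the weight (assumed)
        self.step = 0

    def reward(self, state, action, extrinsic_reward: float) -> float:
        self.step += 1
        self.counts[(state, action)] += 1
        # Count-based bonus: largest in rarely visited regions of the state space.
        bonus = 1.0 / math.sqrt(self.counts[(state, action)])
        # Adaptive weight: exploration pressure is reduced as training proceeds.
        beta = self.beta0 / (1.0 + self.anneal * self.step)
        return extrinsic_reward + beta * bonus
```

Because the wrapper only rewrites the scalar reward, any model-based or model-free algorithm that consumes rewards can train on the augmented signal unchanged, which is the sense in which such a bonus is easy to combine with existing methods.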
Related papers
- CCE: Sample Efficient Sparse Reward Policy Learning for Robotic Navigation via Confidence-Controlled Exploration [72.24964965882783]
Confidence-Controlled Exploration (CCE) is designed to enhance the training sample efficiency of reinforcement learning algorithms for sparse reward settings such as robot navigation.
CCE is based on a novel relationship we provide between gradient estimation and policy entropy.
We demonstrate through simulated and real-world experiments that CCE outperforms conventional methods that employ constant trajectory lengths and entropy regularization.
arXiv Detail & Related papers (2023-06-09T18:45:15Z)
- Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while automatically balancing exploration and exploitation.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method designed specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback-efficiency and sample-efficiency of preference-based RL algorithms (see the sketch after this list).
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem in the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how, under a more standard notion of low inherent Bellman error, typically employed in least-squares value iteration-style algorithms, one can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
- Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning [34.38011902445557]
Reinforcement learning with sparse rewards is still an open challenge.
We present a novel approach that plans exploration actions far into the future by using a long-term visitation count.
Contrary to existing methods which use models of reward and dynamics, our approach is off-policy and model-free.
arXiv Detail & Related papers (2020-01-01T01:01:15Z)
- Dynamic Subgoal-based Exploration via Bayesian Optimization [7.297146495243708]
Reinforcement learning in sparse-reward navigation environments is challenging and requires effective exploration.
We propose a cost-aware Bayesian optimization approach that efficiently searches over a class of dynamic subgoal-based exploration strategies.
An experimental evaluation demonstrates that the new approach outperforms existing baselines across a number of problem domains.
arXiv Detail & Related papers (2019-10-21T04:24:29Z)
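As noted in the "Reward Uncertainty for Exploration in Preference-based Reinforcement Learning" entry above, the exploration bonus comes from uncertainty in the learned reward. The sketch below illustrates one common way to realize that idea, using the disagreement (standard deviation) of an ensemble of learned reward models as the bonus; the PyTorch networks, the beta weight, and the helper names are assumptions made for the example, not a reproduction of that paper's implementation.

```python
# Illustrative sketch: ensemble disagreement over a learned reward model
# used as an intrinsic exploration bonus (an assumed realization of
# "uncertainty in the learned reward", not a verbatim reimplementation).
import torch
import torch.nn as nn


def make_reward_model(obs_dim: int, act_dim: int) -> nn.Module:
    """One member of the learned-reward ensemble (architecture is assumed)."""
    return nn.Sequential(
        nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
        nn.Linear(64, 1),
    )


def exploration_bonus(ensemble, obs: torch.Tensor, act: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    """Intrinsic reward = disagreement (std) across the reward ensemble."""
    inp = torch.cat([obs, act], dim=-1)
    preds = torch.stack([m(inp) for m in ensemble], dim=0)  # (E, B, 1)
    return beta * preds.std(dim=0).squeeze(-1)              # (B,)


# Usage: train the ensemble on preference labels as usual, then add the
# bonus to the predicted reward when relabelling stored transitions.
ensemble = [make_reward_model(4, 2) for _ in range(5)]
obs, act = torch.randn(8, 4), torch.randn(8, 2)
bonus = exploration_bonus(ensemble, obs, act)
```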