Rethinking Exploration for Sample-Efficient Policy Learning
- URL: http://arxiv.org/abs/2101.09458v1
- Date: Sat, 23 Jan 2021 08:51:04 GMT
- Title: Rethinking Exploration for Sample-Efficient Policy Learning
- Authors: William F. Whitney, Michael Bloesch, Jost Tobias Springenberg, Abbas
Abdolmaleki, Martin Riedmiller
- Abstract summary: We examine why directed exploration methods in the bonus-based exploration (BBE) family have not been more influential in the sample-efficient control problem.
Three issues have limited the applicability of BBE: bias with finite samples, slow adaptation to decaying bonuses, and lack of optimism on unseen transitions.
We propose modifications to the bonus-based exploration recipe to address each of these limitations.
The resulting algorithm, which we call UFO, produces policies that are Unbiased with finite samples, Fast-adapting as the exploration bonus changes, and Optimistic with respect to new transitions.
- Score: 20.573107021603356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy reinforcement learning for control has made great strides in terms
of performance and sample efficiency. We suggest that for many tasks the sample
efficiency of modern methods is now limited by the richness of the data
collected rather than the difficulty of policy fitting. We examine the reasons
that directed exploration methods in the bonus-based exploration (BBE) family
have not been more influential in the sample-efficient control problem. Three
issues have limited the applicability of BBE: bias with finite samples, slow
adaptation to decaying bonuses, and lack of optimism on unseen transitions. We
propose modifications to the bonus-based exploration recipe to address each of
these limitations. The resulting algorithm, which we call UFO, produces
policies that are Unbiased with finite samples, Fast-adapting as the
exploration bonus changes, and Optimistic with respect to new transitions. We
include experiments showing that rapid directed exploration is a promising
direction to improve sample efficiency for control.
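To make the recipe concrete, here is a minimal Python sketch of the generic BBE loop the abstract critiques: the policy trains on the environment reward plus a decaying count-based novelty bonus. This is an illustrative baseline under assumed names (`CountBonus`, `beta`), not the authors' UFO algorithm.

```python
# Minimal sketch of the generic bonus-based exploration (BBE) recipe the
# abstract critiques, NOT the authors' UFO algorithm. The count-based
# bonus beta / sqrt(N(s)) is an illustrative assumption.
from collections import defaultdict
import math

class CountBonus:
    """Decaying novelty bonus: beta / sqrt(N(s))."""

    def __init__(self, beta: float = 0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def __call__(self, state) -> float:
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])

def augmented_reward(r_env: float, state, bonus: CountBonus) -> float:
    # The policy is trained on r_env + bonus(state). As N(s) grows the
    # bonus decays, producing the failure modes listed above: finite-sample
    # bias and slow adaptation of the value function to a moving reward.
    return r_env + bonus(state)
```

Note that under this recipe an unvisited state contributes nothing until it is actually reached, which is the lack of optimism on unseen transitions noted above.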
Related papers
- When Do Off-Policy and On-Policy Policy Gradient Methods Align? [15.7221450531432]
Policy gradient methods are widely adopted reinforcement learning algorithms for tasks with continuous action spaces.
A common way to improve sample efficiency is to modify their objective function to be computable from off-policy samples without importance sampling.
This work studies the difference between the excursion objective and the traditional on-policy objective, which we refer to as the on-off gap.
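As a hedged illustration (using the standard definitions, which we assume rather than quote from the paper): both objectives average the target policy's value, but over different state distributions, and the on-off gap is their difference.

```python
# Hedged sketch of the "on-off gap": the on-policy objective averages the
# target policy's value over its own state visitation d_pi, while the
# excursion objective averages it over the behavior policy's visitation
# d_mu. The Monte Carlo estimation here is illustrative.
import numpy as np

def objective(value_fn, visited_states):
    # J = E_{s ~ d}[V^pi(s)], estimated from states sampled from d.
    return np.mean([value_fn(s) for s in visited_states])

def on_off_gap(value_fn, states_from_target, states_from_behavior):
    j_on = objective(value_fn, states_from_target)     # s ~ d_pi
    j_exc = objective(value_fn, states_from_behavior)  # s ~ d_mu
    return j_on - j_exc
```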
arXiv Detail & Related papers (2024-02-19T10:42:34Z)
- Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotion tasks.
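As a rough sketch of the exact-penalty idea (assumed notation and an assumed ReLU penalty form, not the paper's exact loss):

```python
# Rough sketch of an exact-penalty loss in the spirit of P3O: PPO's clipped
# surrogate on reward, plus a ReLU penalty on estimated constraint
# violation. `kappa`, the ReLU form, and the cost estimate are assumptions.
import torch

def penalized_ppo_loss(ratio, adv_r, adv_c, cost_value, cost_budget,
                       clip_eps=0.2, kappa=20.0):
    # Standard PPO clipped surrogate (to be maximized) on reward advantages.
    surr_r = torch.minimum(ratio * adv_r,
                           torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv_r)
    # The penalty is zero while the cost constraint holds, so the
    # constrained problem becomes a single unconstrained minimization.
    violation = torch.relu(cost_value + (ratio * adv_c).mean() - cost_budget)
    return -surr_r.mean() + kappa * violation
```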
arXiv Detail & Related papers (2022-05-24T06:15:51Z)
- Shortest-Path Constrained Reinforcement Learning for Sparse Reward Tasks [59.419152768018506]
We show that any optimal policy necessarily satisfies the k-SP constraint.
We propose a novel cost function that penalizes a policy for violating the SP constraint, instead of completely excluding it.
Our experiments on MiniGrid, DeepMind Lab, Atari, and Fetch show that the proposed method significantly improves proximal policy optimization (PPO).
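A toy proxy for the SP-violation cost, purely illustrative (window size and penalty scale are assumptions, not the paper's formulation): revisiting a recently seen state implies the path taken was not a shortest path.

```python
# Toy proxy for the SP-violation cost described above: subtract a penalty
# when the agent revisits a recently seen state, since a revisit implies
# the path taken was not a shortest path. Window size and penalty scale
# are illustrative assumptions, not the paper's exact cost.
from collections import deque

def sp_penalized_reward(r_env: float, state, recent: deque,
                        penalty: float = 0.1) -> float:
    cost = penalty if state in recent else 0.0
    recent.append(state)  # deque(maxlen=k) keeps a sliding window
    return r_env - cost

recent_states = deque(maxlen=8)  # k = 8, chosen arbitrarily
```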
arXiv Detail & Related papers (2021-07-13T21:39:21Z)
- MADE: Exploration via Maximizing Deviation from Explored Regions [48.49228309729319]
In online reinforcement learning (RL), efficient exploration remains challenging in high-dimensional environments with sparse rewards.
We propose a new exploration approach via maximizing the deviation of the occupancy of the next policy from the explored regions.
Our approach significantly improves sample efficiency over state-of-the-art methods.
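A hedged tabular sketch of the occupancy-deviation idea (the density model and the 1/sqrt bonus form are assumptions, not MADE's exact bonus):

```python
# Hedged sketch of an occupancy-deviation bonus: reward is high where the
# running occupancy of explored regions is low, pushing the next policy's
# occupancy away from what has already been covered. The tabular density
# model and the 1/sqrt form are illustrative assumptions.
from collections import defaultdict
import math

class OccupancyBonus:
    def __init__(self):
        self.visits = defaultdict(int)
        self.total = 0

    def update(self, sa):  # sa = (state, action)
        self.visits[sa] += 1
        self.total += 1

    def bonus(self, sa) -> float:
        # Empirical occupancy of the explored region; rarely visited
        # pairs get a large bonus, unseen pairs the largest.
        rho = self.visits[sa] / max(self.total, 1)
        return 1.0 / math.sqrt(rho + 1e-8)
```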
arXiv Detail & Related papers (2021-06-18T17:57:00Z)
- Optimal Off-Policy Evaluation from Multiple Logging Policies [77.62012545592233]
We study off-policy evaluation from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling.
We derive the minimum-variance (i.e., efficient) OPE estimator for multiple loggers, valid for any instance.
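For orientation, a sketch of the naive stratified inverse-propensity-scoring baseline that such an estimator improves upon (array layout and names are assumptions; this is not the paper's efficient estimator):

```python
# Hedged sketch of the stratified-sampling setup: each logger k produces a
# fixed-size dataset with its own propensities, and a baseline estimator
# combines the per-logger IPS estimates weighted by stratum size.
import numpy as np

def stratified_ips(datasets, target_probs):
    """datasets[k] = (rewards, logging_probs); target_probs[k] = target
    policy's probability of each logged action (all NumPy arrays)."""
    n_total = sum(len(rewards) for rewards, _ in datasets)
    value = 0.0
    for (rewards, mu), pi in zip(datasets, target_probs):
        weights = pi / mu  # per-sample importance weights
        value += (len(rewards) / n_total) * np.mean(weights * rewards)
    return value
```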
arXiv Detail & Related papers (2020-10-21T13:43:48Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how, under a more standard notion of low inherent Bellman error typically employed in least-squares value-iteration-style algorithms, one can obtain strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function.
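A hedged sketch of one least-squares value-iteration backup with linear features, the algorithm family referenced here (the feature matrices and ridge parameter are assumptions):

```python
# Hedged sketch of one least-squares value-iteration (LSVI) backup with
# linear features; feature map and ridge parameter are assumptions.
import numpy as np

def lsvi_backup(phi_sa, rewards, next_values, gamma=0.99, reg=1.0):
    """Fit w so that phi(s,a) @ w ~= r + gamma * max_a' Q_prev(s',a').

    phi_sa:      (n, d) features of observed state-action pairs
    rewards:     (n,) observed rewards (zero in the reward-free setting)
    next_values: (n,) max over a' of the previous estimate at s'
    """
    targets = rewards + gamma * next_values
    gram = phi_sa.T @ phi_sa + reg * np.eye(phi_sa.shape[1])
    return np.linalg.solve(gram, phi_sa.T @ targets)
```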
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
We show that the simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we trace the training instabilities typical of off-policy algorithms to the greedy policy update step.
Third, we show that ideas from the propensity-estimation literature can be used to importance-sample transitions from the replay buffer and update the policy so as to prevent performance deterioration.
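A hedged sketch of one standard way to control the overestimation bias, the clipped double-Q target popularized by TD3 (the networks are assumed callables; this is not claimed to be DDPG++'s exact update):

```python
# Hedged sketch of clipped double-Q bootstrapping: the TD target takes the
# minimum of two target critics, damping overestimation. q1_target,
# q2_target, and policy_target are assumed callables.
import torch

def td_target(reward, next_state, done, q1_target, q2_target,
              policy_target, gamma=0.99):
    with torch.no_grad():
        next_action = policy_target(next_state)
        q_next = torch.minimum(q1_target(next_state, next_action),
                               q2_target(next_state, next_action))
        # `done` is a 0/1 float tensor masking terminal transitions.
        return reward + gamma * (1.0 - done) * q_next
```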
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
- Adaptive Experience Selection for Policy Gradient [8.37609145576126]
Experience replay is a commonly used approach to improve sample efficiency.
However, gradient estimators that use past trajectories typically have high variance.
Existing sampling strategies for experience replay like uniform sampling or prioritised experience replay do not explicitly try to control the variance of the gradient estimates.
We propose an online learning algorithm, adaptive experience selection (AES), to adaptively learn an experience sampling distribution that explicitly minimises this variance.
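A hedged sketch of the idea (the multiplicative-weights update on gradient magnitude is an illustrative stand-in for the paper's online-learning rule, not AES itself):

```python
# Hedged sketch: maintain a sampling distribution over stored trajectories
# and adapt it online, with importance weights to keep the gradient
# estimate unbiased. The update rule below is an illustrative stand-in.
import numpy as np

class AdaptiveSampler:
    def __init__(self, n_trajectories: int, lr: float = 0.1):
        self.logits = np.zeros(n_trajectories)
        self.lr = lr

    def probs(self):
        z = np.exp(self.logits - self.logits.max())
        return z / z.sum()

    def sample(self):
        p = self.probs()
        i = np.random.choice(len(p), p=p)
        return i, 1.0 / (len(p) * p[i])  # index + importance weight

    def update(self, i: int, grad_norm: float):
        # Sample large-gradient trajectories more often; a proxy for
        # minimizing the variance of the combined gradient estimator.
        self.logits[i] += self.lr * grad_norm
```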
arXiv Detail & Related papers (2020-02-17T13:16:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.