Bag of Policies for Distributional Deep Exploration
- URL: http://arxiv.org/abs/2308.01759v1
- Date: Thu, 3 Aug 2023 13:43:03 GMT
- Title: Bag of Policies for Distributional Deep Exploration
- Authors: Asen Nachkov, Luchen Li, Giulia Luise, Filippo Valdettaro and Aldo Faisal
- Abstract summary: Bag of Policies (BoP) is built on top of any return distribution estimator by maintaining a population of its copies.
During training, each episode is controlled by only one of the heads and the collected state-action pairs are used to update all heads off-policy.
BoP results in greater robustness and speed during learning as demonstrated by our experimental results on ALE Atari games.
- Score: 7.522221438479138
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Efficient exploration in complex environments remains a major challenge for
reinforcement learning (RL). Compared to previous Thompson sampling-inspired
mechanisms that enable temporally extended exploration, i.e., deep exploration,
we focus on deep exploration in distributional RL. We develop here a general
purpose approach, Bag of Policies (BoP), that can be built on top of any return
distribution estimator by maintaining a population of its copies. BoP consists
of an ensemble of multiple heads that are updated independently. During
training, each episode is controlled by only one of the heads and the collected
state-action pairs are used to update all heads off-policy, leading to distinct
learning signals for each head which diversify learning and behaviour. To test
whether an optimistic ensemble method can improve distributional RL as it did
scalar RL, e.g., via Bootstrapped DQN, we implement the BoP approach with a
population of distributional actor-critics using Bayesian Distributional Policy
Gradients (BDPG). The population thus approximates a posterior distribution of
return distributions along with a posterior distribution of policies. Another
benefit of building upon BDPG is that it allows us to analyze global posterior
uncertainty alongside a local curiosity bonus for exploration. As BDPG is
already an optimistic method, this pairing helps to investigate whether optimism
can be accumulated in distributional RL. Overall, BoP results in greater
robustness and speed during learning as demonstrated by our experimental
results on ALE Atari games.
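The control-and-update pattern the abstract describes (one randomly chosen head acts for a whole episode, while every head learns off-policy from the shared experience) can be sketched as follows. This is a minimal illustration only: the toy environment, the tabular particle heads, and their crude TD-style update are placeholders invented here, not the paper's BDPG actor-critics or the ALE setup.

```python
# Minimal sketch only: ToyEnv and DistributionalHead are invented placeholders,
# not the paper's BDPG actor-critics or the ALE environments.
import random
import numpy as np

class ToyEnv:
    """Tiny 1-D chain; reaching the right end pays reward 1 and ends the episode."""
    def __init__(self, n_states=10):
        self.n_states = n_states
        self.s = 0
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):
        self.s = max(0, min(self.n_states - 1, self.s + (1 if action == 1 else -1)))
        done = self.s == self.n_states - 1
        return self.s, (1.0 if done else 0.0), done

class DistributionalHead:
    """Placeholder return-distribution estimator: a table of return 'particles'."""
    def __init__(self, n_states, n_actions=2, n_particles=8, seed=0):
        rng = np.random.default_rng(seed)
        self.z = rng.normal(0.0, 0.1, size=(n_states, n_actions, n_particles))
    def act(self, s):
        return int(np.argmax(self.z[s].mean(axis=-1)))     # greedy on the mean return
    def update(self, s, a, r, s_next, done, gamma=0.99, lr=0.1):
        # Crude distributional TD step toward a bootstrapped target.
        target = r + (0.0 if done else gamma * self.z[s_next].mean(axis=-1).max())
        self.z[s, a] += lr * (target - self.z[s, a])

def train_bop(n_heads=5, episodes=200, max_steps=100):
    env = ToyEnv()
    heads = [DistributionalHead(env.n_states, seed=k) for k in range(n_heads)]
    for _ in range(episodes):
        actor = random.choice(heads)           # one head controls the whole episode
        s = env.reset()
        for _ in range(max_steps):
            a = actor.act(s)
            s_next, r, done = env.step(a)
            for head in heads:                 # all heads are updated off-policy
                head.update(s, a, r, s_next, done)
            s = s_next
            if done:
                break
    return heads

if __name__ == "__main__":
    heads = train_bop()
    print([round(float(h.z[0].mean()), 3) for h in heads])
```

Because the heads start from different initialisations and only occasionally control behaviour themselves, each receives a distinct learning signal from the shared data, which is the diversification the abstract refers to.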
Related papers
- On the Importance of Exploration for Generalization in Reinforcement Learning [89.63074327328765]
We propose EDE: Exploration via Distributional Ensemble, a method that encourages exploration of states with high uncertainty.
Our algorithm is the first value-based approach to achieve state-of-the-art on both Procgen and Crafter.
arXiv Detail & Related papers (2023-06-08T18:07:02Z)
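As a rough illustration of the idea summarised above (steering exploration toward states with high uncertainty using an ensemble), the sketch below scores actions by the ensemble mean plus ensemble disagreement. The std-based bonus and the coefficient beta are assumptions made for illustration and are not claimed to be EDE's exact rule.

```python
# Illustrative only: the std-based bonus and beta are assumptions, not EDE's exact rule.
import numpy as np

def uncertainty_guided_action(ensemble_q, beta=1.0):
    """ensemble_q: (n_members, n_actions) value estimates for the current state."""
    mean_q = ensemble_q.mean(axis=0)        # value estimate
    disagreement = ensemble_q.std(axis=0)   # epistemic-uncertainty proxy
    return int(np.argmax(mean_q + beta * disagreement))

# The members disagree strongly about action 2, so it gets tried despite a lower mean.
q = np.array([[1.0, 0.5, 0.2],
              [1.1, 0.6, 1.8],
              [0.9, 0.4, -0.5]])
print(uncertainty_guided_action(q))
```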
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
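The summary names the mechanism directly: sample the Q function from its posterior with Langevin Monte Carlo. A minimal SGLD-style sketch on a linear toy Q model follows; the model, the MSE TD loss, and the step size are illustrative assumptions rather than the paper's setup.

```python
# Illustrative only: linear Q model, MSE TD loss, and step size are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(theta, grad_loss, step_size):
    """One Langevin step: a gradient step on the loss plus Gaussian noise, so the
    iterates approximately sample the posterior over theta rather than its mode."""
    noise = rng.normal(size=theta.shape)
    return theta - step_size * grad_loss + np.sqrt(2.0 * step_size) * noise

# Toy setup: Q(s, a) = phi(s, a) @ theta fitted to fixed TD targets y.
phi = rng.normal(size=(64, 8))                              # features of sampled (s, a)
y = phi @ rng.normal(size=8) + 0.1 * rng.normal(size=64)    # stand-in TD targets
theta = np.zeros(8)
for _ in range(1000):
    grad = phi.T @ (phi @ theta - y) / len(y)               # gradient of 0.5 * MSE
    theta = sgld_step(theta, grad, step_size=1e-3)
print(theta)   # one approximate posterior sample of the Q parameters
```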
- One-Step Distributional Reinforcement Learning [10.64435582017292]
We present the simpler one-step distributional reinforcement learning (OS-DistrRL) framework.
We show that our approach comes with a unified theory for both policy evaluation and control.
We propose two OS-DistrRL algorithms for which we provide an almost sure convergence analysis.
arXiv Detail & Related papers (2023-04-27T06:57:00Z)
- Policy Evaluation in Distributional LQR [70.63903506291383]
We provide a closed-form expression of the distribution of the random return.
We show that this distribution can be approximated by a finite number of random variables.
Using the approximate return distribution, we propose a zeroth-order policy gradient algorithm for risk-averse LQR.
arXiv Detail & Related papers (2023-03-23T20:27:40Z)
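A small sketch of a zeroth-order (perturbation-based) policy-gradient step for a risk-averse LQR objective, the kind of update described above. The scalar dynamics, the mean-minus-deviation risk measure, and all constants are assumptions for illustration, not the paper's formulation or its closed-form return distribution.

```python
# Illustrative only: scalar dynamics, a mean-minus-deviation risk measure, and the
# constants below are assumptions, not the paper's LQR formulation.
import numpy as np

rng = np.random.default_rng(0)
a, b, gamma, horizon = 0.9, 0.5, 0.95, 50

def sampled_returns(K, n_rollouts=256):
    """Monte Carlo samples of the (negated) discounted quadratic cost under gain K."""
    out = np.empty(n_rollouts)
    for i in range(n_rollouts):
        x, g = 1.0, 0.0
        for t in range(horizon):
            u = -K * x
            g -= (gamma ** t) * (x * x + 0.1 * u * u)
            x = a * x + b * u + 0.1 * rng.normal()
        out[i] = g
    return out

def risk_averse_objective(K, lam=1.0):
    r = sampled_returns(K)
    return r.mean() - lam * r.std()          # penalise variability of the return

K, delta, lr = 0.0, 0.05, 0.2
for _ in range(50):
    # Two-point zeroth-order gradient estimate: no analytic gradient of J is needed.
    grad = (risk_averse_objective(K + delta) - risk_averse_objective(K - delta)) / (2 * delta)
    K += lr * grad
print("learned feedback gain:", round(K, 3))
```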
- BADDr: Bayes-Adaptive Deep Dropout RL for POMDPs [22.78390558602203]
We present a representation-agnostic formulation of BRL under partial observability, unifying the previous models under one theoretical umbrella.
We also propose a novel derivation, Bayes-Adaptive Deep Dropout RL (BADDr), based on dropout networks.
arXiv Detail & Related papers (2022-02-17T19:48:35Z)
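A hedged sketch of the dropout-network idea behind BADDr as summarised above: keeping dropout active at decision time turns one network into a distribution over Q functions, so each sampled mask behaves like one posterior sample. The architecture, dropout rate, and Thompson-style usage are illustrative assumptions and do not reproduce the paper's Bayes-adaptive POMDP formulation.

```python
# Illustrative only: the architecture, dropout rate, and Thompson-style use are
# assumptions; this does not reproduce the paper's Bayes-adaptive POMDP setup.
import torch
import torch.nn as nn

class DropoutQNet(nn.Module):
    def __init__(self, obs_dim=4, n_actions=3, hidden=64, p=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, n_actions),
        )
    def forward(self, obs):
        return self.net(obs)

qnet = DropoutQNet()
qnet.train()                          # keep dropout stochastic at decision time
obs = torch.randn(1, 4)

with torch.no_grad():
    # One stochastic pass = one sampled Q function; acting on it is Thompson-like.
    action = qnet(obs).argmax(dim=-1).item()
    # Many passes give a spread that reflects the dropout posterior's uncertainty.
    samples = torch.stack([qnet(obs) for _ in range(32)])
print(action, samples.std(dim=0))
```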
- Exploration with Multi-Sample Target Values for Distributional Reinforcement Learning [20.680417111485305]
We introduce multi-sample target values (MTV) for distributional RL, as a principled replacement for single-sample target value estimation.
The improved distributional estimates lend themselves to UCB-based exploration.
We evaluate our approach on a range of continuous control tasks and demonstrate state-of-the-art model-free performance on difficult tasks such as Humanoid control.
arXiv Detail & Related papers (2022-02-06T03:27:05Z)
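A toy sketch contrasting single-sample and multi-sample distributional target values, plus a UCB-style bonus computed from the resulting return-distribution estimate, as described above. The particle critic, plain averaging of draws, and the bonus coefficient are assumptions for illustration, not the paper's MTV construction.

```python
# Illustrative only: the particle critic, plain averaging of draws, and the bonus
# coefficient are assumptions, not the paper's MTV construction.
import numpy as np

rng = np.random.default_rng(0)

def td_target(reward, next_particles, gamma=0.99, n_target_samples=1):
    """Distributional TD target built from n_target_samples draws of the next-state
    return; 1 draw is the usual single-sample target, more draws are averaged."""
    draws = rng.choice(next_particles, size=n_target_samples)
    return reward + gamma * draws.mean()

def ucb_value(particles, beta=0.5):
    """Optimistic value from an estimated return distribution, for exploration."""
    return particles.mean() + beta * particles.std()

next_particles = rng.normal(5.0, 2.0, size=32)   # critic's samples of the next-state return
single = [td_target(1.0, next_particles, n_target_samples=1) for _ in range(1000)]
multi = [td_target(1.0, next_particles, n_target_samples=8) for _ in range(1000)]
print(np.std(single), np.std(multi))             # multi-sample targets vary far less
print(ucb_value(next_particles))
```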
- Distributional Reinforcement Learning for Multi-Dimensional Reward Functions [91.88969237680669]
We introduce Multi-Dimensional Distributional DQN (MD3QN) to model the joint return distribution from multiple reward sources.
As a by-product of joint distribution modeling, MD3QN can capture the randomness in returns for each source of reward.
In experiments, our method accurately models the joint return distribution in environments with richly correlated reward functions.
arXiv Detail & Related papers (2021-10-26T11:24:23Z)
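A rough sketch of what modelling a joint return distribution over several reward sources can look like: particles with one column per source, a joint Bellman target applied to whole particles (so cross-source correlations are kept), and per-source marginals plus the scalar return read off the same object. The particle representation and the update below are assumptions for illustration, not MD3QN's actual training objective.

```python
# Illustrative only: the particle representation and the update are assumptions,
# not MD3QN's actual training objective.
import numpy as np

rng = np.random.default_rng(0)
gamma, n_particles, n_sources = 0.99, 64, 3

# Joint return particles for one (state, action): shape (n_particles, n_sources).
joint_particles = rng.normal(0.0, 1.0, size=(n_particles, n_sources))

reward_vec = np.array([1.0, 0.0, -0.5])             # one reward per source
next_joint_particles = rng.normal(2.0, 1.0, size=(n_particles, n_sources))

# Joint Bellman target applied to whole particles, so cross-source
# correlations in the next-state return are preserved.
target_particles = reward_vec + gamma * next_joint_particles

per_source_means = target_particles.mean(axis=0)      # marginal randomness per source
total_return_samples = target_particles.sum(axis=1)   # scalar return distribution
print(per_source_means, total_return_samples.std())
```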
- Bayesian Distributional Policy Gradients [2.28438857884398]
Distributional Reinforcement Learning maintains the entire probability distribution of the reward-to-go, i.e. the return.
Bayesian Distributional Policy Gradients (BDPG) uses adversarial training in joint-contrastive learning to estimate a variational posterior from the returns.
arXiv Detail & Related papers (2021-03-20T23:42:50Z)
- Distributional Reinforcement Learning via Moment Matching [54.16108052278444]
We formulate a method that learns a finite set of statistics from each return distribution via neural networks.
Our method can be interpreted as implicitly matching all orders of moments between a return distribution and its Bellman target.
Experiments on the suite of Atari games show that our method outperforms the standard distributional RL baselines.
arXiv Detail & Related papers (2020-07-24T05:18:17Z)
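The moment-matching interpretation above can be made concrete with a kernel MMD: if the critic outputs a finite set of return samples, minimising the MMD between them and samples of the Bellman target implicitly matches all orders of moments. The Gaussian kernel and bandwidth below are illustrative assumptions.

```python
# Illustrative only: Gaussian kernel and bandwidth are assumptions.
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(pred, target, bandwidth=1.0):
    """Squared maximum mean discrepancy between two 1-D sample sets of returns."""
    k_pp = gaussian_kernel(pred, pred, bandwidth).mean()
    k_tt = gaussian_kernel(target, target, bandwidth).mean()
    k_pt = gaussian_kernel(pred, target, bandwidth).mean()
    return k_pp + k_tt - 2.0 * k_pt

rng = np.random.default_rng(0)
pred_particles = rng.normal(0.0, 1.0, size=32)                # critic's return samples
bellman_target = 1.0 + 0.99 * rng.normal(2.0, 1.0, size=32)   # r + gamma * Z(s', a') samples
print(mmd_squared(pred_particles, bellman_target))            # loss to minimise
```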
- Never Give Up: Learning Directed Exploration Strategies [63.19616370038824]
We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies.
We construct an episodic memory-based intrinsic reward using k-nearest neighbors over the agent's recent experience to train the directed exploratory policies.
A self-supervised inverse dynamics model is used to train the embeddings of the nearest neighbour lookup, biasing the novelty signal towards what the agent can control.
arXiv Detail & Related papers (2020-02-14T13:57:22Z)
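A hedged sketch of an episodic, k-nearest-neighbour intrinsic reward in the spirit of the summary above: the current observation embedding is compared against the embeddings stored this episode, and regions with many close neighbours earn a smaller bonus. The inverse-distance kernel, its constants, and the random embedding stub are assumptions; NGU's actual kernel normalisation and learned inverse-dynamics embedding differ.

```python
# Illustrative only: the inverse-distance kernel, its constants, and the random
# embedding stub are assumptions; NGU's kernel normalisation and learned
# inverse-dynamics embedding differ.
import numpy as np

def episodic_bonus(embedding, memory, k=10, eps=1e-2):
    """Intrinsic reward from the k nearest neighbours in this episode's memory:
    the more close neighbours, the smaller the bonus."""
    if not memory:
        return 1.0
    dists = np.sum((np.asarray(memory) - embedding) ** 2, axis=1)
    nearest = np.sort(dists)[:k]
    similarities = eps / (nearest + eps)            # close neighbours count more
    return 1.0 / np.sqrt(similarities.sum() + 1e-8)

rng = np.random.default_rng(0)
memory = []
for t in range(200):
    # Stand-in for a learned embedding: tightly clustered early, spread out later.
    emb = rng.normal(size=8) * (0.1 if t < 100 else 1.0)
    bonus = episodic_bonus(emb, memory)
    memory.append(emb)
    if t in (5, 99, 150):
        print(t, round(bonus, 3))   # decays in the familiar region, jumps after the shift
```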