CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies
- URL: http://arxiv.org/abs/2205.09433v1
- Date: Thu, 19 May 2022 09:48:56 GMT
- Title: CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies
- Authors: Mohamed Alami Chehboune and Fernando Llorente and Rim Kaddah and Luca
Martino and Jesse Read
- Abstract summary: We consider and study a distribution of optimal policies.
In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems.
We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability.
- Score: 62.39667564455059
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning has drawn huge interest as a tool for solving optimal
control problems. Solving a given problem (task or environment) involves
converging towards an optimal policy. However, there might exist multiple
optimal policies that can dramatically differ in their behaviour; for example,
some may be faster than the others but at the expense of greater risk. We
consider and study a distribution of optimal policies. We design a
curiosity-augmented Metropolis algorithm (CAMEO), such that we can sample
optimal policies, and such that these policies effectively adopt diverse
behaviours, since this implies greater coverage of the different possible
optimal policies. In experimental simulations we show that CAMEO indeed obtains
policies that all solve classic control problems, and even in the challenging
case of environments that provide sparse rewards. We further show that the
different policies we sample present different risk profiles, corresponding to
interesting practical applications in interpretability, and that this represents
a first step towards learning the distribution of optimal policies itself.
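The curiosity-augmented Metropolis idea described in the abstract can be illustrated with a minimal sketch: a Metropolis chain over policy parameters whose target distribution concentrates on high-return policies. The toy one-parameter "environment" and the temperature are illustrative assumptions, not the paper's actual setup.

```python
import math
import random

def policy_return(theta):
    # Toy stand-in for an environment rollout: return peaks at theta = 2.0.
    return -(theta - 2.0) ** 2

def metropolis_policies(n_samples, step=0.5, temperature=0.1, seed=0):
    """Metropolis sampling over policy parameters, targeting exp(R(theta)/T)."""
    rng = random.Random(seed)
    theta = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)  # symmetric Gaussian proposal
        # Accept with probability min(1, exp((R' - R) / T)).
        log_alpha = (policy_return(proposal) - policy_return(theta)) / temperature
        if math.log(rng.random()) < log_alpha:
            theta = proposal
        samples.append(theta)
    return samples
```

Run long enough, the chain spends most of its time near the optimum while still visiting a spread of near-optimal parameters, which is the "diverse optimal policies" effect the abstract describes.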
Related papers
- Oracle-Efficient Reinforcement Learning for Max Value Ensembles [7.404901768256101]
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, theoretically and experimentally.
In this work we aim to compete with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value.
Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies.
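The max-following policy just described can be sketched in a few lines; the constituent policies and value functions below are illustrative placeholders, not the paper's construction.

```python
def max_following_action(state, policies, value_fns):
    """At each state, act per the constituent policy with the highest value.

    policies[i](state) -> action; value_fns[i](state) -> float.
    """
    best = max(range(len(policies)), key=lambda i: value_fns[i](state))
    return policies[best](state)

# Usage: two constituent policies, each better in a different region.
go_left = lambda s: "left"
go_right = lambda s: "right"
v_left = lambda s: -s   # "left" policy is valuable in negative states
v_right = lambda s: s   # "right" policy is valuable in positive states

max_following_action(-3, [go_left, go_right], [v_left, v_right])  # "left"
max_following_action(+3, [go_left, go_right], [v_left, v_right])  # "right"
```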
arXiv Detail & Related papers (2024-05-27T01:08:23Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
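The LCB idea above can be illustrated with a minimal sketch: rank candidate policies by estimated value minus an uncertainty penalty rather than by the point estimate alone. The Hoeffding-style width used here is a deliberate simplification of the paper's generalized empirical Bernstein bound, and the numbers are made up.

```python
import math

def lcb(mean, n, delta=0.05, value_range=1.0):
    """Lower confidence bound on a value estimated from n samples."""
    if n == 0:
        return float("-inf")
    width = value_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return mean - width

def pick_pessimistic(estimates):
    """estimates: list of (mean_value, num_samples) per candidate policy."""
    return max(range(len(estimates)), key=lambda i: lcb(*estimates[i]))

# A policy with a slightly higher mean but far fewer samples loses to a
# well-estimated one -- the pessimism at work.
pick_pessimistic([(0.70, 1000), (0.75, 5)])  # -> 0
```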
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Discovering Policies with DOMiNO: Diversity Optimization Maintaining
Near Optimality [26.69352834457256]
We formalize the problem as a Constrained Markov Decision Process.
The objective is to find diverse policies, measured by the distance between the state occupancies of the policies in the set.
We demonstrate that the method can discover diverse and meaningful behaviors in various domains.
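The diversity measure described above, distance between state occupancies, can be sketched as follows: estimate each policy's state-occupancy distribution from its trajectories and compare policies by total-variation distance. The trajectories are made-up examples.

```python
from collections import Counter

def occupancy(trajectories):
    """Empirical state-occupancy distribution from a list of trajectories."""
    counts = Counter(s for traj in trajectories for s in traj)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def tv_distance(p, q):
    """Total-variation distance between two occupancy distributions."""
    states = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)

pi_a = occupancy([["s0", "s1", "s1"], ["s0", "s1", "s2"]])
pi_b = occupancy([["s0", "s3", "s3"], ["s0", "s3", "s4"]])
tv_distance(pi_a, pi_b)  # large distance -> diverse behaviours
```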
arXiv Detail & Related papers (2022-05-26T17:40:52Z) - Fast Model-based Policy Search for Universal Policy Networks [45.44896435487879]
Adapting an agent's behaviour to new environments has been one of the primary focus areas of physics-based reinforcement learning.
We propose a Gaussian Process-based prior learned in simulation, that captures the likely performance of a policy when transferred to a previously unseen environment.
We integrate this prior with a Bayesian optimisation-based policy search process to improve the efficiency of identifying the most appropriate policy from the universal policy network.
arXiv Detail & Related papers (2022-02-11T18:08:02Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Policy Optimization as Online Learning with Mediator Feedback [46.845765216238135]
Policy Optimization (PO) is a widely used approach to address continuous control tasks.
In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space.
We propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RIST) for regret minimization.
arXiv Detail & Related papers (2020-12-15T11:34:29Z) - Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a collaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a joint policy through the interactions over agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate variational inference as special differentiable layers in the policy, such that actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
arXiv Detail & Related papers (2020-04-19T15:42:55Z) - Improving Generalization of Reinforcement Learning with Minimax
Distributional Soft Actor-Critic [11.601356612579641]
This paper introduces the minimax formulation and distributional framework to improve the generalization ability of RL algorithms.
We implement our method on the decision-making tasks of autonomous vehicles at intersections and test the trained policy in distinct environments.
arXiv Detail & Related papers (2020-02-13T14:09:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.