Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State
Entropy Estimate
- URL: http://arxiv.org/abs/2007.04640v2
- Date: Fri, 26 Feb 2021 20:48:49 GMT
- Title: Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State
Entropy Estimate
- Authors: Mirco Mutti, Lorenzo Pratissoli, Marcello Restelli
- Abstract summary: In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy?
We argue that the entropy of the state distribution induced by finite-horizon trajectories is a sensible target.
We present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, $k$-nearest neighbors estimate of the state distribution entropy.
- Score: 40.97686031763918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In a reward-free environment, what is a suitable intrinsic objective for an
agent to pursue so that it can learn an optimal task-agnostic exploration
policy? In this paper, we argue that the entropy of the state distribution
induced by finite-horizon trajectories is a sensible target. In particular, we
present a novel and practical policy-search algorithm, Maximum Entropy POLicy
optimization (MEPOL), to learn a policy that maximizes a non-parametric,
$k$-nearest neighbors estimate of the state distribution entropy. In contrast
to known methods, MEPOL is completely model-free, as it requires neither estimating
the state distribution of any policy nor modeling the transition dynamics.
Then, we empirically show that MEPOL allows learning a maximum-entropy
exploration policy in high-dimensional, continuous-control domains, and how
this policy facilitates learning a variety of meaningful reward-based tasks
downstream.
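The objective MEPOL optimizes is a non-parametric, $k$-nearest-neighbors estimate of the entropy of the states visited along finite-horizon trajectories. Below is a minimal sketch of such an estimate (a Kozachenko-Leonenko-style estimator) over a batch of sampled states; the function name and the default $k$ are illustrative assumptions, and the importance-weighted, differentiable version that MEPOL actually ascends with policy gradients is omitted.

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.spatial import cKDTree

def knn_state_entropy(states, k=4):
    """Kozachenko-Leonenko k-NN entropy estimate of the empirical state
    distribution; `states` has shape [N, d] (one row per visited state)."""
    states = np.asarray(states, dtype=np.float64)
    n, d = states.shape
    # Distance from each state to its k-th nearest neighbor
    # (query includes the point itself, hence k + 1 neighbors).
    tree = cKDTree(states)
    dists, _ = tree.query(states, k=k + 1)
    r_k = np.maximum(dists[:, -1], 1e-12)          # guard against log(0)
    # Log-volume of the unit ball in R^d.
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(r_k))
```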
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
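As a rough illustration of this learn-stochastic, deploy-deterministic scheme (not the paper's algorithm), the sketch below assumes a linear-Gaussian policy whose standard deviation sets the exploration level during learning, while deployment uses the deterministic mean action.

```python
import numpy as np

class GaussianPolicy:
    """Linear-Gaussian policy: a ~ N(theta @ s, sigma^2 I).
    `sigma` is the exploration level tuned during learning;
    deployment uses the deterministic mean action."""
    def __init__(self, state_dim, action_dim, sigma=0.5, seed=0):
        self.theta = np.zeros((action_dim, state_dim))
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)

    def sample(self, s):      # stochastic action used while learning
        mean = self.theta @ s
        return mean + self.sigma * self.rng.standard_normal(mean.shape)

    def deploy(self, s):      # deterministic action used at test time
        return self.theta @ s
```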
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
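The step from an $O\left(\frac{1}{T}\right)$ convergence rate to $O\left(\log(T)\right)$ regret is the standard harmonic-sum argument, sketched below with a generic constant $C$ (an illustration, not the paper's exact bound).

```latex
% Assume the iterates satisfy the sublinear suboptimality bound
%   J(\pi^*) - J(\pi_t) \le C / t  for all  t \ge 1.
% Summing over the T iterations bounds the regret by a harmonic sum:
\mathrm{Regret}(T)
  = \sum_{t=1}^{T} \bigl( J(\pi^*) - J(\pi_t) \bigr)
  \le \sum_{t=1}^{T} \frac{C}{t}
  \le C \, (1 + \log T)
  = O\!\left(\log T\right).
```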
arXiv Detail & Related papers (2024-03-11T15:25:03Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
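As a generic sketch of a latent-variable (generative) policy that can express multimodal action distributions, the snippet below samples a latent code and decodes it together with the state into an action; the shapes, the tanh decoder, and all names are illustrative assumptions rather than the paper's architecture.

```python
import numpy as np

def latent_variable_policy(state, W_s, W_z, b, rng, latent_dim=4):
    """Sample an action from a simple latent-variable policy:
    z ~ N(0, I);  a = tanh(W_s @ state + W_z @ z + b).
    Different latent draws can land in different action modes,
    which a single unimodal Gaussian head cannot represent."""
    z = rng.standard_normal(latent_dim)
    return np.tanh(W_s @ state + W_z @ z + b)

# Usage: drawing several actions for the same state exposes the multimodality.
rng = np.random.default_rng(0)
state = np.ones(3)
W_s = rng.standard_normal((2, 3))
W_z = rng.standard_normal((2, 4))
b = np.zeros(2)
actions = [latent_variable_policy(state, W_s, W_z, b, rng) for _ in range(5)]
```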
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Matryoshka Policy Gradient for Entropy-Regularized RL: Convergence and Global Optimality [0.5261718469769449]
A novel Policy Gradient (PG) algorithm, called $\textit{Matryoshka Policy Gradient}$ (MPG), is introduced and studied.
We prove uniqueness and characterize the optimal policy of the entropy regularized objective, together with global convergence of MPG.
As a proof of concept, we numerically evaluate MPG on standard test benchmarks.
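For reference, a common form of the entropy-regularized objective targeted by such methods is sketched below (discounted, with temperature $\tau$); this is the generic formulation and not necessarily the exact nested objective that MPG optimizes.

```latex
J_{\tau}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[ \sum_{t \ge 0} \gamma^{t}
\Bigl( r(s_t, a_t) + \tau \, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \Bigr) \right],
\qquad
\mathcal{H}\bigl(\pi(\cdot \mid s)\bigr) \;=\; -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}
\bigl[ \log \pi(a \mid s) \bigr].
```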
arXiv Detail & Related papers (2023-03-22T17:56:18Z)
- Latent State Marginalization as a Low-cost Approach for Improving Exploration [79.12247903178934]
We propose the adoption of latent variable policies within the MaxEnt framework.
We show that latent variable policies naturally emerge under the use of world models with a latent belief state.
We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training.
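A minimal sketch of the marginalization idea, assuming a latent-variable policy $\pi(a \mid s, z)$ and an approximate belief $q(z \mid s)$: the marginal action distribution is estimated by averaging over a few latent samples. The function names and the Monte Carlo estimator are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def marginal_action_probs(pi_given_z, sample_belief, state, n_samples=16, rng=None):
    """Estimate pi(a | s) = E_{z ~ q(z|s)}[ pi(a | s, z) ] by Monte Carlo.
    `pi_given_z(state, z)` returns a probability vector over discrete actions;
    `sample_belief(state, rng)` returns one latent sample z."""
    rng = rng or np.random.default_rng(0)
    probs = [pi_given_z(state, sample_belief(state, rng)) for _ in range(n_samples)]
    return np.mean(probs, axis=0)   # marginalized action distribution
```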
arXiv Detail & Related papers (2022-10-03T15:09:12Z)
- Sigmoidally Preconditioned Off-policy Learning: a new exploration method for reinforcement learning [14.991913317341417]
We focus on an off-policy Actor-Critic architecture and propose a novel method called Preconditioned Proximal Policy Optimization (P3O).
P3O can control the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective.
Results show that our P3O maximizes the CPI objective better than PPO during the training process.
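The summary describes applying a preconditioner to the CPI (importance-weighted advantage) objective to keep importance-sampling variance in check. The sketch below is only a generic illustration of that idea: a sigmoid-shaped, bounded transform of the importance ratio before it weights the advantage. The exact preconditioner used by P3O may differ.

```python
import numpy as np

def preconditioned_cpi_objective(logp_new, logp_old, advantages, temperature=2.0):
    """Generic sketch of a 'preconditioned' CPI objective: the importance
    ratio rho = pi_new / pi_old is passed through a bounded, sigmoid-shaped
    transform before weighting the advantage, so no single sample can
    dominate the (otherwise high-variance) estimate."""
    rho = np.exp(logp_new - logp_old)                                # importance ratio
    bounded_rho = 2.0 / (1.0 + np.exp(-temperature * (rho - 1.0)))   # in (0, 2), equals 1 at rho = 1
    return np.mean(bounded_rho * advantages)
```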
arXiv Detail & Related papers (2022-05-20T09:38:04Z)
- Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms [10.356356383401566]
In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states.
We discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target.
We introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions.
arXiv Detail & Related papers (2022-02-15T15:04:10Z)
- A Study of Policy Gradient on a Class of Exactly Solvable Models [35.90565839381652]
We explore the evolution of the policy parameters, for a special class of exactly solvable POMDPs, as a continuous-state Markov chain.
Our approach relies heavily on random walk theory, specifically on affine Weyl groups.
We analyze the probabilistic convergence of policy gradient to different local maxima of the value function.
arXiv Detail & Related papers (2020-11-03T17:27:53Z)
- Global Convergence of Policy Gradient for Linear-Quadratic Mean-Field Control/Game in Continuous Time [109.06623773924737]
We study the policy gradient method for the linear-quadratic mean-field control and game.
We show that it converges to the optimal solution at a linear rate, which is verified by a synthetic simulation.
arXiv Detail & Related papers (2020-08-16T06:34:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.