Generalised Policy Improvement with Geometric Policy Composition
- URL: http://arxiv.org/abs/2206.08736v1
- Date: Fri, 17 Jun 2022 12:52:13 GMT
- Title: Generalised Policy Improvement with Geometric Policy Composition
- Authors: Shantanu Thakoor, Mark Rowland, Diana Borsa, Will Dabney, Rémi Munos, André Barreto
- Abstract summary: We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL.
We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs.
We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors.
- Score: 18.80807234471197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a method for policy improvement that interpolates between the
greedy approach of value-based reinforcement learning (RL) and the full
planning approach typical of model-based RL. The new method builds on the
concept of a geometric horizon model (GHM, also known as a gamma-model), which
models the discounted state-visitation distribution of a given policy. We show
that we can evaluate any non-Markov policy that switches between a set of base
Markov policies with fixed probability by a careful composition of the base
policy GHMs, without any additional learning. We can then apply generalised
policy improvement (GPI) to collections of such non-Markov policies to obtain a
new Markov policy that will in general outperform its precursors. We provide a
thorough theoretical analysis of this approach, develop applications to
transfer and standard RL, and empirically demonstrate its effectiveness over
standard GPI on a challenging deep RL continuous control task. We also provide
an analysis of GHM training methods, proving a novel convergence result
regarding previously proposed methods and showing how to train these models
stably in deep RL settings.
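To make the evaluation-by-composition idea concrete, the sketch below gives a minimal Monte Carlo illustration in Python. It is not the paper's implementation: the helpers `ghm_sample(pi, s, rng)` (draws a state from the GHM / gamma-model of base policy `pi` started at `s`), `reward(s)` (state-based reward), and `step_model(s, a)` (one-step model) are assumed interfaces, and the per-segment geometric weights are simplified placeholders, whereas the paper derives the exact composition and mixture weights.

```python
import numpy as np

def switching_value(s, policy_seq, ghm_sample, reward, gamma, n_samples=64, rng=None):
    """Monte Carlo estimate of the value of the non-Markov policy that follows
    policy_seq[0], then policy_seq[1], ..., switching at geometrically
    distributed times, using only composed GHM samples (no extra learning).
    The per-segment weights below are illustrative placeholders."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(n_samples):
        x = s
        for k, pi in enumerate(policy_seq):
            # Compose GHMs: draw from the discounted visitation distribution
            # of the k-th base policy, started from the previous sample.
            x = ghm_sample(pi, x, rng)
            total += (gamma ** k) * reward(x)
    return total / ((1.0 - gamma) * n_samples)

def gpi_action(s, actions, candidate_seqs, step_model, ghm_sample, reward, gamma, rng=None):
    """One generalised policy improvement (GPI) step: act greedily with respect
    to the best value estimate across the candidate (possibly non-Markov) policies."""
    def q(a, seq):
        r, s_next = step_model(s, a)  # assumed one-step model: (reward, next state)
        return r + gamma * switching_value(s_next, seq, ghm_sample, reward, gamma, rng=rng)
    return max(actions, key=lambda a: max(q(a, seq) for seq in candidate_seqs))
```

Acting greedily with respect to the maximum over the evaluated switching policies is the GPI step; when each candidate sequence contains a single base policy, this reduces to standard GPI over the base policies.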
Related papers
- Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action [10.219627570276689]
We develop a framework for a class of Markov Decision Processes with general state and action spaces.
We show that gradient methods converge to the globally optimal policy with nonasymptotic guarantees.
Our result establishes the first complexity result for multi-period inventory systems.
arXiv Detail & Related papers (2024-09-25T17:56:02Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Theoretically Guaranteed Policy Improvement Distilled from Model-Based
Planning [64.10794426777493]
Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks.
Recent practices tend to distill optimized action sequences into an RL policy during the training phase.
We develop an approach to distill from model-based planning to the policy.
arXiv Detail & Related papers (2023-07-24T16:52:31Z) - Model-based Offline Reinforcement Learning with Local Misspecification [35.75701143290119]
We present a model-based offline reinforcement learning policy performance lower bound that explicitly captures dynamics model misspecification and distribution mismatch.
We propose an empirical algorithm for optimal offline policy selection.
arXiv Detail & Related papers (2023-01-26T21:26:56Z) - Model-Based Offline Meta-Reinforcement Learning with Regularization [63.35040401948943]
Offline meta-RL is emerging as a promising approach to address these challenges.
MerPO learns a meta-model for efficient task structure inference and an informative meta-policy.
We show that MerPO offers guaranteed improvement over both the behavior policy and the meta-policy.
arXiv Detail & Related papers (2022-02-07T04:15:20Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - MPC-based Reinforcement Learning for Economic Problems with Application
to Battery Storage [0.0]
We focus on policy approximations based on Model Predictive Control (MPC).
We observe that the policy gradient method can struggle to produce meaningful steps in the policy parameters when the policy has a (nearly) bang-bang structure.
We propose a homotopy strategy based on the interior-point method, providing a relaxation of the policy during the learning.
arXiv Detail & Related papers (2021-04-06T10:37:14Z) - Improving Actor-Critic Reinforcement Learning via Hamiltonian Policy [11.34520632697191]
Approximating optimal policies is necessary in many real-world reinforcement learning (RL) scenarios.
In this work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in VI, we propose to integrate policy optimization with HMC.
We show that the proposed approach is a data-efficient and easy-to-implement improvement over previous policy optimization methods.
arXiv Detail & Related papers (2021-03-22T17:26:43Z) - COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z) - A Study of Policy Gradient on a Class of Exactly Solvable Models [35.90565839381652]
We explore the evolution of the policy parameters, for a special class of exactly solvable POMDPs, as a continuous-state Markov chain.
Our approach relies heavily on random walk theory, specifically on affine Weyl groups.
We analyze the probabilistic convergence of policy gradient to different local maxima of the value function.
arXiv Detail & Related papers (2020-11-03T17:27:53Z) - MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics (a brief sketch of this penalty follows the list below).
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
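The reward penalty mentioned in the MOPO entry above admits a one-line illustration. This is a minimal sketch, not the paper's code: the uncertainty estimate `u` and penalty weight `lam` are hypothetical inputs standing in for whatever dynamics-uncertainty quantifier and coefficient an implementation chooses.

```python
def penalized_reward(r, u, lam=1.0):
    # MOPO-style penalty (illustrative): subtract a scaled dynamics-uncertainty
    # estimate u(s, a) from the model-predicted reward before planning on it.
    return r - lam * u
```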
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.