Generalised Policy Improvement with Geometric Policy Composition
- URL: http://arxiv.org/abs/2206.08736v1
- Date: Fri, 17 Jun 2022 12:52:13 GMT
- Title: Generalised Policy Improvement with Geometric Policy Composition
- Authors: Shantanu Thakoor, Mark Rowland, Diana Borsa, Will Dabney, Rémi Munos, André Barreto
- Abstract summary: We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL.
We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs.
We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors.
- Score: 18.80807234471197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a method for policy improvement that interpolates between the
greedy approach of value-based reinforcement learning (RL) and the full
planning approach typical of model-based RL. The new method builds on the
concept of a geometric horizon model (GHM, also known as a gamma-model), which
models the discounted state-visitation distribution of a given policy. We show
that we can evaluate any non-Markov policy that switches between a set of base
Markov policies with fixed probability by a careful composition of the base
policy GHMs, without any additional learning. We can then apply generalised
policy improvement (GPI) to collections of such non-Markov policies to obtain a
new Markov policy that will in general outperform its precursors. We provide a
thorough theoretical analysis of this approach, develop applications to
transfer and standard RL, and empirically demonstrate its effectiveness over
standard GPI on a challenging deep RL continuous control task. We also provide
an analysis of GHM training methods, proving a novel convergence result
regarding previously proposed methods and showing how to train these models
stably in deep RL settings.
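To make the evaluation-by-composition idea concrete, the sketch below gives a minimal Monte Carlo illustration in Python. It is not the paper's implementation: the helpers `ghm_sample(pi, s, rng)` (draws a state from the GHM / gamma-model of base policy `pi` started at `s`), `reward(s)` (state-based reward), and `step_model(s, a)` (one-step model) are assumed interfaces, and the per-segment geometric weights are simplified placeholders, whereas the paper derives the exact composition and mixture weights.

```python
import numpy as np

def switching_value(s, policy_seq, ghm_sample, reward, gamma, n_samples=64, rng=None):
    """Monte Carlo estimate of the value of the non-Markov policy that follows
    policy_seq[0], then policy_seq[1], ..., switching at geometrically
    distributed times, using only composed GHM samples (no extra learning).
    The per-segment weights below are illustrative placeholders."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(n_samples):
        x = s
        for k, pi in enumerate(policy_seq):
            # Compose GHMs: draw from the discounted visitation distribution
            # of the k-th base policy, started from the previous sample.
            x = ghm_sample(pi, x, rng)
            total += (gamma ** k) * reward(x)
    return total / ((1.0 - gamma) * n_samples)

def gpi_action(s, actions, candidate_seqs, step_model, ghm_sample, reward, gamma, rng=None):
    """One generalised policy improvement (GPI) step: act greedily with respect
    to the best value estimate across the candidate (possibly non-Markov) policies."""
    def q(a, seq):
        r, s_next = step_model(s, a)  # assumed one-step model: (reward, next state)
        return r + gamma * switching_value(s_next, seq, ghm_sample, reward, gamma, rng=rng)
    return max(actions, key=lambda a: max(q(a, seq) for seq in candidate_seqs))
```

Acting greedily with respect to the maximum over the evaluated switching policies is the GPI step; when each candidate sequence contains a single base policy, this reduces to standard GPI over the base policies.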
Related papers
- Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action [10.219627570276689]
We develop a framework for a class of Markov Decision Processes with general state and action spaces.
We show that gradient methods converge to the globally optimal policy with nonasymptotic guarantees.
Our result establishes the first complexity result for multi-period inventory systems.
arXiv Detail & Related papers (2024-09-25T17:56:02Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Theoretically Guaranteed Policy Improvement Distilled from Model-Based
Planning [64.10794426777493]
Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks.
Recent practices tend to distill optimized action sequences into an RL policy during the training phase.
We develop an approach to distill from model-based planning to the policy.
arXiv Detail & Related papers (2023-07-24T16:52:31Z) - Model-based Offline Reinforcement Learning with Local Misspecification [35.75701143290119]
We present a model-based offline reinforcement learning policy performance lower bound that explicitly captures dynamics model misspecification and distribution mismatch.
We propose an empirical algorithm for optimal offline policy selection.
arXiv Detail & Related papers (2023-01-26T21:26:56Z) - Model-Based Offline Meta-Reinforcement Learning with Regularization [63.35040401948943]
Offline meta-RL is emerging as a promising approach to address these challenges.
MerPO learns a meta-model for efficient task structure inference and an informative meta-policy.
We show that MerPO offers guaranteed improvement over both the behavior policy and the meta-policy.
arXiv Detail & Related papers (2022-02-07T04:15:20Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - MPC-based Reinforcement Learning for Economic Problems with Application
to Battery Storage [0.0]
We focus on policy approximations based on Model Predictive Control (MPC).
We observe that the policy gradient method can struggle to produce meaningful steps in the policy parameters when the policy has a (nearly) bang-bang structure.
We propose a homotopy strategy based on the interior-point method, providing a relaxation of the policy during the learning.
arXiv Detail & Related papers (2021-04-06T10:37:14Z) - Improving Actor-Critic Reinforcement Learning via Hamiltonian Policy [11.34520632697191]
Approximating optimal policies is necessary in many real-world reinforcement learning (RL) scenarios.
In this work, inspired by the previous use of Hamiltonian Monte Carlo (HMC) in VI, we propose to integrate policy optimization with HMC.
We show that the proposed approach is a data-efficient and easy-to-implement improvement over previous policy optimization methods.
arXiv Detail & Related papers (2021-03-22T17:26:43Z) - COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z) - A Study of Policy Gradient on a Class of Exactly Solvable Models [35.90565839381652]
We explore the evolution of the policy parameters, for a special class of exactly solvable POMDPs, as a continuous-state Markov chain.
Our approach relies heavily on random walk theory, specifically on affine Weyl groups.
We analyze the probabilistic convergence of policy gradient to different local maxima of the value function.
arXiv Detail & Related papers (2020-11-03T17:27:53Z) - MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics (a brief sketch of this penalty follows the list below).
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
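The reward penalty mentioned in the MOPO entry above admits a one-line illustration. This is a minimal sketch, not the paper's code: the uncertainty estimate `u` and penalty weight `lam` are hypothetical inputs standing in for whatever dynamics-uncertainty quantifier and coefficient an implementation chooses.

```python
def penalized_reward(r, u, lam=1.0):
    # MOPO-style penalty (illustrative): subtract a scaled dynamics-uncertainty
    # estimate u(s, a) from the model-predicted reward before planning on it.
    return r - lam * u
```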
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.