A Novel Framework for Policy Mirror Descent with General
Parameterization and Linear Convergence
- URL: http://arxiv.org/abs/2301.13139v4
- Date: Tue, 13 Feb 2024 17:18:16 GMT
- Title: A Novel Framework for Policy Mirror Descent with General
Parameterization and Linear Convergence
- Authors: Carlo Alfano, Rui Yuan, Patrick Rebeschini
- Abstract summary: We introduce a novel framework for policy optimization based on mirror descent.
We obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization.
- Score: 15.807079236265714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern policy optimization methods in reinforcement learning, such as TRPO
and PPO, owe their success to the use of parameterized policies. However, while
theoretical guarantees have been established for this class of algorithms,
especially in the tabular setting, the use of general parameterization schemes
remains mostly unjustified. In this work, we introduce a novel framework for
policy optimization based on mirror descent that naturally accommodates general
parameterizations. The policy class induced by our scheme recovers known
classes, e.g., softmax, and generates new ones depending on the choice of
mirror map. Using our framework, we obtain the first result that guarantees
linear convergence for a policy-gradient-based method involving general
parameterization. To demonstrate the ability of our framework to accommodate
general parameterization schemes, we provide its sample complexity when using
shallow neural networks, show that it represents an improvement upon the
previous best results, and empirically validate the effectiveness of our
theoretical claims on classic control tasks.
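As a hedged illustration of the kind of update such a framework builds on (a sketch based on the standard policy mirror descent template, not text taken from the paper), for a step size \eta and a mirror map h with Bregman divergence D_h, the exact policy mirror descent update at state s reads

  \pi_{t+1}(\cdot \mid s) \in \arg\max_{p \in \Delta(\mathcal{A})} \Big\{ \eta \, \langle Q^{\pi_t}(s, \cdot), p \rangle - D_h\big(p, \pi_t(\cdot \mid s)\big) \Big\},

where Q^{\pi_t} denotes the action-value function of the current policy and \Delta(\mathcal{A}) the probability simplex over actions. The choice of mirror map determines the induced policy class; for instance, the negative entropy recovers the softmax class mentioned above.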
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, such (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addresses these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable.
Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z) - On Optimal Regularization Parameters via Bilevel Learning [0.06213771671016098]
We provide a new condition that better characterizes positivity of optimal regularization parameters than the existing theory.
Numerical results verify and explore this new condition for both small and high-dimensional problems.
arXiv Detail & Related papers (2023-05-28T12:34:07Z) - A Parametric Class of Approximate Gradient Updates for Policy
Optimization [47.69337420768319]
We develop a unified perspective that re-expresses the underlying updates in terms of a limited choice of gradient form and scaling function.
We obtain novel yet well motivated updates that generalize existing algorithms in a way that can deliver benefits both in terms of convergence speed and final result quality.
arXiv Detail & Related papers (2022-06-17T01:28:38Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution (the KL special case of the mirror descent update is sketched after this list).
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Near Optimal Policy Optimization via REPS [33.992374484681704]
Relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains.
There exist no guarantees on REPS's performance when using gradient-based solvers.
We introduce a technique that uses generative access to the underlying decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
arXiv Detail & Related papers (2021-03-17T16:22:59Z) - Structured Policy Iteration for Linear Quadratic Regulator [40.52288246664592]
We introduce Structured Policy Iteration (S-PI) for LQR, a method capable of deriving a structured linear policy.
Such a structured policy with (block) sparsity or low-rank can have significant advantages over the standard LQR policy.
In both the known-model and model-free settings, we prove convergence under a proper choice of parameters.
arXiv Detail & Related papers (2020-07-13T06:03:15Z) - Neural Proximal/Trust Region Policy Optimization Attains Globally
Optimal Policy [119.12515258771302]
We show that a variant of PPO equipped with over-parameterized neural networks converges to the globally optimal policy.
The key to our analysis is the global convergence of infinite-dimensional mirror descent under a notion of one-point monotonicity, where the gradient and iterate are instantiated by neural networks.
arXiv Detail & Related papers (2019-06-25T03:20:04Z)
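As a further hedged sketch (not taken from any abstract above), when the mirror map is the negative entropy, so that the Bregman divergence D_h is the KL divergence, the setting in which REPS-style and natural-policy-gradient-style updates operate, the unregularized update above admits the closed-form multiplicative expression

  \pi_{t+1}(a \mid s) \propto \pi_t(a \mid s) \, \exp\big(\eta \, Q^{\pi_t}(s, a)\big),

which is exactly the softmax policy class referenced in the main abstract; regularized variants such as GPMD modify this expression through the chosen regularizer.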
This list is automatically generated from the titles and abstracts of the papers in this site.