Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence
- URL: http://arxiv.org/abs/2105.11066v1
- Date: Mon, 24 May 2021 02:21:34 GMT
- Title: Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence
- Authors: Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D. Lee,
Yuejie Chi
- Abstract summary: This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
- Score: 60.20076757208645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy optimization, which learns the policy of interest by maximizing the
value function via large-scale optimization techniques, lies at the heart of
modern reinforcement learning (RL). In addition to value maximization, other
practical considerations commonly arise as well, including the need to
encourage exploration and to ensure certain structural properties of
the learned policy due to safety, resource, and operational constraints. These
considerations can often be accounted for by resorting to regularized RL, which
augments the target value function with a structure-promoting regularization
term.
Focusing on an infinite-horizon discounted Markov decision process, this
paper proposes a generalized policy mirror descent (GPMD) algorithm for solving
regularized RL. As a generalization of policy mirror descent (Lan, 2021), the
proposed algorithm accommodates a general class of convex regularizers as well
as a broad family of Bregman divergences chosen in accordance with the regularizer in use.
We demonstrate that our algorithm converges linearly over an entire range of
learning rates, in a dimension-free fashion, to the global solution, even when
the regularizer lacks strong convexity and smoothness. In addition, this linear
convergence feature is provably stable in the face of inexact policy evaluation
and imperfect policy updates. Numerical experiments are provided to corroborate
the applicability and appealing performance of GPMD.
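For reference (the notation here is illustrative and may differ from the paper's), regularized RL in the infinite-horizon discounted setting augments the value function with a per-state convex regularizer h_s weighted by a parameter tau > 0, and a GPMD-style update can be written as

    V_\tau^\pi(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \Big( r(s_t, a_t) - \tau\, h_{s_t}\big(\pi(\cdot \mid s_t)\big) \Big) \,\middle|\, s_0 = s \right],

    \pi^{(k+1)}(\cdot \mid s) \in \arg\min_{p \in \Delta(\mathcal{A})} \Big\{ \big\langle -Q_\tau^{(k)}(s, \cdot),\, p \big\rangle + \tau\, h_s(p) + \tfrac{1}{\eta}\, D_{h_s}\big(p,\, \pi^{(k)}(\cdot \mid s)\big) \Big\},

where D_{h_s} is a (generalized) Bregman divergence induced by the regularizer itself, which is what keeps the update well defined even when h_s is neither strongly convex nor smooth.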
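As a concrete illustration, the following is a minimal sketch (not the authors' code) of the entropy-regularized special case on a small tabular MDP with exact policy evaluation, where h_s is the negative entropy and D_{h_s} reduces to the KL divergence; all names and constants (P, r, gamma, tau, eta, the iteration count) are illustrative assumptions.

    import numpy as np

    def soft_policy_evaluation(pi, P, r, gamma, tau):
        # Exact entropy-regularized policy evaluation on a tabular MDP.
        # pi: (S, A) policy, P: (S, A, S) transition kernel, r: (S, A) rewards.
        # Solves V = r_pi + gamma * P_pi @ V, with the entropy bonus folded into r_pi.
        S, A = r.shape
        r_pi = np.sum(pi * (r - tau * np.log(pi + 1e-12)), axis=1)   # (S,)
        P_pi = np.einsum("sa,sat->st", pi, P)                        # (S, S)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)          # (S,)
        Q = r + gamma * np.einsum("sat,t->sa", P, V)                 # (S, A)
        return V, Q

    def entropy_gpmd_step(pi, Q, eta, tau):
        # One mirror-descent step with negative-entropy regularizer and KL divergence:
        # pi_new(a|s) is proportional to pi(a|s)^(1/(1+eta*tau)) * exp(eta*Q(s,a)/(1+eta*tau)).
        logits = (np.log(pi + 1e-12) + eta * Q) / (1.0 + eta * tau)
        logits -= logits.max(axis=1, keepdims=True)                  # numerical stability
        pi_new = np.exp(logits)
        return pi_new / pi_new.sum(axis=1, keepdims=True)

    # Illustrative usage on a random MDP.
    rng = np.random.default_rng(0)
    S, A, gamma, tau, eta = 5, 3, 0.9, 0.1, 1.0
    P = rng.dirichlet(np.ones(S), size=(S, A))   # (S, A, S) transition probabilities
    r = rng.uniform(size=(S, A))
    pi = np.full((S, A), 1.0 / A)                # start from the uniform policy
    for _ in range(50):
        _, Q = soft_policy_evaluation(pi, P, r, gamma, tau)
        pi = entropy_gpmd_step(pi, Q, eta, tau)

In this entropy/KL special case the update coincides with the entropy-regularized natural policy gradient update studied in the last related paper below; swapping in other convex regularizers h_s and their induced Bregman divergences recovers the general GPMD template.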
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence [15.807079236265714]
We introduce a novel framework for policy optimization based on mirror descent.
We obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization.
arXiv Detail & Related papers (2023-01-30T18:21:48Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Policy Optimization over General State and Action Spaces [3.722665817361884]
Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging.
We first present a substantial generalization of the recently developed policy mirror descent method to deal with general state and action spaces.
We introduce new approaches to incorporate function approximation into this method, so that we do not need to use explicit policy parameterization at all.
arXiv Detail & Related papers (2022-11-30T03:44:44Z)
- Policy Gradient for Reinforcement Learning with General Utilities [50.65940899590487]
In Reinforcement Learning (RL), the goal of agents is to discover an optimal policy that maximizes the expected cumulative rewards.
Many supervised and unsupervised RL problems are not covered in the Linear RL framework.
We derive the policy gradient theorem for RL with general utilities.
arXiv Detail & Related papers (2022-10-03T14:57:46Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with guarantees of convergence to globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
- Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization [44.24881971917951]
Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms.
We develop convergence guarantees for entropy-regularized NPG methods under softmax parameterization.
Our results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
arXiv Detail & Related papers (2020-07-13T17:58:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.