Block Policy Mirror Descent
- URL: http://arxiv.org/abs/2201.05756v1
- Date: Sat, 15 Jan 2022 04:42:02 GMT
- Title: Block Policy Mirror Descent
- Authors: Guanghui Lan, Yan Li, Tuo Zhao
- Abstract summary: We present a new class of policy gradient (PG) methods, namely the block policy mirror descent (BPMD) methods.
BPMD is used to solve a class of regularized reinforcement learning (RL) problems with strongly convex regularizers.
This is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning.
- Score: 40.2022466644885
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we present a new class of policy gradient (PG) methods, namely
the block policy mirror descent (BPMD) methods for solving a class of
regularized reinforcement learning (RL) problems with (strongly) convex
regularizers. Compared to the traditional PG methods with batch update rule,
which visit and update the policy for every state, BPMD methods have cheap
per-iteration computation via a partial update rule that performs the policy
update on a sampled state. Despite the nonconvex nature of the problem and a
partial update rule, BPMD methods achieve fast linear convergence to the global
optimality. We further extend BPMD methods to the stochastic setting, by
utilizing stochastic first-order information constructed from samples. We
establish $\mathcal{O}(1/\epsilon)$ (resp. $\mathcal{O}(1/\epsilon^2)$) sample complexity for
the strongly convex (resp. non-strongly convex) regularizers, with different
procedures for constructing the stochastic first-order information, where
$\epsilon$ denotes the target accuracy. To the best of our knowledge, this is
the first time that block coordinate descent methods have been developed and
analyzed for policy optimization in reinforcement learning.
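To make the partial update rule concrete, the following is a minimal sketch (not the paper's exact algorithm) of one BPMD-style iteration for a tabular policy: a single state is sampled and only that state's action distribution is updated through a KL mirror-descent (exponentiated) step, which is what keeps the per-iteration cost cheap. The function name, the entropy-style regularizer weight `tau`, and the assumption that $Q^{\pi}(s,\cdot)$ is supplied by an external evaluation oracle are illustrative choices, not the paper's specification; BPMD additionally prescribes its own state-sampling distribution, stepsizes, and stochastic estimators.

```python
import numpy as np

def bpmd_partial_update(policy, q_values, state, eta, tau=0.0):
    """One BPMD-style partial update: only the sampled state's row changes.

    policy   : (S, A) array of per-state action probabilities.
    q_values : (A,) estimate of Q^pi(state, .) under the current policy,
               assumed to come from an external evaluation oracle.
    state    : index of the sampled state.
    eta      : stepsize; tau: weight of an (assumed) entropy regularizer.
    """
    new_policy = policy.copy()
    # Mirror descent under the KL divergence gives a multiplicative update of
    # the sampled row; all other rows are left untouched, which is the source
    # of the cheap per-iteration computation highlighted in the abstract.
    logits = np.log(policy[state] + 1e-12) + eta * np.asarray(q_values)
    logits /= (1.0 + eta * tau)      # shrinkage induced by the regularizer
    logits -= logits.max()           # numerical stability before exponentiating
    row = np.exp(logits)
    new_policy[state] = row / row.sum()
    return new_policy

# Illustrative usage with a stand-in Q oracle.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform initial policy
q = rng.standard_normal(n_actions)                    # stand-in for Q^pi(s, .)
s = int(rng.integers(n_states))                       # sampled state
pi = bpmd_partial_update(pi, q, s, eta=0.5, tau=0.1)
```

In the stochastic setting discussed above, the exact action values would be replaced by estimates constructed from sampled trajectories, which is where the stated $\mathcal{O}(1/\epsilon)$ and $\mathcal{O}(1/\epsilon^2)$ sample complexities come in.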
Related papers
- Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models [56.92178753201331]
We tackle average-reward infinite-horizon POMDPs with an unknown transition model.
We present a novel and simple estimator that overcomes this barrier.
arXiv Detail & Related papers (2025-01-30T22:29:41Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
- Stochastic Dimension-reduced Second-order Methods for Policy Optimization [11.19708535159457]
We propose several new second-order algorithms for policy optimization that require only gradient and Hessian-vector product evaluations in each iteration.
Specifically, we propose a dimension-reduced second-order method (DR-SOPO) which repeatedly solves a projected two-dimensional trust region subproblem.
We show that DR-SOPO obtains an $\mathcal{O}(\epsilon^{-3.5})$ complexity for reaching an approximate first-order stationary condition.
In addition, we present an enhanced algorithm (DVR-SOPO) which further improves the complexity.
arXiv Detail & Related papers (2023-01-28T12:09:58Z)
- First-order Policy Optimization for Robust Markov Decision Process [40.2022466644885]
We consider the problem of solving robust Markov decision processes (MDPs), which involve a set of discounted, finite-state, finite-action MDPs with uncertain transition kernels.
For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective.
arXiv Detail & Related papers (2022-09-21T18:10:28Z)
- Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity [40.2022466644885]
We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite-horizon MDPs with finite state and action spaces.
We report three properties that seem to be new in the literature of policy gradient methods.
arXiv Detail & Related papers (2022-01-24T04:54:58Z)
- Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best-known complexity $\tilde{\mathcal{O}}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z)
- Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z)
- On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method [38.34416337932712]
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE.
Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization methods.
arXiv Detail & Related papers (2021-02-17T07:06:19Z)