Bridging the Gap between Newton-Raphson Method and Regularized Policy
Iteration
- URL: http://arxiv.org/abs/2310.07211v1
- Date: Wed, 11 Oct 2023 05:55:20 GMT
- Title: Bridging the Gap between Newton-Raphson Method and Regularized Policy
Iteration
- Authors: Zeyang Li, Chuxiong Hu, Yunan Wang, Guojian Zhan, Jie Li, Shengbo Eben
Li
- Abstract summary: We show that regularized policy iteration is strictly equivalent to the standard Newton-Raphson method when the Bellman equation is smoothed with strongly convex functions.
We prove that regularized policy iteration has global linear convergence with rate $\gamma$ (the discount factor).
We also show that a modified version of regularized policy iteration, i.e., with finite-step policy evaluation, is equivalent to the inexact Newton method, in which the Newton iteration is solved with truncated iterations.
- Score: 13.166738075816493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Regularization is one of the most important techniques in reinforcement
learning algorithms. The well-known soft actor-critic algorithm is a special
case of regularized policy iteration where the regularizer is chosen as Shannon
entropy. Despite some empirical success of regularized policy iteration, its
theoretical underpinnings remain unclear. This paper proves that regularized
policy iteration is strictly equivalent to the standard Newton-Raphson method
when the Bellman equation is smoothed with strongly convex functions. This
equivalence lays the foundation for a unified analysis of both the global and
local convergence behaviors of regularized policy iteration. We prove that
regularized policy iteration converges globally at a linear rate of $\gamma$
(the discount factor). Furthermore, the algorithm converges quadratically once
it enters a local region around the optimal value. We also show that a modified
version of regularized policy iteration, i.e., with finite-step policy
evaluation, is equivalent to the inexact Newton method, in which the Newton
iteration is solved with truncated inner iterations. We prove that the
associated algorithm achieves an asymptotic linear convergence rate of
$\gamma^M$, where $M$ denotes the number of steps carried out in policy
evaluation. Our results take a solid step towards a better understanding of the
convergence properties of regularized policy iteration algorithms.
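The abstract's claims are easiest to picture in the Shannon-entropy case (the regularizer used by soft actor-critic), where the smoothed Bellman equation takes the log-sum-exp form $V(s) = \tau \log \sum_a \exp\big((R(s,a) + \gamma \sum_{s'} P(s'|s,a)\,V(s'))/\tau\big)$, and the paper's equivalence says that each evaluation-improvement cycle of regularized policy iteration is a Newton-Raphson step on the residual of this equation. The sketch below is a minimal illustration of that algorithm, not the authors' code: it runs entropy-regularized policy iteration on a small random tabular MDP and tracks the distance to the smoothed fixed point. The problem sizes, the temperature $\tau$, and the random MDP are assumptions made purely for illustration.

```python
# Minimal illustrative sketch (not the authors' code): entropy-regularized policy
# iteration on a small random tabular MDP. Problem sizes, the temperature tau, and
# the random MDP itself are assumptions made for illustration only.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, tau = 5, 3, 0.9, 0.1            # states, actions, discount, temperature

P = rng.random((nS, nA, nS))                   # transition kernel P[s, a, s']
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))                       # rewards in [0, 1]

def soft_policy_evaluation(pi, V_init=None, m_steps=None):
    """Entropy-regularized evaluation of pi.

    m_steps=None : exact evaluation via a linear solve (standard regularized PI).
    m_steps=M    : M truncated fixed-point sweeps, i.e. the finite-step variant
                   that the abstract relates to an inexact Newton method.
    """
    r_pi = (pi * R).sum(axis=1) - tau * (pi * np.log(pi)).sum(axis=1)  # reward + entropy bonus
    P_pi = np.einsum('sa,sap->sp', pi, P)                              # policy-induced kernel
    if m_steps is None:
        return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    V = np.zeros(nS) if V_init is None else V_init.copy()
    for _ in range(m_steps):
        V = r_pi + gamma * P_pi @ V
    return V

# Reference fixed point of the smoothed (log-sum-exp) Bellman equation, computed by
# soft value iteration; used only to measure how fast policy iteration approaches it.
V_star = np.zeros(nS)
for _ in range(2000):
    Q_star = R + gamma * np.einsum('sap,p->sa', P, V_star)
    m = Q_star.max(axis=1)
    V_star = m + tau * np.log(np.exp((Q_star - m[:, None]) / tau).sum(axis=1))

pi = np.full((nS, nA), 1.0 / nA)               # start from the uniform policy
V = np.zeros(nS)
for k in range(15):
    V = soft_policy_evaluation(pi, V_init=V)   # pass m_steps=5 for the inexact variant
    Q = R + gamma * np.einsum('sap,p->sa', P, V)
    logits = (Q - Q.max(axis=1, keepdims=True)) / tau
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)        # softmax improvement: pi proportional to exp(Q/tau)
    print(f"iter {k:2d}  ||V - V*||_inf = {np.max(np.abs(V - V_star)):.3e}")
```

With exact evaluation the printed error should collapse toward machine precision within a handful of iterations, consistent with the fast local (Newton-like) phase the paper analyzes; switching to `m_steps=M` gives the finite-step variant that the abstract relates to an inexact Newton method with asymptotic rate $\gamma^M$.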
Related papers
- The Reinforce Policy Gradient Algorithm Revisited [7.894349646617293]
We revisit the Reinforce policy gradient algorithm from the literature.
We propose a major enhancement to the basic algorithm.
We provide a proof of convergence for this new algorithm.
arXiv Detail & Related papers (2023-10-08T04:05:13Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To our knowledge, this is the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - On the Linear Convergence of Policy Gradient under Hadamard
Parameterization [4.182089296199263]
We study the convergence of deterministic policy gradient under the Hadamard parameterization.
We show that the error decreases at an $O(1/k)$ rate for all the iterations.
arXiv Detail & Related papers (2023-05-31T05:51:15Z) - A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning [9.628032156001073]
We propose two policy Newton algorithms that incorporate cubic regularization.
Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function.
In particular, the sample complexity of our algorithms to find an $\epsilon$-SOSP is $O(\epsilon^{-3.5})$, which is an improvement over the state-of-the-art sample complexity of $O(\epsilon^{-4.5})$.
arXiv Detail & Related papers (2023-04-21T13:43:06Z) - A New Policy Iteration Algorithm For Reinforcement Learning in Zero-Sum
Markov Games [10.805520579293747]
We show that a simple variant of naive policy iteration for games converges exponentially fast.
We also show that lookahead policies can be implemented efficiently in the function approximation setting of linear Markov games.
arXiv Detail & Related papers (2023-03-17T01:20:22Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Globally Convergent Policy Search over Dynamic Filters for Output
Estimation [64.90951294952094]
We introduce the first direct policy search algorithm that converges to the globally optimal $\textit{dynamic}$ filter.
We show that informativity overcomes the aforementioned degeneracy.
arXiv Detail & Related papers (2022-02-23T18:06:20Z) - Homotopic Policy Mirror Descent: Policy Convergence, Implicit
Regularization, and Improved Sample Complexity [40.2022466644885]
We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite-horizon MDPs with finite state and action spaces.
We report three properties that seem to be new in the literature of policy gradient methods.
arXiv Detail & Related papers (2022-01-24T04:54:58Z) - Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best known complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation (FA), results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$ error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z)