Beyond Exact Gradients: Convergence of Stochastic Soft-Max Policy Gradient Methods with Entropy Regularization
- URL: http://arxiv.org/abs/2110.10117v3
- Date: Sat, 13 Jul 2024 20:15:46 GMT
- Title: Beyond Exact Gradients: Convergence of Stochastic Soft-Max Policy Gradient Methods with Entropy Regularization
- Authors: Yuhao Ding, Junzi Zhang, Hyunin Lee, Javad Lavaei
- Abstract summary: We revisit the classical entropy regularized policy gradient methods with the soft-max policy parametrization.
We establish a global optimality convergence result and a sample complexity of $\widetilde{\mathcal{O}}(\frac{1}{\epsilon^2})$ for the proposed algorithm.
- Score: 20.651913793555163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Entropy regularization is an efficient technique for encouraging exploration and preventing premature convergence of (vanilla) policy gradient methods in reinforcement learning (RL). However, the theoretical understanding of entropy-regularized RL algorithms has been limited. In this paper, we revisit the classical entropy regularized policy gradient methods with the soft-max policy parametrization, whose convergence has so far only been established assuming access to exact gradient oracles. To go beyond this scenario, we propose the first set of (nearly) unbiased stochastic policy gradient estimators with trajectory-level entropy regularization, one being an unbiased visitation measure-based estimator and the other being a nearly unbiased yet more practical trajectory-based estimator. We prove that although the estimators themselves are unbounded in general due to the additional logarithmic policy rewards introduced by the entropy term, the variances are uniformly bounded. We then propose a two-phase stochastic policy gradient (PG) algorithm that uses a large batch size in the first phase to overcome the challenge of the stochastic approximation due to the non-coercive landscape, and uses a small batch size in the second phase by leveraging the curvature information around the optimal policy. We establish a global optimality convergence result and a sample complexity of $\widetilde{\mathcal{O}}(\frac{1}{\epsilon^2})$ for the proposed algorithm. These are the first global convergence and sample complexity results for the stochastic entropy-regularized vanilla PG method.
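To make the trajectory-based estimator concrete, below is a minimal Python sketch of a REINFORCE-style gradient estimate for the entropy-regularized objective under a tabular soft-max policy, followed by a two-phase loop that switches from a large to a small batch size. The MDP representation (P, R, rho), the truncation horizon H, and all step-size and batch-size choices are illustrative assumptions, not the paper's exact estimator or schedule.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(.|s) for a tabular soft-max parametrization theta[S, A]."""
    z = theta[s] - theta[s].max()      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def sample_trajectory(theta, P, R, rho, H, rng):
    """Roll out one truncated trajectory of length H from a tabular MDP with
    transitions P[s, a, s'], rewards R[s, a] and initial distribution rho."""
    s = rng.choice(len(rho), p=rho)
    traj = []
    for _ in range(H):
        pi = softmax_policy(theta, s)
        a = rng.choice(len(pi), p=pi)
        traj.append((s, a, R[s, a], pi[a]))
        s = rng.choice(P.shape[2], p=P[s, a])
    return traj

def entropy_reg_pg_estimate(theta, traj, gamma, tau):
    """REINFORCE-style estimate of the gradient of the entropy-regularized
    return E[sum_t gamma^t (r_t - tau * log pi(a_t|s_t))], truncated at
    horizon H (hence only 'nearly' unbiased)."""
    grad = np.zeros_like(theta)
    # discounted reward-to-go of the entropy-regularized rewards
    reg = [r - tau * np.log(pa) for (_, _, r, pa) in traj]
    togo = np.zeros(len(traj))
    running = 0.0
    for t in reversed(range(len(traj))):
        running = reg[t] + gamma * running
        togo[t] = running
    for t, (s, a, _, _) in enumerate(traj):
        pi = softmax_policy(theta, s)
        score = -pi                    # d log pi(a|s) / d theta[s, :]
        score[a] += 1.0
        grad[s] += (gamma ** t) * togo[t] * score
    return grad

def two_phase_pg(theta0, P, R, rho, gamma, tau, eta,
                 iters1, batch1, iters2, batch2, H, seed=0):
    """Two-phase stochastic PG: a large batch in phase 1, a small batch in phase 2."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for iters, batch in ((iters1, batch1), (iters2, batch2)):
        for _ in range(iters):
            g = np.zeros_like(theta)
            for _ in range(batch):
                traj = sample_trajectory(theta, P, R, rho, H, rng)
                g += entropy_reg_pg_estimate(theta, traj, gamma, tau)
            theta = theta + eta * g / batch   # ascent on the regularized return
    return theta
```

The score term uses the standard soft-max identity d log pi(a|s) / d theta[s, a'] = 1{a'=a} - pi(a'|s); the entropy bonus enters simply as the per-step penalty -tau * log pi(a_t|s_t) folded into the reward-to-go.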
Related papers
- Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
arXiv Detail & Related papers (2024-08-19T14:11:04Z) - Full error analysis of policy gradient learning algorithms for exploratory linear quadratic mean-field control problem in continuous time with common noise [0.0]
We study policy gradient (PG) learning and first demonstrate convergence in a model-based setting.
We prove the global linear convergence and sample complexity of the PG algorithm with two-point gradient estimates in a model-free setting (a generic two-point estimator is sketched after this list).
In this setting, the parameterized optimal policies are learned from samples of the states and population distribution.
arXiv Detail & Related papers (2024-08-05T14:11:51Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Linear Convergence of Natural Policy Gradient Methods with Log-Linear
Policies [115.86431674214282]
We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class.
We show that both methods attain linear convergence rates and $\mathcal{O}(1/\epsilon^2)$ sample complexities using a simple, non-adaptive geometrically increasing step size.
arXiv Detail & Related papers (2022-10-04T06:17:52Z) - Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs [21.347689976296834]
We employ the natural policy gradient method to solve the discounted optimal control problem for constrained MDPs.
We also provide convergence and finite-sample guarantees for two sample-based NPG-PD algorithms.
arXiv Detail & Related papers (2022-06-06T04:28:04Z) - On the Convergence Rates of Policy Gradient Methods [9.74841674275568]
We consider geometrically discounted Markov decision problems with finite state and action spaces.
We show that, with direct policy parametrization, the convergence rates and sample complexity of policy gradient methods can be analyzed.
arXiv Detail & Related papers (2022-01-19T07:03:37Z) - Convergence of policy gradient for entropy regularized MDPs with neural
network approximation in the mean-field regime [0.0]
We study the global convergence of policy gradient for infinite-horizon, continuous state and action space, entropy-regularized Markov decision processes (MDPs)
Our results rely on a careful analysis of the non-linear Fokker-Planck-Kolmogorov equation.
arXiv Detail & Related papers (2022-01-18T20:17:16Z) - Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector
Problems [98.34292831923335]
Motivated by the problem of online correlation analysis, we propose the Stochastic Scaled-Gradient Descent (SSD) algorithm.
We bring these ideas together in an application to online correlation analysis, deriving for the first time an optimal one-time-scale algorithm with an explicit rate of local convergence to normality.
arXiv Detail & Related papers (2021-12-29T18:46:52Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, along with variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [51.31435087414348]
It is essential to theoretically guarantee that algorithms provide a small objective residual with high probability.
Existing methods for non-smooth stochastic convex optimization have complexity bounds that depend unfavorably on the confidence level.
We propose novel stepsize rules for two methods with gradient clipping (a generic clipped-gradient step is sketched after this list).
arXiv Detail & Related papers (2021-06-10T17:54:21Z)
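The mean-field LQ entry above mentions two-point gradient estimates in a model-free setting. As a generic illustration (not that paper's construction), a single two-point zeroth-order estimate for a smooth scalar objective J at a parameter vector theta can be written as follows; the unit-sphere direction and the smoothing radius delta are assumptions of this sketch.

```python
import numpy as np

def two_point_gradient(J, theta, delta, rng):
    """Two-point (zeroth-order) gradient estimate of a scalar objective J at theta.

    A random unit direction u is sampled and the directional finite difference
    (J(theta + delta*u) - J(theta - delta*u)) / (2*delta) is scaled back to a
    full gradient estimate; theta is assumed to be a 1-D numpy array.
    """
    d = theta.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    diff = (J(theta + delta * u) - J(theta - delta * u)) / (2.0 * delta)
    return d * diff * u
```

For small delta, this estimates the gradient of a smoothed version of J using only two function evaluations per iteration, which is what makes such estimators attractive in model-free settings.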
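Similarly, the heavy-tailed-noise entry proposes stepsize rules for methods with gradient clipping. The sketch below shows only the generic clipped-gradient step such rules build on, with placeholder stepsize and clipping level rather than the rules proposed in that paper.

```python
import numpy as np

def clipped_sgd_step(theta, grad, stepsize, clip_level):
    """One SGD step with gradient clipping: rescale the stochastic gradient so
    its Euclidean norm never exceeds clip_level, a standard way to control
    heavy-tailed gradient noise. stepsize and clip_level are placeholders."""
    norm = np.linalg.norm(grad)
    if norm > clip_level:
        grad = grad * (clip_level / norm)
    return theta - stepsize * grad
```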