Related papers: Convergence of policy gradient for entropy regularized MDPs with neural network approximation in the mean-field regime

Convergence of policy gradient for entropy regularized MDPs with neural network approximation in the mean-field regime

URL: http://arxiv.org/abs/2201.07296v1
Date: Tue, 18 Jan 2022 20:17:16 GMT
Title: Convergence of policy gradient for entropy regularized MDPs with neural network approximation in the mean-field regime
Authors: Bekzhan Kerimkulov and James-Michael Leahy and David \v{S}i\v{s}ka and Lukasz Szpruch
Abstract summary: We study the global convergence of policy gradient for infinite-horizon, continuous state and action space, entropy-regularized Markov decision processes (MDPs) Our results rely on the careful analysis of non-linear Fokker--Planck--Kolmogorov equation.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the global convergence of policy gradient for infinite-horizon, continuous state and action space, entropy-regularized Markov decision processes (MDPs). We consider a softmax policy with (one-hidden layer) neural network approximation in a mean-field regime. Additional entropic regularization in the associated mean-field probability measure is added, and the corresponding gradient flow is studied in the 2-Wasserstein metric. We show that the objective function is increasing along the gradient flow. Further, we prove that if the regularization in terms of the mean-field measure is sufficient, the gradient flow converges exponentially fast to the unique stationary solution, which is the unique maximizer of the regularized MDP objective. Lastly, we study the sensitivity of the value function along the gradient flow with respect to regularization parameters and the initial condition. Our results rely on the careful analysis of non-linear Fokker--Planck--Kolmogorov equation and extend the pioneering work of Mei et al. 2020 and Agarwal et al. 2020, which quantify the global convergence rate of policy gradient for entropy-regularized MDPs in the tabular setting.

Related papers

Ordering-based Conditions for Global Convergence of Policy Gradient Methods [73.6366483406033]
We prove that, for finite-arm bandits with linear function approximation, the global convergence of policy gradient (PG) methods depends on inter-related properties between the policy update and the representation. Overall, these observations call into question approximation error as an appropriate quantity for characterizing the global convergence of PG methods under linear function approximation.
arXiv Detail & Related papers (2025-04-02T21:06:28Z)
Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence. We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair. To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
arXiv Detail & Related papers (2024-08-19T14:11:04Z)
Entropy annealing for policy mirror descent in continuous time and space [2.8255028200738455]
We study the impact of entropy regularization on the convergence of policy methods for exit time control problems. This paper explains how entropy regularization improves policy optimization, even with the true gradient.
arXiv Detail & Related papers (2024-05-30T17:02:18Z)
A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks. We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks. Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(alpha-1)$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes [50.68789924454235]
We present the first finite time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs) Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $Oleft(frac1Tright),$ which translates to $Oleft(log(T)right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z)
A Fisher-Rao gradient flow for entropy-regularised Markov decision processes in Polish spaces [10.777806006475297]
We study the global convergence of a Fisher-Rao policy gradient flow for infinite-horizon entropy-regularised Markov decision processes with Polish state and action space. We establish the global well-posedness of the gradient flow and demonstrate its exponential convergence to the optimal policy.
arXiv Detail & Related papers (2023-10-04T16:41:36Z)
Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted Markov decision process (constrained MDP) We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy. To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift. Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures. We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
On the Convergence Rates of Policy Gradient Methods [9.74841674275568]
We consider geometrically discounted dominance problems with finite state sub spaces. We show that with direct gradient pararization in a sample we can analyze the general complexity of a gradient.
arXiv Detail & Related papers (2022-01-19T07:03:37Z)
Beyond Exact Gradients: Convergence of Stochastic Soft-Max Policy Gradient Methods with Entropy Regularization [20.651913793555163]
We revisit the classical entropy regularized policy gradient methods with the soft-max policy parametrization. We establish a global optimality convergence result and a sample complexity of $widetildemathcalO(frac1epsilon2)$ for the proposed algorithm.
arXiv Detail & Related papers (2021-10-19T17:21:09Z)
A Near-Optimal Gradient Flow for Learning Neural Energy-Based Models [93.24030378630175]
We propose a novel numerical scheme to optimize the gradient flows for learning energy-based models (EBMs) We derive a second-order Wasserstein gradient flow of the global relative entropy from Fokker-Planck equation. Compared with existing schemes, Wasserstein gradient flow is a smoother and near-optimal numerical scheme to approximate real data densities.
arXiv Detail & Related papers (2019-10-31T02:26:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.