Beyond Softmax and Entropy: Improving Convergence Guarantees of Policy Gradients by f-SoftArgmax Parameterization with Coupled Regularization
- URL: http://arxiv.org/abs/2601.12604v1
- Date: Sun, 18 Jan 2026 22:25:03 GMT
- Title: Beyond Softmax and Entropy: Improving Convergence Guarantees of Policy Gradients by f-SoftArgmax Parameterization with Coupled Regularization
- Authors: Safwan Labbi, Daniil Tiapkin, Paul Mangold, Eric Moulines
- Abstract summary: Policy gradient methods are highly sensitive to the choice of policy parameterization. We propose replacing the softmax with an alternative policy parameterization based on the generalized f-softargmax. We establish the first explicit non-asymptotic last-iterate convergence guarantees for finite MDPs without any form of preconditioning.
- Score: 26.419221337454008
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Policy gradient methods are known to be highly sensitive to the choice of policy parameterization. In particular, the widely used softmax parameterization can induce ill-conditioned optimization landscapes and lead to exponentially slow convergence. Although this can be mitigated by preconditioning, this solution is often computationally expensive. Instead, we propose replacing the softmax with an alternative family of policy parameterizations based on the generalized f-softargmax. We further advocate coupling this parameterization with a regularizer induced by the same f-divergence, which improves the optimization landscape and ensures that the resulting regularized objective satisfies a Polyak-Łojasiewicz inequality. Leveraging this structure, we establish the first explicit non-asymptotic last-iterate convergence guarantees for stochastic policy gradient methods for finite MDPs without any form of preconditioning. We also derive sample-complexity bounds for the unregularized problem and show that f-PG with Tsallis divergences achieves polynomial sample complexity, in contrast to the exponential complexity incurred by the standard softmax parameterization.
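The abstract does not spell out the f-softargmax map itself, but one well-known member of this family is the Tsallis-entropy (q = 2) regularized argmax, usually called sparsemax: the Euclidean projection of the logits onto the probability simplex. The sketch below is a minimal NumPy illustration, not the paper's construction; it contrasts the standard softmax, which always puts positive mass on every action, with this Tsallis-style softargmax, which can assign exactly zero probability to dominated actions. Function and variable names are assumptions made for illustration.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Standard softmax parameterization: pi(a) proportional to exp(theta_a);
    # every action keeps strictly positive probability.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def sparsemax(logits: np.ndarray) -> np.ndarray:
    # Tsallis-entropy (q = 2) regularized argmax, i.e. the Euclidean projection
    # of the logits onto the probability simplex. Shown only as one illustrative
    # member of the generalized f-softargmax family, not the paper's definition.
    z = np.sort(logits)[::-1]            # logits sorted in decreasing order
    cssv = np.cumsum(z)                  # cumulative sums of the sorted logits
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z > cssv           # actions that keep nonzero mass
    k_max = k[support][-1]
    tau = (cssv[k_max - 1] - 1) / k_max  # threshold that makes the output sum to 1
    return np.maximum(logits - tau, 0.0)

theta = np.array([2.0, 1.0, -3.0])
print(softmax(theta))    # dense: all three actions get positive probability
print(sparsemax(theta))  # sparse: the clearly dominated action gets exactly 0
```

The abstract's key structural claim is that pairing such a parameterization with the matching f-divergence regularizer makes the regularized objective satisfy a Polyak-Łojasiewicz-type inequality (the squared gradient norm lower-bounds the optimality gap up to a constant), which is the property that converts gradient-norm bounds into last-iterate value guarantees.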
Related papers
- Convergence of Policy Mirror Descent Beyond Compatible Function Approximation [66.4260157478436]
We develop a theory of PMD for general policy classes, where we only assume a weaker variational dominance condition, and obtain convergence to the best-in-class policy. Our analysis leverages a novel notion based on the local norm induced by the occupancy measure.
arXiv Detail & Related papers (2025-02-16T08:05:46Z) - Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods [0.40964539027092917]
Markov Decision Processes (MDPs) are a formal framework for modeling and solving sequential decision-making problems.
In practice all parameters are often trained simultaneously, ignoring the inherent structure suggested by dynamic programming.
This paper introduces a combination of dynamic programming and policy gradient called dynamic policy gradient, where the parameters are trained backwards in time.
arXiv Detail & Related papers (2023-10-04T09:21:01Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Inference on Optimal Dynamic Policies via Softmax Approximation [27.396891119011215]
We show that a simple soft-max approximation to the optimal treatment regime can achieve valid inference on the truly optimal regime.
Our work combines techniques from semi-parametric inference and $g$-estimation, together with an appropriate array central limit theorem.
arXiv Detail & Related papers (2023-03-08T07:42:47Z) - Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs [23.596546979904613]
We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for constrained Markov decision processes (constrained MDPs). We also provide finite-sample guarantees for two sample-based NPG-PD algorithms.
arXiv Detail & Related papers (2022-06-06T04:28:04Z) - Faster Algorithm and Sharper Analysis for Constrained Markov Decision
Process [56.55075925645864]
The problem of the constrained Markov decision process (CMDP) is investigated, where an agent aims to maximize the expected accumulated discounted reward subject to multiple constraints.
A new primal-dual approach is proposed with a novel integration of three ingredients: a regularized policy optimizer, a dual-variable regularizer, and Nesterov's accelerated gradient descent on the dual.
This is the first demonstration that nonconcave CMDP problems can attain the complexity lower bound of $\mathcal{O}(1/\epsilon)$ for convex optimization subject to convex constraints.
arXiv Detail & Related papers (2021-10-20T02:57:21Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge (a toy numerical illustration appears after this list).
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Convergence of adaptive algorithms for weakly convex constrained
optimization [59.36386973876765]
We prove the $\widetilde{\mathcal{O}}(t^{-1/4})$ rate of convergence for the norm of the gradient of the Moreau envelope.
Our analysis works with a mini-batch size of $1$, constant first- and second-order moment parameters, and possibly unbounded optimization domains.
arXiv Detail & Related papers (2020-06-11T17:43:19Z) - Proximal Gradient Algorithm with Momentum and Flexible Parameter Restart
for Nonconvex Optimization [73.38702974136102]
Various types of parameter restart schemes have been proposed for accelerated algorithms to facilitate their practical convergence.
In this paper, we propose a proximal gradient algorithm with momentum and flexible parameter restart for solving nonconvex, nonsmooth problems.
arXiv Detail & Related papers (2020-02-26T16:06:27Z)
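As a companion to the "Softmax Policy Gradient Methods Can Take Exponential Time to Converge" entry above, here is a toy two-armed bandit computation (an illustrative sketch, not taken from any of the listed papers). Under the softmax parameterization the exact policy gradient satisfies dJ/dθ_a = π(a)(r(a) - J), so once the policy has committed to the suboptimal arm, the gradient component pointing toward the optimal arm is exponentially small in the logit gap; this is the kind of flat region behind such slowdowns, and it is what the f-softargmax parameterization with coupled regularization aims to avoid.

```python
import numpy as np

def softmax(theta: np.ndarray) -> np.ndarray:
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()

def exact_policy_gradient(theta: np.ndarray, r: np.ndarray) -> np.ndarray:
    # For a bandit with J(theta) = sum_a pi(a) * r(a) under softmax,
    # the exact gradient is dJ/dtheta_a = pi(a) * (r(a) - J(theta)).
    pi = softmax(theta)
    J = pi @ r
    return pi * (r - J)

r = np.array([1.0, 0.0])                # arm 0 is optimal
for bad_logit in [0.0, 5.0, 10.0, 20.0]:
    theta = np.array([0.0, bad_logit])  # policy increasingly committed to the bad arm
    g = exact_policy_gradient(theta, r)
    print(f"logit gap {bad_logit:>4}: gradient toward optimal arm = {g[0]:.3e}")
```

Running this shows the useful gradient component shrinking roughly like exp(-gap), so plain softmax PG needs exponentially many steps to escape such a region.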
This list is automatically generated from the titles and abstracts of the papers on this site.