Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning
- URL: http://arxiv.org/abs/2510.08526v1
- Date: Thu, 09 Oct 2025 17:50:07 GMT
- Title: Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning
- Authors: Yash Jhaveri, Harley Wiltzer, Patrick Shafto, Marc G. Bellemare, David Meger,
- Abstract summary: We present a theoretical framework for policy optimization that guarantees convergence to a particular optimal policy.<n>Our approach realizes an interpretable, diversity-preserving optimal policy as the regularization temperature vanishes.<n>We present an algorithm that estimates, to arbitrary accuracy, the return distribution associated to its interpretable, diversity-preserving optimal policy.
- Score: 28.409877186788744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the pursuit of finding an optimal policy, reinforcement learning (RL) methods generally ignore the properties of learned policies apart from their expected return. Thus, even when successful, it is difficult to characterize which policies will be learned and what they will do. In this work, we present a theoretical framework for policy optimization that guarantees convergence to a particular optimal policy, via vanishing entropy regularization and a temperature decoupling gambit. Our approach realizes an interpretable, diversity-preserving optimal policy as the regularization temperature vanishes and ensures the convergence of policy derived objects--value functions and return distributions. In a particular instance of our method, for example, the realized policy samples all optimal actions uniformly. Leveraging our temperature decoupling gambit, we present an algorithm that estimates, to arbitrary accuracy, the return distribution associated to its interpretable, diversity-preserving optimal policy.
Related papers
- Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning [66.4260157478436]
We study reinforcement learning in the policy learning setting.<n>The goal is to find a policy whose performance is competitive with the best policy in a given class of interest.
arXiv Detail & Related papers (2025-07-06T14:40:05Z) - Convergence of Policy Mirror Descent Beyond Compatible Function Approximation [66.4260157478436]
We develop theoretical PMD general policy classes where we strictly assume a weaker variational dominance and obtain convergence to the best-in-class policy.<n>Our main notion leverages a novel notion induced by the local norm induced by the occupancy- gradient measure.
arXiv Detail & Related papers (2025-02-16T08:05:46Z) - Strongly-polynomial time and validation analysis of policy gradient methods [3.722665817361884]
This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDP) and reinforcement learning (RL)<n>By incorporating this advantage gap function into the design of step size rules, we deriving a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy.<n>This is the first time that such strong convergence properties have been established for policy gradient methods.
arXiv Detail & Related papers (2024-09-28T18:56:48Z) - Policy Gradient Algorithms Implicitly Optimize by Continuation [7.351769270728942]
We argue that exploration in policy-gradient algorithms consists in a continuation of the return of the policy at hand, and that policies should be history-dependent rather than to maximize the return.
arXiv Detail & Related papers (2023-05-11T14:50:20Z) - Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees [8.610425739792284]
We revisit the domain of off-policy policy optimization in RL.
One commonly-used approach is to leverage the off-policy policy gradient to optimize a surrogate objective.
This approach has been shown to suffer from the distribution mismatch issue.
arXiv Detail & Related papers (2022-12-10T07:47:04Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - On the Convergence and Sample Efficiency of Variance-Reduced Policy
Gradient Method [38.34416337932712]
Policy gives rise to a rich class of reinforcement learning (RL) methods, for example the REINFORCE.
Yet the best known sample complexity result for such methods to find an $epsilon$-optimal policy is $mathcalO(epsilon-3)$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization method.
arXiv Detail & Related papers (2021-02-17T07:06:19Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Fast Global Convergence of Natural Policy Gradient Methods with Entropy
Regularization [44.24881971917951]
Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms.
We develop convergence guarantees for entropy-regularized NPG methods under softmax parameterization.
Our results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
arXiv Detail & Related papers (2020-07-13T17:58:41Z) - Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL)
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.