On the Convergence of Policy Mirror Descent with Temporal Difference Evaluation
- URL: http://arxiv.org/abs/2509.18822v1
- Date: Tue, 23 Sep 2025 09:11:03 GMT
- Title: On the Convergence of Policy Mirror Descent with Temporal Difference Evaluation
- Authors: Jiacai Liu, Wenye Li, Ke Wei,
- Abstract summary: Policy mirror descent (PMD) is a general policy optimization framework in reinforcement learning.<n>We consider policy mirror descent with temporal difference evaluation (TD-PMD)
- Score: 5.185426731431962
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Policy mirror descent (PMD) is a general policy optimization framework in reinforcement learning, which can cover a wide range of typical policy optimization methods by specifying different mirror maps. Existing analysis of PMD requires exact or approximate evaluation (for example unbiased estimation via Monte Carlo simulation) of action values solely based on policy. In this paper, we consider policy mirror descent with temporal difference evaluation (TD-PMD). It is shown that, given the access to exact policy evaluations, the dimension-free $O(1/T)$ sublinear convergence still holds for TD-PMD with any constant step size and any initialization. In order to achieve this result, new monotonicity and shift invariance arguments have been developed. The dimension free $\gamma$-rate linear convergence of TD-PMD is also established provided the step size is selected adaptively. For the two common instances of TD-PMD (i.e., TD-PQA and TD-NPG), it is further shown that they enjoy the convergence in the policy domain. Additionally, we investigate TD-PMD in the inexact setting and give the sample complexity for it to achieve the last iterate $\varepsilon$-optimality under a generative model, which improves the last iterate sample complexity for PMD over the dependence on $1/(1-\gamma)$.
Related papers
- Policy Mirror Descent with Temporal Difference Learning: Sample Complexity under Online Markov Data [7.423079681233031]
This paper studies the policy mirror descent (PMD) method, which is a general policy optimization framework in reinforcement learning.<n>Two algorithms called Expected TD-PMD and Approximate TD-PMD have been presented, which are off-policy and mixed policy algorithms respectively.
arXiv Detail & Related papers (2025-12-30T07:57:03Z) - Closing the Approximation Gap of Partial AUC Optimization: A Tale of Two Formulations [121.39938773554523]
The Area Under the ROC Curve (AUC) is a pivotal evaluation metric in real-world scenarios with both class imbalance and decision constraints.<n>We present two simple instance-wise minimax reformulations to close the approximation gap of PAUC optimization.<n>The resulting algorithms enjoy a linear per-iteration computational complexity w.r.t. the sample size and a convergence rate of $O(-2/3)$ for typical one-way and two-way PAUCs.
arXiv Detail & Related papers (2025-12-01T02:52:33Z) - Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models [69.1820058966619]
We tackle average-reward infinite-horizon POMDPs with an unknown transition model.<n>We present a novel and simple estimator that overcomes this barrier.
arXiv Detail & Related papers (2025-01-30T22:29:41Z) - Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $mathcalO(sqrtT log(T)$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Policy Mirror Descent with Lookahead [0.46040036610482665]
Policy Mirror Descent (PMD) can be seen as a soft Policy It algorithm implementing regularized 1-step greedy policy improvement.
We propose a new class of PMD algorithms called $h$-PMD which incorporates multi-step greedy policy improvement.
We show that $h$-PMD enjoys a faster dimension-free $gammah$-linear convergence rate, contingent on the computation of multi-step greedy policies.
arXiv Detail & Related papers (2024-03-21T06:10:51Z) - On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes [50.68789924454235]
We present the first finite time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs)
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $Oleft(frac1Tright),$ which translates to $Oleft(log(T)right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted Markov decision process (constrained MDP)
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted
Markov Decision Processes [18.35462792871242]
Policy Mirror Descent is a family of algorithms that covers a wide range of novel and fundamental methods in reinforcement learning.
Motivated by the instability of policy iteration with inexact policy evaluation, PMD algorithmically regularises the policy improvement step of PI.
We show that the dimension-free $gamma$-rate of PI can be achieved by the general family of unregularised PMD algorithms under an adaptive step-size.
arXiv Detail & Related papers (2023-02-22T13:55:08Z) - First-order Policy Optimization for Robust Markov Decision Process [40.2022466644885]
We consider the problem of solving robust Markov decision process (MDP)
MDP involves a set of discounted, finite state, finite action space MDPs with uncertain transition kernels.
For $(mathbfs,mathbfa)$-rectangular uncertainty sets, we establish several structural observations on the robust objective.
arXiv Detail & Related papers (2022-09-21T18:10:28Z) - Homotopic Policy Mirror Descent: Policy Convergence, Implicit
Regularization, and Improved Sample Complexity [40.2022466644885]
homotopic policy mirror descent (HPMD) method for solving discounted, infinite horizon MDPs with finite state and action space.
We report three properties that seem to be new in the literature of policy gradient methods.
arXiv Detail & Related papers (2022-01-24T04:54:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.