Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs
- URL: http://arxiv.org/abs/2408.11513v1
- Date: Wed, 21 Aug 2024 10:44:57 GMT
- Title: Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs
- Authors: Washim Uddin Mondal, Vaneet Aggarwal,
- Abstract summary: Proposed Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal.
For a parameterized policy class with transferred compatibility approximation error, PDR-ANPG achieves a last-iterate $epsilon$ optimality gap.
It is a significant improvement of the state-of-the-art last-iterate guarantees of general parameterized CMDPs.
- Score: 35.22742439337603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of learning a Constrained Markov Decision Process (CMDP) via general parameterization. Our proposed Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal. For a parameterized policy class with transferred compatibility approximation error, $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation (up to some additive factor of $\epsilon_{\mathrm{bias}}$) with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. It is a significant improvement of the state-of-the-art last-iterate guarantees of general parameterized CMDPs.
Related papers
- Finding good policies in average-reward Markov Decision Processes without prior knowledge [19.89784209009327]
We revisit the identification of an $varepsilon$-optimal policy in average-reward Decision (MDP)
By relying on a diameter estimation procedure, we propose the first algorithm for $(varepsilon,delta)$-PAC-PAC policy identification.
arXiv Detail & Related papers (2024-05-27T12:24:14Z) - Near-Optimal Distributed Minimax Optimization under the Second-Order Similarity [22.615156512223763]
We propose variance- optimistic sliding (SVOGS) method, which takes the advantage of the finite-sum structure in the objective.
We prove $mathcal O(delta D2/varepsilon)$, communication complexity of $mathcal O(n+sqrtndelta D2/varepsilon)$, and local calls of $tildemathcal O(n+sqrtndelta+L)D2/varepsilon)$.
arXiv Detail & Related papers (2024-05-25T08:34:49Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Span-Based Optimal Sample Complexity for Average Reward MDPs [6.996002801232415]
We study the sample complexity of learning an $varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model.
We establish the complexity bound $widetildeOleft(SAfracH (1-gamma)2varepsilon2 right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space.
arXiv Detail & Related papers (2023-11-22T15:34:44Z) - ReSQueing Parallel and Private Stochastic Convex Optimization [59.53297063174519]
We introduce a new tool for BFG convex optimization (SCO): a Reweighted Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density.
We develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings.
arXiv Detail & Related papers (2023-01-01T18:51:29Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP)
We prove an upper bound of $widetilde O(H varepsilon-3 ln frac1delta)$ samples per state-action pair, where $H := sp(h*)$ is the span of bias of any optimal policy, $varepsilon$ is the accuracy and $delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $gamma in (0,1)$, and an approximation threshold $epsilon > 0$, we provide a model-free algorithm to learn an $epsilon$-optimal policy.
For small enough $epsilon$, we show an improved algorithm with sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$.
For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z) - Fixed-Support Wasserstein Barycenters: Computational Hardness and Fast
Algorithm [100.11971836788437]
We study the fixed-support Wasserstein barycenter problem (FS-WBP)
We develop a provably fast textitdeterministic variant of the celebrated iterative Bregman projection (IBP) algorithm, named textscFastIBP.
arXiv Detail & Related papers (2020-02-12T03:40:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.