Related papers: Sample-Efficient Constrained Reinforcement Learning with General Parameterization

URL: http://arxiv.org/abs/2405.10624v3
Date: Thu, 31 Oct 2024 05:24:19 GMT
Title: Sample-Efficient Constrained Reinforcement Learning with General Parameterization
Authors: Washim Uddin Mondal, Vaneet Aggarwal
Abstract summary: We consider a constrained Markov Decision Problem (CMDP) where the goal of an agent is to maximize the expected discounted sum of rewards over an infinite horizon. We develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm that ensures an $\epsilon$ global optimality gap and $\epsilon$ constraint violation.
Score: 35.22742439337603
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We consider a constrained Markov Decision Problem (CMDP) where the goal of an agent is to maximize the expected discounted sum of rewards over an infinite horizon while ensuring that the expected discounted sum of costs exceeds a certain threshold. Building on the idea of momentum-based acceleration, we develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm that ensures an $\epsilon$ global optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}((1-\gamma)^{-7}\epsilon^{-2})$ sample complexity for general parameterized policies where $\gamma$ denotes the discount factor. This improves the state-of-the-art sample complexity in general parameterized CMDPs by a factor of $\mathcal{O}((1-\gamma)^{-1}\epsilon^{-2})$ and achieves the theoretical lower bound in $\epsilon^{-1}$.

Related papers

A Variance-Reduced Cubic-Regularized Newton for Policy Optimization [6.52142708235708]
Existing second-order methods often suffer from suboptimal sample complexity or unrealistic assumptions about importance sampling. To overcome these limitations, we propose VR-CR-PN, a variance-reduced cubic-regularized policy Newton algorithm. As an additional contribution, we introduce a novel horizon for the expected return function, allowing the algorithm to achieve a uniform sample complexity.
arXiv Detail & Related papers (2025-07-14T10:04:02Z)
Efficiently Solving Discounted MDPs with Predictions on Transition Matrices [6.199300239433395]
We study Discounted Markov Decision Processes (DMDPs) under a generative model. We propose a novel framework to investigate how a prediction on the transition matrix can enhance the sample efficiency in solving DMDPs.
arXiv Detail & Related papers (2025-02-21T09:59:46Z)
On the Convergence of Single-Timescale Actor-Critic [49.19842488693726]
We analyze the global convergence of the single-timescale actor-critic (AC) algorithm for infinite-horizon discounted Markov Decision Processes (MDPs) with finite state spaces. We demonstrate that the step sizes for both the actor and critic must decay as ( O(k-Pfrac12) ) with $k$, diverging from the conventional ( O(k-Pfrac12) ) rates commonly used in (non-optimal) Markov framework optimization.
arXiv Detail & Related papers (2024-10-11T14:46:29Z)
Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators. Key to our solution is a novel projection technique based on ideas from harmonic analysis. Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation [53.17668583030862]
We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation. We propose a novel algorithmic framework named Local-fitted Optimization with OPtimism (LOOP). We show that LOOP achieves a sublinear $\tilde{\mathcal{O}}(\mathrm{poly}(d, \mathrm{sp}(V^*)) \sqrt{T\beta})$ regret, where $d$ and $\beta$ correspond to AGEC and log-covering number of the hypothesis class respectively.
arXiv Detail & Related papers (2024-04-19T06:24:22Z)
Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes [41.61653528766776]
We propose the Accelerated Natural Policy Gradient (ANPG) algorithm that utilizes an accelerated gradient descent process to obtain the natural policy gradient. ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity. In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ and simultaneously matches their state-of
arXiv Detail & Related papers (2023-10-18T03:00:15Z)
Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space [17.366915676628867]
We consider the reinforcement learning problem with general utilities. Our algorithm achieves $\tilde{\mathcal{O}}(\epsilon^{-3})$ and $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexities.
arXiv Detail & Related papers (2023-06-02T18:16:35Z)
A Newton-CG based barrier-augmented Lagrangian method for general nonconvex conic optimization [53.044526424637866]
In this paper we consider finding an approximate second-order stationary point (SOSP) of a general nonconvex conic optimization problem that minimizes a twice differentiable function. In particular, we propose a Newton-CG based barrier-augmented Lagrangian method for finding an approximate SOSP.
arXiv Detail & Related papers (2023-01-10T20:43:29Z)
Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Conservative Natural Policy Gradient Primal-Dual Algorithm [42.83837408373223]
We consider the problem of constrained Markov decision process (CMDP) in continuous state-action spaces. We propose a novel Conservative Natural Policy Gradient Primal-Dual Algorithm (C-NPG-PD) to achieve zero constraint violation.
arXiv Detail & Related papers (2022-06-12T22:31:43Z)
Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite horizon, $\gamma$-discounted constrained Markov decision process (CMDP). Our objective is to return a policy that achieves large expected reward with a small constraint violation. We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
arXiv Detail & Related papers (2022-04-11T15:08:09Z)
Fast Global Convergence of Policy Optimization for Constrained MDPs [17.825031573375725]
We show that gradient-based methods can achieve an $\mathcal{O}(\log(T)/T)$ global convergence rate both for the optimality gap and the constraint violation. When Slater's condition is satisfied and known a priori, zero constraint violation can be further guaranteed for a sufficiently large $T$.
arXiv Detail & Related papers (2021-10-31T17:46:26Z)
Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process [56.55075925645864]
The problem of constrained Markov decision process (CMDP) is investigated, where an agent aims to maximize the expected accumulated discounted reward subject to multiple constraints. A new utilities-dual convex approach is proposed with novel integration of three ingredients: regularized policy, dual regularizer, and Nesterov's gradient descent dual. This is the first demonstration that nonconcave CMDP problems can attain the lower bound of $\mathcal{O}(1/\epsilon)$ for all complexity optimization subject to convex constraints.
arXiv Detail & Related papers (2021-10-20T02:57:21Z)
Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z)
Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP). We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation. We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.