Towards Painless Policy Optimization for Constrained MDPs
- URL: http://arxiv.org/abs/2204.05176v1
- Date: Mon, 11 Apr 2022 15:08:09 GMT
- Title: Towards Painless Policy Optimization for Constrained MDPs
- Authors: Arushi Jain, Sharan Vaswani, Reza Babanezhad, Csaba Szepesvari, Doina
Precup
- Abstract summary: We study policy optimization in an infinite horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
- Score: 46.12526917024248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study policy optimization in an infinite horizon, $\gamma$-discounted
constrained Markov decision process (CMDP). Our objective is to return a policy
that achieves large expected reward with a small constraint violation. We
consider the online setting with linear function approximation and assume
global access to the corresponding features. We propose a generic primal-dual
framework that allows us to bound the reward sub-optimality and constraint
violation for arbitrary algorithms in terms of their primal and dual regret on
online linear optimization problems. We instantiate this framework to use
coin-betting algorithms and propose the Coin Betting Politex (CBP) algorithm.
Assuming that the action-value functions are $\varepsilon_b$-close to the span
of the $d$-dimensional state-action features and no sampling errors, we prove
that $T$ iterations of CBP result in an $O\left(\frac{1}{(1 - \gamma)^3
\sqrt{T}} + \frac{\varepsilon_b\sqrt{d}}{(1 - \gamma)^2} \right)$ reward
sub-optimality and an $O\left(\frac{1}{(1 - \gamma)^2 \sqrt{T}} +
\frac{\varepsilon_b \sqrt{d}}{1 - \gamma} \right)$ constraint violation.
Importantly, unlike gradient descent-ascent and other recent methods, CBP does
not require extensive hyperparameter tuning. Via experiments on synthetic and
Cartpole environments, we demonstrate the effectiveness and robustness of CBP.
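The following is a minimal, illustrative sketch of the primal-dual recipe described in the abstract, specialized to a tiny synthetic tabular CMDP with exact policy evaluation. The toy MDP, the KT-style coin-betting update for the dual variable, the Politex-style softmax primal update, and all constants (`T`, `eta`, the threshold `b`) are assumptions made for illustration; this is not the paper's CBP implementation or its linear function approximation setting.
```python
# Illustrative sketch only: a primal-dual loop for a tiny tabular CMDP.
# The dual variable is updated with a parameter-free, KT-style coin-betting
# rule; the primal policy with a Politex-style softmax over accumulated
# Lagrangian Q-values. The toy MDP and all constants are assumptions.
import numpy as np

np.random.seed(0)
S, A, gamma = 4, 2, 0.9
P = np.random.dirichlet(np.ones(S), size=(S, A))   # P[s, a, :] = next-state dist.
r = np.random.rand(S, A)                           # reward in [0, 1]
c = np.random.rand(S, A)                           # constraint utility in [0, 1]
b = 0.25 / (1 - gamma)                             # require discounted utility >= b
rho = np.ones(S) / S                               # initial state distribution

def policy_eval(pi, f):
    """Exact Q^pi for the per-step signal f (reward or constraint utility)."""
    P_pi = np.einsum("sap,pb->sapb", P, pi).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, f.reshape(-1))
    return q.reshape(S, A)

T = 500
eta = (1 - gamma) * np.sqrt(2 * np.log(A) / T)     # Politex-style step size (assumed)
pi = np.ones((S, A)) / A                           # primal player: uniform start
q_sum = np.zeros((S, A))                           # accumulated Lagrangian Q-values
wealth, grad_sum, lam = 1.0, 0.0, 0.0              # dual player: coin-betting state

for t in range(1, T + 1):
    q_r, q_c = policy_eval(pi, r), policy_eval(pi, c)
    v_c = rho @ (pi * q_c).sum(axis=1)             # expected discounted utility
    g = (1 - gamma) * (b - v_c)                    # normalized "coin": > 0 if violated
    # Dual update: KT-style betting keeps lambda >= 0 without a learning rate.
    wealth += lam * g
    grad_sum += g
    lam = max(0.0, grad_sum / t * wealth)
    # Primal update: softmax over accumulated Q-values of the Lagrangian r + lam*c.
    q_sum += q_r + lam * q_c
    logits = eta * (q_sum - q_sum.max(axis=1, keepdims=True))
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)

print("reward:", rho @ (pi * policy_eval(pi, r)).sum(axis=1),
      "constraint slack:", rho @ (pi * policy_eval(pi, c)).sum(axis=1) - b)
```
The coin-betting dual update is what makes the loop learning-rate free for the dual variable, which mirrors the abstract's claim that CBP, unlike gradient descent-ascent, does not require extensive hyperparameter tuning.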
Related papers
- Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model [16.578348944264505]
We consider infinite-horizon $\gamma$-discounted (linear) constrained Markov decision processes (CMDPs). The objective is to find a policy that maximizes the expected cumulative reward subject to expected cumulative constraints. We propose a primal-dual framework that can leverage any black-box unconstrained MDP solver.
arXiv Detail & Related papers (2025-07-02T19:07:37Z) - Conservative Contextual Bandits: Beyond Linear Representations [29.5947486908322]
Conservative Contextual Bandits (CCBs) address safety in sequential decision making by requiring that an agent's policy, along with minimizing regret, also satisfies a safety constraint. We develop two algorithms, $\mathtt{C-SquareCB}$ and $\mathtt{C-FastCB}$, using Inverse Gap Weighting (IGW) based exploration and an online regression oracle. We show that the safety constraint is satisfied with high probability and that the regret of $\mathtt{C-SquareCB}$ is sub-linear in the horizon $T$.
arXiv Detail & Related papers (2024-12-09T02:57:27Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs [17.62509045102346]
This paper considers the best policy identification problem in online Constrained Markov Decision Processes (CMDPs)
We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability.
Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy.
arXiv Detail & Related papers (2023-09-27T04:33:09Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}\left(H \varepsilon^{-3} \ln \frac{1}{\delta}\right)$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov decision processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as a starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Conservative Natural Policy Gradient Primal-Dual Algorithm [42.83837408373223]
We consider the problem of constrained Markov decision processes (CMDPs) in continuous state-action spaces.
We propose a novel Conservative Natural Policy Gradient Primal-Dual Algorithm (C-NPG-PD) to achieve zero constraint violation.
arXiv Detail & Related papers (2022-06-12T22:31:43Z) - Achieving Zero Constraint Violation for Constrained Reinforcement
Learning via Primal-Dual Approach [37.80609997145897]
Reinforcement learning is widely used in applications where one needs to perform sequential decisions while interacting with the environment.
The problem becomes more challenging when the decision requirement includes satisfying some safety constraints.
Various algorithms are available to solve CMDP problems in a model-free manner to achieve $\epsilon$-optimal cumulative reward with $\epsilon$-feasible policies.
An important question here is whether we can achieve $\epsilon$-optimal cumulative reward with zero constraint violations or not.
arXiv Detail & Related papers (2021-09-13T21:27:03Z) - Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, Staged Sampling + Truncated Planning (SSTP), which interacts with the environment at most $O\left(\frac{S^2 A}{\epsilon^2}\,\mathrm{polylog}\left(\frac{SAH}{\epsilon^2}\right)\right)$ episodes.
arXiv Detail & Related papers (2020-10-12T17:51:19Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)