Achieving Zero Constraint Violation for Constrained Reinforcement
Learning via Primal-Dual Approach
- URL: http://arxiv.org/abs/2109.06332v1
- Date: Mon, 13 Sep 2021 21:27:03 GMT
- Title: Achieving Zero Constraint Violation for Constrained Reinforcement
Learning via Primal-Dual Approach
- Authors: Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel and Vaneet
Aggarwal
- Abstract summary: Reinforcement learning is widely used in applications where one needs to perform sequential decisions while interacting with the environment.
The problem becomes more challenging when the decision requirement includes satisfying some safety constraints.
Various algorithms are available to solve CMDP problems in a model-free manner to achieve $\epsilon$-optimal cumulative reward with $\epsilon$-feasible policies.
An important question here is whether we can achieve $epsilon$-optimal cumulative reward with zero constraint violations or not.
- Score: 37.80609997145897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning is widely used in applications where one needs to
perform sequential decisions while interacting with the environment. The
problem becomes more challenging when the decision requirement includes
satisfying some safety constraints. The problem is mathematically formulated as a
constrained Markov decision process (CMDP). In the literature, various
algorithms are available to solve CMDP problems in a model-free manner to
achieve $\epsilon$-optimal cumulative reward with $\epsilon$-feasible policies.
An $\epsilon$-feasible policy is one that may still violate the constraints by up
to $\epsilon$. An important question here is whether we can achieve
$\epsilon$-optimal cumulative reward with zero constraint violations or not. To
achieve that, we advocate the use of a randomized primal-dual approach to
solving the CMDP problems and propose a conservative stochastic primal-dual
algorithm (CSPDA) which is shown to exhibit $\tilde{\mathcal{O}}(1/\epsilon^2)$
sample complexity to achieve $\epsilon$-optimal cumulative reward with zero
constraint violations. In the prior works, the best available sample complexity
for the $\epsilon$-optimal policy with zero constraint violation is
$\tilde{\mathcal{O}}(1/\epsilon^5)$. Hence, the proposed algorithm provides a
significant improvement as compared to the state of the art.
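To make the setting and the primal-dual idea concrete: in the standard discounted CMDP formulation used in this line of work (generic notation, not necessarily the paper's exact symbols), one maximizes the expected discounted reward $V_r(\pi)$ subject to an expected discounted utility constraint $V_g(\pi) \ge b$; an $\epsilon$-feasible policy only guarantees $V_g(\pi) \ge b - \epsilon$, while zero constraint violation requires $V_g(\pi) \ge b$ exactly. The snippet below is a minimal, illustrative Python sketch of a Lagrangian primal-dual loop on a small synthetic tabular CMDP. It is not the paper's CSPDA (which is sample-based and carries the stated $\tilde{\mathcal{O}}(1/\epsilon^2)$ guarantee); it only illustrates the two ingredients named in the abstract: alternating primal policy updates with dual ascent on the constraint multiplier, and a conservative tightening of the constraint threshold (the slack `kappa` below). The MDP instance, threshold `b`, slack `kappa`, and step sizes are arbitrary illustrative choices.

```python
# Illustrative primal-dual sketch for a tabular CMDP (NOT the paper's CSPDA).
# Maximize V_r(pi) subject to V_g(pi) >= b, via the Lagrangian
#   L(pi, lam) = V_r(pi) + lam * (V_g(pi) - (b + kappa)),
# where kappa > 0 is a conservative slack on the constraint threshold.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel, shape (S, A, S)
r = rng.uniform(size=(S, A))                 # reward signal
g = rng.uniform(size=(S, A))                 # constraint (utility) signal
rho = np.ones(S) / S                         # initial state distribution
b, kappa = 4.0, 0.1                          # arbitrary threshold and conservative slack

def value(pi, c):
    """Exact discounted value rho^T v_pi for per-step signal c (evaluation only)."""
    P_pi = np.einsum("sab,sa->sb", P, pi)
    c_pi = np.einsum("sa,sa->s", c, pi)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    return rho @ v

def q_values(pi, c):
    """Exact state-action values of pi for per-step signal c."""
    P_pi = np.einsum("sab,sa->sb", P, pi)
    c_pi = np.einsum("sa,sa->s", c, pi)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    return c + gamma * np.einsum("sab,b->sa", P, v)

pi = np.ones((S, A)) / A                     # uniform initial policy
lam, eta_pi, eta_lam = 0.0, 1.0, 0.5         # dual variable and step sizes
for _ in range(500):
    # Primal step: multiplicative-weights ascent on the Lagrangian Q-values.
    q_lag = q_values(pi, r) + lam * q_values(pi, g)
    q_lag -= q_lag.max(axis=1, keepdims=True)          # numerical stability
    pi = pi * np.exp(eta_pi * q_lag)
    pi /= pi.sum(axis=1, keepdims=True)
    # Dual step: projected ascent on lam for the *tightened* constraint b + kappa.
    slack_violation = (b + kappa) - value(pi, g)
    lam = max(0.0, lam + eta_lam * slack_violation)

print("reward value:", value(pi, r))
print("constraint value (want >= %.1f):" % b, value(pi, g))
```

The conservative slack is the mechanism behind the zero-violation goal: by targeting $b + \kappa$ rather than $b$ during optimization, the estimation and optimization error in the primal-dual iterates can be absorbed before the original constraint $V_g(\pi) \ge b$ is breached; in analyses of this kind the slack is typically chosen as a function of the target accuracy $\epsilon$.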
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs [17.62509045102346]
This paper considers the best policy identification problem in online Constrained Markov Decision Processes (CMDPs).
We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability.
Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy.
arXiv Detail & Related papers (2023-09-27T04:33:09Z) - Safe Posterior Sampling for Constrained MDPs with Bounded Constraint
Violation [8.849815837266977]
Constrained Markov decision processes (CMDPs) model scenarios of sequential decision making with multiple objectives that are increasingly important in many applications.
We propose the Safe PSRL (posterior sampling-based RL) algorithm that does not need such assumptions and yet performs very well.
We establish a sub-linear $\tilde{\mathcal{O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K}\right)$ upper bound on the Bayesian reward objective regret along with a bounded constraint violation.
arXiv Detail & Related papers (2023-01-27T06:18:25Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of bias of any optimal policy, $\varepsilon$ is the accuracy and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov decision processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
arXiv Detail & Related papers (2022-04-11T15:08:09Z) - Model-Free Algorithm and Regret Analysis for MDPs with Long-Term
Constraints [38.2783003051101]
This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDP with long-term constraints.
We note that these are the first results on regret analysis for MDP with long-term constraints, where the transition probabilities are not known a priori.
arXiv Detail & Related papers (2020-06-10T17:19:29Z) - Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints [38.2783003051101]
This paper considers the peak Constrained Markov Decision Process (PCMDP), where the agent chooses the policy to maximize total reward in the finite horizon as well as satisfy constraints at each epoch with probability 1.
We propose a model-free algorithm that converts the PCMDP problem into an unconstrained problem, to which a Q-learning based approach is applied.
arXiv Detail & Related papers (2020-03-11T23:23:29Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)