Learning Policies with Zero or Bounded Constraint Violation for
Constrained MDPs
- URL: http://arxiv.org/abs/2106.02684v1
- Date: Fri, 4 Jun 2021 19:46:55 GMT
- Title: Learning Policies with Zero or Bounded Constraint Violation for
Constrained MDPs
- Authors: Tao Liu, Ruida Zhou, Dileep Kalathil, P. R. Kumar, Chao Tian
- Abstract summary: We pose the problem in an episodic framework of a constrained Markov decision process.
It is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation.
We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability.
- Score: 17.825031573375725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address the issue of safety in reinforcement learning. We pose the problem
in an episodic framework of a constrained Markov decision process. Existing
results have shown that it is possible to achieve a reward regret of
$\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an
$\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A
critical question that arises is whether it is possible to keep the constraint
violation even smaller. We show that when a strictly safe policy is known, then
one can confine the system to zero constraint violation with arbitrarily high
probability while keeping the reward regret of order
$\tilde{\mathcal{O}}(\sqrt{K})$. The algorithm which does so employs the
principle of optimistic pessimism in the face of uncertainty to achieve safe
exploration. When no strictly safe policy is known, though one is known to
exist, then it is possible to restrict the system to bounded constraint
violation with arbitrarily high probability. This is shown to be realized by a
primal-dual algorithm with an optimistic primal estimate and a pessimistic dual
update.
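The primal-dual mechanism described above (optimistic primal estimate, pessimistic dual update) can be sketched in a few lines. The snippet below is a minimal, illustrative sketch only, assuming a finite set of candidate policies and hypothetical confidence-bound callables (`reward_ucb`, `cost_ucb`, `cost_lcb`, `run_episode`); it is not the authors' algorithm and omits how the confidence bounds are constructed.

```python
import numpy as np

# Illustrative sketch (not the paper's exact algorithm): a primal-dual loop
# for an episodic CMDP with a single cost constraint C(pi) <= cost_budget.
# "Optimistic primal estimate": policies are scored with an upper confidence
# bound on reward and a lower confidence bound on cost.
# "Pessimistic dual update": the Lagrange multiplier is updated with an upper
# confidence bound on cost, as if the constraint were tighter than estimated.

def primal_dual_sketch(policies, num_episodes, cost_budget,
                       reward_ucb, cost_ucb, cost_lcb, run_episode, eta=0.1):
    lam = 0.0  # dual variable (Lagrange multiplier)
    for _ in range(num_episodes):
        # Optimistic primal step: pick the policy maximizing an optimistic Lagrangian.
        scores = [reward_ucb(pi) - lam * cost_lcb(pi) for pi in policies]
        pi_k = policies[int(np.argmax(scores))]

        run_episode(pi_k)  # collect data; confidence bounds would be refreshed here

        # Pessimistic dual step: use the cost *upper* bound in the multiplier update.
        lam = max(0.0, lam + eta * (cost_ucb(pi_k) - cost_budget))
    return lam

# Toy usage with constant (hypothetical) confidence bounds, just to show the call shape.
estimates = {"safe_baseline": (0.4, 0.2), "greedy": (0.9, 0.7)}  # (reward, cost)
lam_final = primal_dual_sketch(
    policies=list(estimates),
    num_episodes=100,
    cost_budget=0.5,
    reward_ucb=lambda p: estimates[p][0] + 0.05,
    cost_ucb=lambda p: estimates[p][1] + 0.05,
    cost_lcb=lambda p: estimates[p][1] - 0.05,
    run_episode=lambda p: None,
)
```

One way to read the asymmetry is that the optimistic primal score encourages exploration, while the pessimistic dual update raises the penalty whenever a violation is even plausible, which is the intuition behind the bounded-violation guarantee.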
Related papers
- Learning Constrained Markov Decision Processes With Non-stationary Rewards and Constraints [34.7178680288326]
In constrained Markov decision processes (CMDPs) with adversarial rewards and constraints, a well-known impossibility result prevents any algorithm from attaining sublinear regret and sublinear constraint violation.
We show that this negative result can be eased in CMDPs with non-stationary rewards and constraints, by providing algorithms whose performances smoothly degrade as non-stationarity increases.
arXiv Detail & Related papers (2024-05-23T09:48:48Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - A Near-Optimal Algorithm for Safe Reinforcement Learning Under
Instantaneous Hard Constraints [43.895798638743784]
We develop the first near-optimal safe RL algorithm for episodic Markov Decision Processes with unsafe states and actions.
It achieves a regret $\tilde{O}\big(\frac{d H^3 \sqrt{dK}}{\Delta_c}\big)$ that tightly matches the state-of-the-art regret in the setting.
We also provide a lower bound $\tilde{\Omega}\big(\max\big\{dH\sqrt{K}, \frac{H}{\Delta_c^2}\big\}\big)$, which indicates that the dependency on $\Delta_c$ is necessary.
arXiv Detail & Related papers (2023-02-08T23:42:04Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Safe Posterior Sampling for Constrained MDPs with Bounded Constraint
Violation [8.849815837266977]
Constrained Markov decision processes (CMDPs) model scenarios of sequential decision making with multiple objectives that are increasingly important in many applications.
We propose the Safe PSRL (posterior sampling-based RL) algorithm that does not need such assumptions and yet performs very well.
We establish a sub-linear $\tilde{\mathcal{O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K}\right)$ upper bound on the Bayesian reward objective regret, along with a bounded constraint violation.
arXiv Detail & Related papers (2023-01-27T06:18:25Z) - Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the complete policy regret.
We provide an algorithm that attains, w.h.p., a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z) - Safe Exploration for Constrained Reinforcement Learning with Provable
Guarantees [2.379828460137829]
We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm.
We show that it achieves an $\tilde{\mathcal{O}}\left(S^2 \sqrt{A H^7 K} / (\bar{C} - \bar{C}_b)\right)$ cumulative regret without violating the safety constraints during learning.
arXiv Detail & Related papers (2021-12-01T23:21:48Z) - Achieving Zero Constraint Violation for Constrained Reinforcement
Learning via Primal-Dual Approach [37.80609997145897]
Reinforcement learning is widely used in applications where one needs to perform sequential decisions while interacting with the environment.
The problem becomes more challenging when the decision requirement includes satisfying some safety constraints.
Various algorithms are available to solve CMDP problems in a model-free manner to achieve $\epsilon$-optimal cumulative reward with $\epsilon$-feasible policies.
An important question here is whether we can achieve $epsilon$-optimal cumulative reward with zero constraint violations or not.
arXiv Detail & Related papers (2021-09-13T21:27:03Z) - Safe Learning under Uncertain Objectives and Constraints [66.05180398174286]
We consider unknown yet safety-critical optimization problems under unknown yet safety-critical constraints.
Such problems naturally arise in a variety of domains including robotics, manufacturing, and medical procedures.
A crucial component of our analysis is to introduce and apply a technique called shrinkage in the context of safe optimization.
arXiv Detail & Related papers (2020-06-23T20:51:00Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)