Learning Policies with Zero or Bounded Constraint Violation for
Constrained MDPs
- URL: http://arxiv.org/abs/2106.02684v1
- Date: Fri, 4 Jun 2021 19:46:55 GMT
- Title: Learning Policies with Zero or Bounded Constraint Violation for
Constrained MDPs
- Authors: Tao Liu, Ruida Zhou, Dileep Kalathil, P. R. Kumar, Chao Tian
- Abstract summary: We pose the problem in an episodic framework of a constrained Markov decision process.
It is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation.
We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability.
- Score: 17.825031573375725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address the issue of safety in reinforcement learning. We pose the problem
in an episodic framework of a constrained Markov decision process. Existing
results have shown that it is possible to achieve a reward regret of
$\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an
$\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A
critical question that arises is whether it is possible to keep the constraint
violation even smaller. We show that when a strictly safe policy is known, then
one can confine the system to zero constraint violation with arbitrarily high
probability while keeping the reward regret of order
$\tilde{\mathcal{O}}(\sqrt{K})$. The algorithm which does so employs the
principle of optimistic pessimism in the face of uncertainty to achieve safe
exploration. When no strictly safe policy is known, though one is known to
exist, then it is possible to restrict the system to bounded constraint
violation with arbitrarily high probability. This is shown to be realized by a
primal-dual algorithm with an optimistic primal estimate and a pessimistic dual
update.
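The primal-dual mechanism described above (optimistic primal estimate, pessimistic dual update) can be sketched in a few lines. The snippet below is a minimal, illustrative sketch only, assuming a finite set of candidate policies and hypothetical confidence-bound callables (`reward_ucb`, `cost_ucb`, `cost_lcb`, `run_episode`); it is not the authors' algorithm and omits how the confidence bounds are constructed.

```python
import numpy as np

# Illustrative sketch (not the paper's exact algorithm): a primal-dual loop
# for an episodic CMDP with a single cost constraint C(pi) <= cost_budget.
# "Optimistic primal estimate": policies are scored with an upper confidence
# bound on reward and a lower confidence bound on cost.
# "Pessimistic dual update": the Lagrange multiplier is updated with an upper
# confidence bound on cost, as if the constraint were tighter than estimated.

def primal_dual_sketch(policies, num_episodes, cost_budget,
                       reward_ucb, cost_ucb, cost_lcb, run_episode, eta=0.1):
    lam = 0.0  # dual variable (Lagrange multiplier)
    for _ in range(num_episodes):
        # Optimistic primal step: pick the policy maximizing an optimistic Lagrangian.
        scores = [reward_ucb(pi) - lam * cost_lcb(pi) for pi in policies]
        pi_k = policies[int(np.argmax(scores))]

        run_episode(pi_k)  # collect data; confidence bounds would be refreshed here

        # Pessimistic dual step: use the cost *upper* bound in the multiplier update.
        lam = max(0.0, lam + eta * (cost_ucb(pi_k) - cost_budget))
    return lam

# Toy usage with constant (hypothetical) confidence bounds, just to show the call shape.
estimates = {"safe_baseline": (0.4, 0.2), "greedy": (0.9, 0.7)}  # (reward, cost)
lam_final = primal_dual_sketch(
    policies=list(estimates),
    num_episodes=100,
    cost_budget=0.5,
    reward_ucb=lambda p: estimates[p][0] + 0.05,
    cost_ucb=lambda p: estimates[p][1] + 0.05,
    cost_lcb=lambda p: estimates[p][1] - 0.05,
    run_episode=lambda p: None,
)
```

One way to read the asymmetry is that the optimistic primal score encourages exploration, while the pessimistic dual update raises the penalty whenever a violation is even plausible, which is the intuition behind the bounded-violation guarantee.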
Related papers
- Learning Constrained Markov Decision Processes With Non-stationary Rewards and Constraints [34.7178680288326]
In constrained Markov decision processes (CMDPs) with adversarial rewards and constraints, a well-known impossibility result prevents any algorithm from attaining sublinear regret and sublinear constraint violation.
We show that this negative result can be eased in CMDPs with non-stationary rewards and constraints, by providing algorithms whose performances smoothly degrade as non-stationarity increases.
arXiv Detail & Related papers (2024-05-23T09:48:48Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - A Near-Optimal Algorithm for Safe Reinforcement Learning Under
Instantaneous Hard Constraints [43.895798638743784]
We develop the first near-optimal safe RL algorithm for episodic Markov Decision Processes with unsafe states and actions.
It achieves a regret $\tilde{O}\big(\frac{d H^3 \sqrt{dK}}{\Delta_c}\big)$ that tightly matches the state-of-the-art regret in the setting.
We also provide a lower bound $\tilde{\Omega}\big(\max\big\{dH\sqrt{K}, \frac{H}{\Delta_c^2}\big\}\big)$, which indicates that the dependency on $\Delta_c$ is necessary.
arXiv Detail & Related papers (2023-02-08T23:42:04Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Safe Posterior Sampling for Constrained MDPs with Bounded Constraint
Violation [8.849815837266977]
Constrained Markov decision processes (CMDPs) model scenarios of sequential decision making with multiple objectives that are increasingly important in many applications.
We propose the Safe PSRL (posterior sampling-based RL) algorithm that does not need such assumptions and yet performs very well.
We establish a sub-linear $\tilde{\mathcal{O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K}\right)$ upper bound on the Bayesian reward objective regret, along with a bounded constraint violation.
arXiv Detail & Related papers (2023-01-27T06:18:25Z) - Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the complete policy regret.
We provide an algorithm that attains, w.h.p., a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z) - Safe Exploration for Constrained Reinforcement Learning with Provable
Guarantees [2.379828460137829]
We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm.
We show that it achieves an $\tilde{\mathcal{O}}\left(S^2 \sqrt{A H^7 K} / (\bar{C} - \bar{C}_b)\right)$ cumulative regret without violating the safety constraints during learning.
arXiv Detail & Related papers (2021-12-01T23:21:48Z) - Achieving Zero Constraint Violation for Constrained Reinforcement
Learning via Primal-Dual Approach [37.80609997145897]
Reinforcement learning is widely used in applications where one needs to perform sequential decisions while interacting with the environment.
The problem becomes more challenging when the decision requirement includes satisfying some safety constraints.
Various algorithms are available to solve CMDP problems in a model-free manner to achieve $\epsilon$-optimal cumulative reward with $\epsilon$-feasible policies.
An important question here is whether we can achieve $epsilon$-optimal cumulative reward with zero constraint violations or not.
arXiv Detail & Related papers (2021-09-13T21:27:03Z) - Safe Learning under Uncertain Objectives and Constraints [66.05180398174286]
We consider unknown yet safety-critical optimization problems under unknown yet safety-critical constraints.
Such problems naturally arise in a variety of domains including robotics, manufacturing, and medical procedures.
A crucial component of our analysis is to introduce and apply a technique called shrinkage in the context of safe optimization.
arXiv Detail & Related papers (2020-06-23T20:51:00Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)