Safe Exploration for Constrained Reinforcement Learning with Provable
Guarantees
- URL: http://arxiv.org/abs/2112.00885v1
- Date: Wed, 1 Dec 2021 23:21:48 GMT
- Title: Safe Exploration for Constrained Reinforcement Learning with Provable
Guarantees
- Authors: Archana Bura, Aria HasanzadeZonuzy, Dileep Kalathil, Srinivas
Shakkottai, and Jean-Francois Chamberland
- Abstract summary: We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm.
We show that it achieves an $\tilde{\mathcal{O}}(S^{2}\sqrt{AH^{7}K}/(\bar{C} - \bar{C}_{b}))$ cumulative regret without violating the safety constraints during learning.
- Score: 2.379828460137829
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of learning an episodic safe control policy that
minimizes an objective function, while satisfying necessary safety constraints
-- both during learning and deployment. We formulate this safety constrained
reinforcement learning (RL) problem using the framework of a finite-horizon
Constrained Markov Decision Process (CMDP) with an unknown transition
probability function. Here, we model the safety requirements as constraints on
the expected cumulative costs that must be satisfied during all episodes of
learning. We propose a model-based safe RL algorithm that we call the
Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm, and show
that it achieves an $\tilde{\mathcal{O}}(S^{2}\sqrt{A H^{7}K}/ (\bar{C} -
\bar{C}_{b}))$ cumulative regret without violating the safety constraints
during learning, where $S$ is the number of states, $A$ is the number of
actions, $H$ is the horizon length, $K$ is the number of learning episodes, and
$(\bar{C} - \bar{C}_{b})$ is the safety gap, i.e., the difference between the
constraint value and the cost of a known safe baseline policy. The
$\tilde{\mathcal{O}}(\sqrt{K})$ scaling matches that of the traditional approach in which
constraints may be violated during learning, which means that our algorithm
suffers no additional regret in spite of providing a safety guarantee. Our key
idea is to use an optimistic exploration approach with pessimistic constraint
enforcement for learning the policy. This approach simultaneously incentivizes
the exploration of unknown states while imposing a penalty for visiting states
that are likely to cause violation of safety constraints. We validate our
algorithm by evaluating its performance on benchmark problems against
conventional approaches.
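To make the optimistic-pessimistic idea concrete, the sketch below shows one way such a planner could be instantiated for a tabular finite-horizon CMDP: count-based confidence bonuses are subtracted from the objective backup (optimism, rewarding visits to poorly explored state-action pairs) and added to the safety-cost backup (pessimism, penalizing state-action pairs that might violate the constraint). The bonus form, the greedy backup, and all names are illustrative assumptions, not the paper's exact OPSRL construction (which, for instance, also relies on the known safe baseline policy).

```python
import numpy as np

def optimistic_pessimistic_backup(P_hat, r, c, counts, H, bonus_scale=1.0):
    """Illustrative backward induction for a tabular finite-horizon CMDP.

    P_hat:  (S, A, S) empirical transition estimates
    r:      (S, A) per-step objective costs (to be minimized)
    c:      (S, A) per-step safety costs (their cumulative sum is constrained)
    counts: (S, A) visit counts that drive the confidence bonuses
    Returns optimistic objective Q-values and pessimistic safety-cost Q-values.
    """
    S, A, _ = P_hat.shape
    bonus = bonus_scale / np.sqrt(np.maximum(counts, 1))  # shrinks as (s, a) is explored

    Q_obj = np.zeros((H + 1, S, A))   # optimistic estimate of the minimized objective
    Q_cost = np.zeros((H + 1, S, A))  # pessimistic estimate of the safety cost

    for h in range(H - 1, -1, -1):
        # Continuation values under the objective-greedy choice; a full algorithm
        # would typically use a constrained planning step here instead.
        greedy = Q_obj[h + 1].argmin(axis=1)
        V_obj = Q_obj[h + 1][np.arange(S), greedy]
        V_cost = Q_cost[h + 1][np.arange(S), greedy]
        Q_obj[h] = r - bonus + P_hat @ V_obj    # optimism: bonus lowers the objective
        Q_cost[h] = c + bonus + P_hat @ V_cost  # pessimism: bonus inflates the cost
    return Q_obj, Q_cost
```

A learner could then prefer actions that look good under Q_obj while only certifying them as safe when the pessimistic Q_cost keeps the episode within the cost budget, falling back to the known safe baseline policy otherwise.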
Related papers
- Safe Reinforcement Learning with Instantaneous Constraints: The Role of
Aggressive Exploration [20.630973009400574]
We study safe Reinforcement Learning (safe RL) with linear function approximation and under hard instantaneous constraints.
Our proposed algorithm, LSVI-AE, achieves $\tilde{\mathcal{O}}(\sqrt{d^{3}H^{4}K})$ hard constraint violation when the cost function is linear and $\mathcal{O}(H\gamma_{K}\sqrt{K})$ hard constraint violation when the cost function belongs to an RKHS.
arXiv Detail & Related papers (2023-12-22T06:45:45Z) - Safe Deep Reinforcement Learning by Verifying Task-Level Properties [84.64203221849648]
Cost functions are commonly employed in Safe Deep Reinforcement Learning (DRL).
The cost is typically encoded as an indicator function due to the difficulty of quantifying the risk of policy decisions in the state space.
In this paper, we investigate an alternative approach that uses domain knowledge to quantify the risk in the proximity of such states by defining a violation metric.
arXiv Detail & Related papers (2023-02-20T15:24:06Z) - Provably Safe Reinforcement Learning with Step-wise Violation
Constraints [26.020907891512596]
We consider stricter step-wise violation constraints and do not assume the existence of safe actions.
We propose a novel algorithm SUCBVI, which guarantees $\widetilde{O}(\sqrt{ST})$ step-wise violation and $\widetilde{O}(\sqrt{H^{3}SAT})$ regret.
We also study a novel safe reward-free exploration problem with step-wise violation constraints.
arXiv Detail & Related papers (2023-02-13T02:56:04Z) - A Near-Optimal Algorithm for Safe Reinforcement Learning Under
Instantaneous Hard Constraints [43.895798638743784]
We develop the first near-optimal safe RL algorithm for episodic Markov Decision Processes with unsafe states and actions.
It achieves a regret $\tilde{O}(\frac{dH^{3}\sqrt{dK}}{\Delta_{c}})$ that tightly matches the state-of-the-art regret in the setting.
We also provide a lower bound $\tilde{\Omega}(\max\{dH\sqrt{K}, \frac{H}{\Delta_{c}^{2}}\})$, which indicates that the dependency on $\Delta_{c}$ is necessary.
arXiv Detail & Related papers (2023-02-08T23:42:04Z) - Enhancing Safe Exploration Using Safety State Augmentation [71.00929878212382]
We tackle the problem of safe exploration in model-free reinforcement learning.
We derive policies for scheduling the safety budget during training.
We show that Simmer can stabilize training and improve the performance of safe RL with average constraints.
arXiv Detail & Related papers (2022-06-06T15:23:07Z) - SAUTE RL: Almost Surely Safe Reinforcement Learning Using State
Augmentation [63.25418599322092]
Satisfying safety constraints almost surely (or with probability one) can be critical for deployment of Reinforcement Learning (RL) in real-life applications.
We address the problem by introducing Safety Augmented Markov Decision Processes (MDPs).
We show that the Saute MDP formulation allows one to view the safety augmentation problem from a different perspective, enabling new features (a minimal sketch of this kind of state augmentation appears after this list).
arXiv Detail & Related papers (2022-02-14T08:57:01Z) - Safe Reinforcement Learning with Linear Function Approximation [48.75026009895308]
We introduce safety as an unknown linear cost function of states and actions, which must always fall below a certain threshold.
We then present algorithms, termed SLUCB-QVI and RSLUCB-QVI, for episodic Markov decision processes (MDPs) with linear function approximation.
We show that SLUCB-QVI and RSLUCB-QVI, with no safety violation, achieve a $\tilde{\mathcal{O}}\left(\kappa\sqrt{d^{3}H^{3}T}\right)$ regret, nearly matching that of their state-of-the-art unsafe counterparts.
arXiv Detail & Related papers (2021-06-11T08:46:57Z) - Learning Policies with Zero or Bounded Constraint Violation for
Constrained MDPs [17.825031573375725]
We pose the problem in an episodic framework of a constrained Markov decision process.
It is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation.
We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability.
arXiv Detail & Related papers (2021-06-04T19:46:55Z) - Safe Learning under Uncertain Objectives and Constraints [66.05180398174286]
We consider non-convex optimization problems under unknown yet safety-critical constraints.
Such problems naturally arise in a variety of domains including robotics, manufacturing, and medical procedures.
A crucial component of our analysis is to introduce and apply a technique called shrinkage in the context of safe optimization.
arXiv Detail & Related papers (2020-06-23T20:51:00Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
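As noted in the SAUTE RL entry above, several of these works share a safety-state-augmentation idea: extend the environment state with the remaining safety budget so that the constraint becomes part of the dynamics. The wrapper below is a minimal sketch of that idea under an assumed step() interface that returns (state, reward, cost, done); it is not the API or exact construction of either paper.

```python
class SafetyAugmentedEnv:
    """Minimal sketch of safety-state augmentation (assumed interface, not the
    papers' API): observations are tuples of features, and step() returns
    (state, reward, cost, done)."""

    def __init__(self, env, safety_budget):
        self.env = env
        self.safety_budget = float(safety_budget)
        self.remaining = self.safety_budget

    def reset(self):
        self.remaining = self.safety_budget
        state = self.env.reset()
        return (*state, 1.0)  # append the remaining budget, normalized to [0, 1]

    def step(self, action):
        state, reward, cost, done = self.env.step(action)
        self.remaining -= cost                # deplete the budget by the step cost
        exhausted = self.remaining <= 0.0
        if exhausted:
            reward = 0.0  # one possible choice: remove reward once the budget is spent
        frac = max(self.remaining, 0.0) / self.safety_budget
        return (*state, frac), reward, cost, done or exhausted
```

Because the remaining budget is now part of the observation, a standard unconstrained RL agent trained on the augmented state can, in principle, learn budget-aware behavior, which is the change of perspective these entries describe.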