A Near-Optimal Algorithm for Safe Reinforcement Learning Under
Instantaneous Hard Constraints
- URL: http://arxiv.org/abs/2302.04375v1
- Date: Wed, 8 Feb 2023 23:42:04 GMT
- Title: A Near-Optimal Algorithm for Safe Reinforcement Learning Under
Instantaneous Hard Constraints
- Authors: Ming Shi, Yingbin Liang, Ness Shroff
- Abstract summary: We develop the first near-optimal safe RL algorithm for episodic Markov Decision Processes with unsafe states and actions.
It achieves a regret $\tilde{O}(\frac{d H^3 \sqrt{dK}}{\Delta_c})$ that tightly matches the state-of-the-art regret in the setting with only unsafe actions.
We also provide a lower bound $\tilde{\Omega}(\max\{dH \sqrt{K}, \frac{H}{\Delta_c^2}\})$, which indicates that the dependency on $\Delta_c$ is necessary.
- Score: 43.895798638743784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many applications of Reinforcement Learning (RL), it is critically
important that the algorithm performs safely, such that instantaneous hard
constraints are satisfied at each step, and unsafe states and actions are
avoided. However, existing algorithms for "safe" RL are often designed under
constraints that either require expected cumulative costs to be bounded or
assume all states are safe. Thus, such algorithms could violate instantaneous
hard constraints and traverse unsafe states (and actions) in practice.
Therefore, in this paper, we develop the first near-optimal safe RL algorithm
for episodic Markov Decision Processes with unsafe states and actions under
instantaneous hard constraints and the linear mixture model. It not only
achieves a regret $\tilde{O}(\frac{d H^3 \sqrt{dK}}{\Delta_c})$ that tightly
matches the state-of-the-art regret in the setting with only unsafe actions and
nearly matches that in the unconstrained setting, but is also safe at each
step, where $d$ is the feature-mapping dimension, $K$ is the number of
episodes, $H$ is the number of steps in each episode, and $\Delta_c$ is a
safety-related parameter. We also provide a lower bound
$\tilde{\Omega}(\max\{dH \sqrt{K}, \frac{H}{\Delta_c^2}\})$, which indicates
that the dependency on $\Delta_c$ is necessary. Further, both our algorithm
design and regret analysis involve several novel ideas, which may be of
independent interest.
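For quick reference, the two bounds stated in the abstract can be collected in display form. This is only a restatement using the same symbols ($d$, $K$, $H$, $\Delta_c$); the notation $\mathrm{Regret}(K)$ is introduced here for exposition and is not taken from the paper.

```latex
% Upper bound achieved by the proposed algorithm (d: feature-mapping dimension,
% H: steps per episode, K: number of episodes, Delta_c: safety-related parameter):
\[
  \mathrm{Regret}(K) \;\le\; \tilde{O}\!\left(\frac{d H^{3} \sqrt{dK}}{\Delta_c}\right).
\]
% Accompanying lower bound, indicating that the dependence on Delta_c is necessary:
\[
  \mathrm{Regret}(K) \;\ge\; \tilde{\Omega}\!\left(\max\left\{\, d H \sqrt{K},\; \frac{H}{\Delta_c^{2}} \,\right\}\right).
\]
```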
Related papers
- Safe Reinforcement Learning with Instantaneous Constraints: The Role of
Aggressive Exploration [20.630973009400574]
We study safe Reinforcement Learning (safe RL) with linear function approximation and under hard instantaneous constraints.
Our proposed algorithm, LSVI-AE, achieves $\tilde{\mathcal{O}}(\sqrt{d^3 H^4 K})$ hard constraint violation when the cost function is linear and $\mathcal{O}(H\gamma_K \sqrt{K})$ hard constraint violation when the cost function belongs to an RKHS.
arXiv Detail & Related papers (2023-12-22T06:45:45Z) - Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z) - Provably Safe Reinforcement Learning with Step-wise Violation
Constraints [26.020907891512596]
We consider stricter step-wise violation constraints and do not assume the existence of safe actions.
We propose a novel algorithm SUCBVI, which guarantees $\widetilde{O}(\sqrt{ST})$ step-wise violation and $\widetilde{O}(\sqrt{H^3 SAT})$ regret.
We also study a novel safe reward-free exploration problem with step-wise violation constraints.
arXiv Detail & Related papers (2023-02-13T02:56:04Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear
Contextual Bandits and Markov Decision Processes [59.61248760134937]
We propose an efficient algorithm to achieve a regret of $\tilde{O}(\sqrt{T}+\zeta)$.
The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandits.
We generalize our algorithm to the episodic MDP setting and are the first to achieve an additive dependence on the corruption level $\zeta$.
arXiv Detail & Related papers (2022-12-12T15:04:56Z) - Safe Exploration for Constrained Reinforcement Learning with Provable
Guarantees [2.379828460137829]
We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm.
We show that it achieves an $\tilde{\mathcal{O}}(S^2\sqrt{A H^7 K}/(\bar{C} - \bar{C}_b))$ cumulative regret without violating the safety constraints during learning.
arXiv Detail & Related papers (2021-12-01T23:21:48Z) - Safe Reinforcement Learning with Linear Function Approximation [48.75026009895308]
We introduce safety as an unknown linear cost function of states and actions, which must always fall below a certain threshold.
We then present algorithms, termed SLUCB-QVI and RSLUCB-QVI, for episodic Markov decision processes (MDPs) with linear function approximation.
We show that SLUCB-QVI and RSLUCB-QVI, while incurring no safety violation, achieve a $\tilde{\mathcal{O}}\left(\kappa\sqrt{d^3 H^3 T}\right)$ regret, nearly matching that of the unconstrained setting. (A minimal sketch of this kind of confidence-bound safety check appears after this list.)
arXiv Detail & Related papers (2021-06-11T08:46:57Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
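Several of the papers above (including the main paper and SLUCB-QVI/RSLUCB-QVI) model per-step safety as an unknown linear cost that must stay below a threshold, and certify actions using confidence bounds on that cost. Below is a minimal, self-contained sketch of that general idea, not the algorithm of any specific paper: the feature map `phi`, threshold `tau`, confidence width `beta`, and the toy simulation loop are all assumptions made for illustration.

```python
# Illustrative sketch only (not the method of any paper above): pessimistic
# safe-action filtering when the per-step cost is an unknown linear function
# of a known feature map, c(s, a) = <w*, phi(s, a)>, and an action is allowed
# only if a high-confidence upper bound on its cost stays below a threshold tau.
import numpy as np

rng = np.random.default_rng(0)

d = 4            # feature dimension
n_actions = 5    # finite action set for the toy example
tau = 0.5        # safety threshold: require c(s, a) <= tau at every step
beta = 0.5       # confidence-width scaling (theory derives this from d, H, K, delta)
lam = 1.0        # ridge regularization

w_star = rng.uniform(0.0, 0.3, size=d)   # unknown true cost parameter (simulation only)

def phi(state: np.ndarray, action: int) -> np.ndarray:
    """Toy feature map for a (state, action) pair, normalized to unit length."""
    f = np.cos(state * (action + 1))
    return f / np.linalg.norm(f)

# Ridge-regression statistics for the unknown cost model.
A = lam * np.eye(d)    # Gram matrix of observed features
b = np.zeros(d)        # feature-weighted sum of observed costs

def safe_actions(state: np.ndarray) -> list[int]:
    """Actions whose cost upper confidence bound is below tau (pessimistic check)."""
    A_inv = np.linalg.inv(A)
    w_hat = A_inv @ b
    allowed = []
    for a in range(n_actions):
        x = phi(state, a)
        cost_ucb = x @ w_hat + beta * np.sqrt(x @ A_inv @ x)
        if cost_ucb <= tau:
            allowed.append(a)
    return allowed

# A short interaction loop: play only actions certified safe, then refit the cost model.
for step in range(200):
    state = rng.normal(size=d)
    allowed = safe_actions(state)
    # Fall back on a baseline action assumed safe a priori when nothing is certified
    # yet (the standard "known safe action" assumption in this line of work).
    a = int(rng.choice(allowed)) if allowed else 0
    x = phi(state, a)
    observed_cost = float(x @ w_star) + 0.01 * rng.normal()
    A += np.outer(x, x)
    b += observed_cost * x

print("Actions certified safe at a fresh state:", safe_actions(rng.normal(size=d)))
```

As data accumulates, the confidence widths shrink and more actions can be certified safe; the theoretical papers above characterize how this certification interacts with regret (e.g., through the safety margin $\Delta_c$ in the main paper).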