A Near-Optimal Algorithm for Safe Reinforcement Learning Under
Instantaneous Hard Constraints
- URL: http://arxiv.org/abs/2302.04375v1
- Date: Wed, 8 Feb 2023 23:42:04 GMT
- Title: A Near-Optimal Algorithm for Safe Reinforcement Learning Under
Instantaneous Hard Constraints
- Authors: Ming Shi, Yingbin Liang, Ness Shroff
- Abstract summary: We develop the first near-optimal safe RL algorithm for episodic Markov Decision Processes with unsafe states and actions.
It achieves a regret $\tilde{O}(\frac{d H^3 \sqrt{dK}}{\Delta_c})$ that tightly matches the state-of-the-art regret in the setting with only unsafe actions.
We also provide a lower bound $\tilde{\Omega}(\max\{dH \sqrt{K}, \frac{H}{\Delta_c^2}\})$, which indicates that the dependency on $\Delta_c$ is necessary.
- Score: 43.895798638743784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many applications of Reinforcement Learning (RL), it is critically
important that the algorithm performs safely, such that instantaneous hard
constraints are satisfied at each step, and unsafe states and actions are
avoided. However, existing algorithms for "safe" RL are often designed under
constraints that either require expected cumulative costs to be bounded or
assume all states are safe. Thus, such algorithms could violate instantaneous
hard constraints and traverse unsafe states (and actions) in practice.
Therefore, in this paper, we develop the first near-optimal safe RL algorithm
for episodic Markov Decision Processes with unsafe states and actions under
instantaneous hard constraints and the linear mixture model. It not only
achieves a regret $\tilde{O}(\frac{d H^3 \sqrt{dK}}{\Delta_c})$ that tightly
matches the state-of-the-art regret in the setting with only unsafe actions and
nearly matches that in the unconstrained setting, but is also safe at each
step, where $d$ is the feature-mapping dimension, $K$ is the number of
episodes, $H$ is the number of steps in each episode, and $\Delta_c$ is a
safety-related parameter. We also provide a lower bound
$\tilde{\Omega}(\max\{dH \sqrt{K}, \frac{H}{\Delta_c^2}\})$, which indicates
that the dependency on $\Delta_c$ is necessary. Further, both our algorithm
design and regret analysis involve several novel ideas, which may be of
independent interest.
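For quick reference, the two bounds stated in the abstract can be collected in display form. This is only a restatement using the same symbols ($d$, $K$, $H$, $\Delta_c$); the notation $\mathrm{Regret}(K)$ is introduced here for exposition and is not taken from the paper.

```latex
% Upper bound achieved by the proposed algorithm (d: feature-mapping dimension,
% H: steps per episode, K: number of episodes, Delta_c: safety-related parameter):
\[
  \mathrm{Regret}(K) \;\le\; \tilde{O}\!\left(\frac{d H^{3} \sqrt{dK}}{\Delta_c}\right).
\]
% Accompanying lower bound, indicating that the dependence on Delta_c is necessary:
\[
  \mathrm{Regret}(K) \;\ge\; \tilde{\Omega}\!\left(\max\left\{\, d H \sqrt{K},\; \frac{H}{\Delta_c^{2}} \,\right\}\right).
\]
```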
Related papers
- Safe Reinforcement Learning with Instantaneous Constraints: The Role of
Aggressive Exploration [20.630973009400574]
We study safe Reinforcement Learning (safe RL) with linear function approximation and under hard instantaneous constraints.
Our proposed algorithm, LSVI-AE, achieves $\tilde{\mathcal{O}}(\sqrt{d^3 H^4 K})$ hard constraint violation when the cost function is linear and $\mathcal{O}(H\gamma_K \sqrt{K})$ hard constraint violation when the cost function belongs to an RKHS.
arXiv Detail & Related papers (2023-12-22T06:45:45Z) - Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z) - Provably Safe Reinforcement Learning with Step-wise Violation
Constraints [26.020907891512596]
We consider stricter step-wise violation constraints and do not assume the existence of safe actions.
We propose a novel algorithm SUCBVI, which guarantees $\widetilde{O}(\sqrt{ST})$ step-wise violation and $\widetilde{O}(\sqrt{H^3 SAT})$ regret.
We also study a novel safe reward-free exploration problem with step-wise violation constraints.
arXiv Detail & Related papers (2023-02-13T02:56:04Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear
Contextual Bandits and Markov Decision Processes [59.61248760134937]
We propose an efficient algorithm to achieve a regret of $\tilde{O}(\sqrt{T}+\zeta)$.
The proposed algorithm relies on the recently developed uncertainty-weighted least-squares regression from linear contextual bandits.
We generalize our algorithm to the episodic MDP setting and are the first to achieve an additive dependence on the corruption level $\zeta$.
arXiv Detail & Related papers (2022-12-12T15:04:56Z) - Safe Exploration for Constrained Reinforcement Learning with Provable
Guarantees [2.379828460137829]
We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm.
We show that it achieves an $\tilde{\mathcal{O}}(S^2\sqrt{A H^7 K}/(\bar{C} - \bar{C}_b))$ cumulative regret without violating the safety constraints during learning.
arXiv Detail & Related papers (2021-12-01T23:21:48Z) - Safe Reinforcement Learning with Linear Function Approximation [48.75026009895308]
We introduce safety as an unknown linear cost function of states and actions, which must always fall below a certain threshold.
We then present algorithms, termed SLUCB-QVI and RSLUCB-QVI, for episodic Markov decision processes (MDPs) with linear function approximation.
We show that SLUCB-QVI and RSLUCB-QVI, while incurring no safety violation, achieve a $\tilde{\mathcal{O}}\left(\kappa\sqrt{d^3 H^3 T}\right)$ regret, nearly matching that of the unconstrained setting. (A minimal sketch of this kind of confidence-bound safety check appears after this list.)
arXiv Detail & Related papers (2021-06-11T08:46:57Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
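Several of the papers above (including the main paper and SLUCB-QVI/RSLUCB-QVI) model per-step safety as an unknown linear cost that must stay below a threshold, and certify actions using confidence bounds on that cost. Below is a minimal, self-contained sketch of that general idea, not the algorithm of any specific paper: the feature map `phi`, threshold `tau`, confidence width `beta`, and the toy simulation loop are all assumptions made for illustration.

```python
# Illustrative sketch only (not the method of any paper above): pessimistic
# safe-action filtering when the per-step cost is an unknown linear function
# of a known feature map, c(s, a) = <w*, phi(s, a)>, and an action is allowed
# only if a high-confidence upper bound on its cost stays below a threshold tau.
import numpy as np

rng = np.random.default_rng(0)

d = 4            # feature dimension
n_actions = 5    # finite action set for the toy example
tau = 0.5        # safety threshold: require c(s, a) <= tau at every step
beta = 0.5       # confidence-width scaling (theory derives this from d, H, K, delta)
lam = 1.0        # ridge regularization

w_star = rng.uniform(0.0, 0.3, size=d)   # unknown true cost parameter (simulation only)

def phi(state: np.ndarray, action: int) -> np.ndarray:
    """Toy feature map for a (state, action) pair, normalized to unit length."""
    f = np.cos(state * (action + 1))
    return f / np.linalg.norm(f)

# Ridge-regression statistics for the unknown cost model.
A = lam * np.eye(d)    # Gram matrix of observed features
b = np.zeros(d)        # feature-weighted sum of observed costs

def safe_actions(state: np.ndarray) -> list[int]:
    """Actions whose cost upper confidence bound is below tau (pessimistic check)."""
    A_inv = np.linalg.inv(A)
    w_hat = A_inv @ b
    allowed = []
    for a in range(n_actions):
        x = phi(state, a)
        cost_ucb = x @ w_hat + beta * np.sqrt(x @ A_inv @ x)
        if cost_ucb <= tau:
            allowed.append(a)
    return allowed

# A short interaction loop: play only actions certified safe, then refit the cost model.
for step in range(200):
    state = rng.normal(size=d)
    allowed = safe_actions(state)
    # Fall back on a baseline action assumed safe a priori when nothing is certified
    # yet (the standard "known safe action" assumption in this line of work).
    a = int(rng.choice(allowed)) if allowed else 0
    x = phi(state, a)
    observed_cost = float(x @ w_star) + 0.01 * rng.normal()
    A += np.outer(x, x)
    b += observed_cost * x

print("Actions certified safe at a fresh state:", safe_actions(rng.normal(size=d)))
```

As data accumulates, the confidence widths shrink and more actions can be certified safe; the theoretical papers above characterize how this certification interacts with regret (e.g., through the safety margin $\Delta_c$ in the main paper).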