Safe Reinforcement Learning with Instantaneous Constraints: The Role of
Aggressive Exploration
- URL: http://arxiv.org/abs/2312.14470v1
- Date: Fri, 22 Dec 2023 06:45:45 GMT
- Title: Safe Reinforcement Learning with Instantaneous Constraints: The Role of
Aggressive Exploration
- Authors: Honghao Wei, Xin Liu, Lei Ying
- Abstract summary: We study safe Reinforcement Learning (safe RL) with linear function approximation and under hard instantaneous constraints.
Our proposed algorithm, LSVI-AE, achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ regret and $\tilde{\mathcal{O}}(H\sqrt{dK})$ hard constraint violation when the cost function is linear, and $\mathcal{O}(H\gamma_K \sqrt{K})$ hard constraint violation when the cost function belongs to an RKHS.
- Score: 20.630973009400574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies safe Reinforcement Learning (safe RL) with linear function
approximation and under hard instantaneous constraints where unsafe actions
must be avoided at each step. Existing studies have considered safe RL with
hard instantaneous constraints, but their approaches rely on several key
assumptions: $(i)$ the RL agent knows a safe action set for {\it every} state
or knows a {\it safe graph} in which all the state-action-state triples are
safe, and $(ii)$ the constraint/cost functions are {\it linear}. In this paper,
we consider safe RL with instantaneous hard constraints without assumption
$(i)$ and generalize $(ii)$ to Reproducing Kernel Hilbert Space (RKHS). Our
proposed algorithm, LSVI-AE, achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ regret and
$\tilde{\mathcal{O}}(H \sqrt{dK})$ hard constraint violation when the cost function is
linear and $\mathcal{O}(H\gamma_K \sqrt{K})$ hard constraint violation when the cost
function belongs to RKHS. Here $K$ is the learning horizon, $H$ is the length
of each episode, and $\gamma_K$ is the information gain w.r.t. the kernel used
to approximate cost functions. Our results achieve the optimal dependency on
the learning horizon $K$, matching the lower bound we provide in this paper and
demonstrating the efficiency of LSVI-AE. Notably, the design of our approach
encourages aggressive policy exploration, providing a unique perspective on
safe RL with general cost functions and no prior knowledge of safe actions,
which may be of independent interest.
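The mechanism described above, exploring aggressively while screening actions through a learned cost model, can be illustrated with a small sketch. The code below is not the authors' LSVI-AE algorithm; it is a hypothetical LSVI-style action-selection step assuming a linear cost $c(s,a) = \langle \phi(s,a), \theta \rangle$, a known per-step cost threshold `tau`, and ridge-regression estimates with an elliptical confidence bonus. All names (`phi`, `w_q`, `theta_c`, `Lambda_inv`, `beta`) are illustrative, not taken from the paper.

```python
import numpy as np

def optimistic_safe_action(phi, actions, state, w_q, theta_c, Lambda_inv, beta, tau):
    """Hypothetical LSVI-style step: pick the action with the highest optimistic
    value among actions whose optimistic (lower-confidence) cost estimate still
    satisfies the instantaneous constraint c(s, a) <= tau.

    This is a sketch of the general optimism-under-constraints idea, not the
    paper's exact bonus or safety test.
    """
    best_action, best_value = None, -np.inf
    for a in actions:
        x = phi(state, a)                              # feature vector, shape (d,)
        bonus = beta * np.sqrt(x @ Lambda_inv @ x)     # elliptical confidence width
        q_ucb = x @ w_q + bonus                        # optimistic value estimate
        c_lcb = x @ theta_c - bonus                    # lowest plausible cost
        if c_lcb <= tau and q_ucb > best_value:        # keep actions that may be safe
            best_action, best_value = a, q_ucb
    return best_action
```

Using a low (optimistic) cost estimate in the safety test keeps the candidate action set large early in learning, which is one reading of the "aggressive exploration" the abstract emphasizes; a pessimistic test would shrink the candidate set and can stall when no safe action set is known a priori. In the RKHS case, the linear feature map would be replaced by kernel ridge regression, which is where the information-gain term $\gamma_K$ in the violation bound comes from.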
Related papers
- Provably Efficient Action-Manipulation Attack Against Continuous Reinforcement Learning [49.48615590763914]
We propose a black-box attack algorithm named LCBT, which uses the Monte Carlo tree search method for efficient action searching and manipulation.
We evaluate the proposed attack methods on three aggressive algorithms, DDPG, PPO, and TD3, in continuous settings, and observe promising attack performance.
arXiv Detail & Related papers (2024-11-20T08:20:29Z) - Provably Safe Reinforcement Learning with Step-wise Violation
Constraints [26.020907891512596]
We consider stricter step-wise violation constraints and do not assume the existence of safe actions.
We propose a novel algorithm SUCBVI, which guarantees $\widetilde{O}(\sqrt{ST})$ step-wise violation and $\widetilde{O}(\sqrt{H^3SAT})$ regret.
We also study a novel safe reward-free exploration problem with step-wise violation constraints.
arXiv Detail & Related papers (2023-02-13T02:56:04Z) - A Near-Optimal Algorithm for Safe Reinforcement Learning Under
Instantaneous Hard Constraints [43.895798638743784]
We develop the first near-optimal safe RL algorithm for episodic Markov Decision Processes with unsafe states and actions.
It achieves a regret $\tilde{O}(\frac{d H^3 \sqrt{dK}}{\Delta_c})$ that tightly matches the state-of-the-art regret in the setting.
We also provide a lower bound $\tilde{\Omega}(\max\{dH\sqrt{K}, \frac{H}{\Delta_c^2}\})$, which indicates that the dependency on $\Delta_c$ is necessary.
arXiv Detail & Related papers (2023-02-08T23:42:04Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption, and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Safe Exploration for Constrained Reinforcement Learning with Provable
Guarantees [2.379828460137829]
We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm.
We show that it achieves an $\tilde{\mathcal{O}}(S^2\sqrt{A H^7 K}/(\bar{C} - \bar{C}_b))$ cumulative regret without violating the safety constraints during learning.
arXiv Detail & Related papers (2021-12-01T23:21:48Z) - Safe Reinforcement Learning with Linear Function Approximation [48.75026009895308]
We introduce safety as an unknown linear cost function of states and actions, which must always fall below a certain threshold.
We then present algorithms, termed SLUCB-QVI and RSLUCB-QVI, for episodic Markov decision processes (MDPs) with linear function approximation.
We show that SLUCB-QVI and RSLUCB-QVI, while incurring no safety violation, achieve an $\tilde{\mathcal{O}}\left(\kappa\sqrt{d^3H^3T}\right)$ regret, nearly matching that of their unsafe counterparts.
arXiv Detail & Related papers (2021-06-11T08:46:57Z) - Safe Learning under Uncertain Objectives and Constraints [66.05180398174286]
We consider non-convex optimization problems under unknown yet safety-critical constraints.
Such problems naturally arise in a variety of domains including robotics, manufacturing, and medical procedures.
A crucial component of our analysis is to introduce and apply a technique called shrinkage in the context of safe optimization.
arXiv Detail & Related papers (2020-06-23T20:51:00Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)