Safe Reinforcement Learning with Instantaneous Constraints: The Role of
Aggressive Exploration
- URL: http://arxiv.org/abs/2312.14470v1
- Date: Fri, 22 Dec 2023 06:45:45 GMT
- Title: Safe Reinforcement Learning with Instantaneous Constraints: The Role of
Aggressive Exploration
- Authors: Honghao Wei, Xin Liu, Lei Ying
- Abstract summary: We study safe Reinforcement Learning (safe RL) with linear function approximation and under hard instantaneous constraints.
Our proposed algorithm, LSVI-AE, achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ regret and $\tilde{\mathcal{O}}(H\sqrt{dK})$ hard constraint violation when the cost function is linear, and $\mathcal{O}(H\gamma_K \sqrt{K})$ hard constraint violation when the cost function belongs to an RKHS.
- Score: 20.630973009400574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies safe Reinforcement Learning (safe RL) with linear function
approximation and under hard instantaneous constraints where unsafe actions
must be avoided at each step. Existing studies have considered safe RL with
hard instantaneous constraints, but their approaches rely on several key
assumptions: $(i)$ the RL agent knows a safe action set for {\it every} state
or knows a {\it safe graph} in which all the state-action-state triples are
safe, and $(ii)$ the constraint/cost functions are {\it linear}. In this paper,
we consider safe RL with instantaneous hard constraints without assumption
$(i)$ and generalize $(ii)$ to Reproducing Kernel Hilbert Space (RKHS). Our
proposed algorithm, LSVI-AE, achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^4K})$ regret and
$\tilde{\mathcal{O}}(H \sqrt{dK})$ hard constraint violation when the cost function is
linear and $\mathcal{O}(H\gamma_K \sqrt{K})$ hard constraint violation when the cost
function belongs to RKHS. Here $K$ is the learning horizon, $H$ is the length
of each episode, and $\gamma_K$ is the information gain w.r.t. the kernel used
to approximate cost functions. Our results achieve the optimal dependency on
the learning horizon $K$, matching the lower bound we provide in this paper and
demonstrating the efficiency of LSVI-AE. Notably, the design of our approach
encourages aggressive policy exploration, providing a unique perspective on
safe RL with general cost functions and no prior knowledge of safe actions,
which may be of independent interest.
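The mechanism described above, exploring aggressively while screening actions through a learned cost model, can be illustrated with a small sketch. The code below is not the authors' LSVI-AE algorithm; it is a hypothetical LSVI-style action-selection step assuming a linear cost $c(s,a) = \langle \phi(s,a), \theta \rangle$, a known per-step cost threshold `tau`, and ridge-regression estimates with an elliptical confidence bonus. All names (`phi`, `w_q`, `theta_c`, `Lambda_inv`, `beta`) are illustrative, not taken from the paper.

```python
import numpy as np

def optimistic_safe_action(phi, actions, state, w_q, theta_c, Lambda_inv, beta, tau):
    """Hypothetical LSVI-style step: pick the action with the highest optimistic
    value among actions whose optimistic (lower-confidence) cost estimate still
    satisfies the instantaneous constraint c(s, a) <= tau.

    This is a sketch of the general optimism-under-constraints idea, not the
    paper's exact bonus or safety test.
    """
    best_action, best_value = None, -np.inf
    for a in actions:
        x = phi(state, a)                              # feature vector, shape (d,)
        bonus = beta * np.sqrt(x @ Lambda_inv @ x)     # elliptical confidence width
        q_ucb = x @ w_q + bonus                        # optimistic value estimate
        c_lcb = x @ theta_c - bonus                    # lowest plausible cost
        if c_lcb <= tau and q_ucb > best_value:        # keep actions that may be safe
            best_action, best_value = a, q_ucb
    return best_action
```

Using a low (optimistic) cost estimate in the safety test keeps the candidate action set large early in learning, which is one reading of the "aggressive exploration" the abstract emphasizes; a pessimistic test would shrink the candidate set and can stall when no safe action set is known a priori. In the RKHS case, the linear feature map would be replaced by kernel ridge regression, which is where the information-gain term $\gamma_K$ in the violation bound comes from.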
Related papers
- Provably Efficient Action-Manipulation Attack Against Continuous Reinforcement Learning [49.48615590763914]
We propose a black-box attack algorithm named LCBT, which uses the Monte Carlo tree search method for efficient action searching and manipulation.
We evaluate the proposed attack methods on three aggressive algorithms, DDPG, PPO, and TD3, in continuous settings, and observe promising attack performance.
arXiv Detail & Related papers (2024-11-20T08:20:29Z) - Provably Safe Reinforcement Learning with Step-wise Violation
Constraints [26.020907891512596]
We consider stricter step-wise violation constraints and do not assume the existence of safe actions.
We propose a novel algorithm SUCBVI, which guarantees $\widetilde{O}(\sqrt{ST})$ step-wise violation and $\widetilde{O}(\sqrt{H^3SAT})$ regret.
We also study a novel safe reward-free exploration problem with step-wise violation constraints.
arXiv Detail & Related papers (2023-02-13T02:56:04Z) - A Near-Optimal Algorithm for Safe Reinforcement Learning Under
Instantaneous Hard Constraints [43.895798638743784]
We develop the first near-optimal safe RL algorithm for episodic Markov Decision Processes with unsafe states and actions.
It achieves a regret $\tilde{O}(\frac{d H^3 \sqrt{dK}}{\Delta_c})$ that tightly matches the state-of-the-art regret in the setting.
We also provide a lower bound $\tilde{\Omega}(\max\{dH\sqrt{K}, \frac{H}{\Delta_c^2}\})$, which indicates that the dependency on $\Delta_c$ is necessary.
arXiv Detail & Related papers (2023-02-08T23:42:04Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption, and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Safe Exploration for Constrained Reinforcement Learning with Provable
Guarantees [2.379828460137829]
We propose a model-based safe RL algorithm that we call the Optimistic-Pessimistic Safe Reinforcement Learning (OPSRL) algorithm.
We show that it achieves an $\tilde{\mathcal{O}}(S^2\sqrt{A H^7 K}/(\bar{C} - \bar{C}_b))$ cumulative regret without violating the safety constraints during learning.
arXiv Detail & Related papers (2021-12-01T23:21:48Z) - Safe Reinforcement Learning with Linear Function Approximation [48.75026009895308]
We introduce safety as an unknown linear cost function of states and actions, which must always fall below a certain threshold.
We then present algorithms, termed SLUCB-QVI and RSLUCB-QVI, for episodic Markov decision processes (MDPs) with linear function approximation.
We show that SLUCB-QVI and RSLUCB-QVI, while incurring no safety violation, achieve an $\tilde{\mathcal{O}}\left(\kappa\sqrt{d^3H^3T}\right)$ regret, nearly matching that of their unsafe counterparts.
arXiv Detail & Related papers (2021-06-11T08:46:57Z) - Safe Learning under Uncertain Objectives and Constraints [66.05180398174286]
We consider non-convex optimization problems under unknown yet safety-critical constraints.
Such problems naturally arise in a variety of domains including robotics, manufacturing, and medical procedures.
A crucial component of our analysis is to introduce and apply a technique called shrinkage in the context of safe optimization.
arXiv Detail & Related papers (2020-06-23T20:51:00Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)