Online Shielding for Reinforcement Learning
- URL: http://arxiv.org/abs/2212.01861v1
- Date: Sun, 4 Dec 2022 16:00:29 GMT
- Title: Online Shielding for Reinforcement Learning
- Authors: Bettina Könighofer, Julian Rudolf, Alexander Palmisano, Martin Tappler and Roderick Bloem
- Abstract summary: We propose an approach for online safety shielding of RL agents.
During runtime, the shield computes, for each available action, the maximal probability of not violating the safety specification within the next $k$ steps.
Based on this probability and a given threshold, the shield decides whether to block an action from the agent.
- Score: 59.86192283565134
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite the recent impressive results in reinforcement learning (RL), safety
is still one of the major research challenges in RL. RL is a machine-learning
approach to determine near-optimal policies in Markov decision processes
(MDPs). In this paper, we consider the setting where the safety-relevant
fragment of the MDP together with a temporal logic safety specification is
given and many safety violations can be avoided by planning ahead a short time
into the future. We propose an approach for online safety shielding of RL
agents. During runtime, the shield analyses the safety of each available
action. For any action, the shield computes the maximal probability to not
violate the safety specification within the next $k$ steps when executing this
action. Based on this probability and a given threshold, the shield decides
whether to block an action from the agent. Existing offline shielding
approaches compute exhaustively the safety of all state-action combinations
ahead of time, resulting in huge computation times and large memory
consumption. The intuition behind online shielding is to compute at runtime the
set of all states that could be reached in the near future. For each of these
states, the safety of all available actions is analysed and used for shielding
as soon as one of the considered states is reached. Our approach is well suited
for high-level planning problems where the time between decisions can be used
for safety computations and it is sustainable for the agent to wait until these
computations are finished. For our evaluation, we selected a 2-player version
of the classical computer game SNAKE. The game represents a high-level planning
problem that requires fast decisions and the multiplayer setting induces a
large state space, which is computationally expensive to analyse exhaustively.
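To make the shielding rule concrete, the following is a minimal sketch (not the authors' implementation) of the decision described above: for each available action, the maximal probability of staying safe for the next $k$ steps is computed by backward induction over the safety-relevant MDP fragment, and actions whose probability falls below the threshold are blocked. The model interface (`actions`, `succ_distribution`, `is_violation`) and the naive recursion are illustrative assumptions.

```python
# Illustrative sketch only -- not the authors' implementation.
# Assumed, hypothetical interface of the safety-relevant MDP fragment:
#   model.actions(state)          -> iterable of actions enabled in `state`
#   model.succ_distribution(s, a) -> list of (next_state, probability) pairs
#   model.is_violation(state)     -> True if `state` violates the safety spec

def max_safety_probability(model, state, horizon):
    """Maximal probability of not violating the safety spec within `horizon` steps."""
    if model.is_violation(state):
        return 0.0
    if horizon == 0:
        return 1.0
    return max(
        (action_safety_probability(model, state, a, horizon) for a in model.actions(state)),
        default=1.0,  # assumption: a state without actions is treated as absorbing and safe
    )

def action_safety_probability(model, state, action, horizon):
    """Safety probability of executing `action` in `state` and acting safely afterwards."""
    return sum(
        prob * max_safety_probability(model, succ, horizon - 1)
        for succ, prob in model.succ_distribution(state, action)
    )

def shield(model, state, horizon, threshold):
    """Allow only actions whose k-step safety probability reaches `threshold`."""
    return [
        a for a in model.actions(state)
        if action_safety_probability(model, state, a, horizon) >= threshold
    ]
```

In the setting described in the abstract, these probabilities would be computed during the time between decisions for all states reachable in the near future, so that the shielding decision at the next state is an immediate lookup; the plain recursion above only illustrates the quantity being computed.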
Related papers
- Long-term Safe Reinforcement Learning with Binary Feedback [5.684409853507594]
Long-term Binary Safe RL (LoBiSaRL) is a safe RL algorithm for constrained Markov decision processes.
Our theoretical results show that LoBiSaRL guarantees the long-term safety constraint with high probability.
arXiv Detail & Related papers (2024-01-08T10:07:31Z)
- Safe POMDP Online Planning via Shielding [6.234405592444883]
Partially observable Markov decision processes (POMDPs) have been widely used in many robotic applications for sequential decision-making under uncertainty.
POMDP online planning algorithms such as Partially Observable Monte-Carlo Planning (POMCP) can solve very large POMDPs with the goal of maximizing the expected return.
But the resulting policies cannot provide safety guarantees, which are imperative for real-world safety-critical tasks.
arXiv Detail & Related papers (2023-09-19T00:02:05Z)
- Safety Margins for Reinforcement Learning [53.10194953873209]
We show how to leverage proxy criticality metrics to generate safety margins.
We evaluate our approach on learned policies from APE-X and A3C within an Atari environment.
arXiv Detail & Related papers (2023-07-25T16:49:54Z)
- Approximate Shielding of Atari Agents for Safe Exploration [83.55437924143615]
We propose a principled algorithm for safe exploration based on the concept of shielding.
We present preliminary results that show our approximate shielding algorithm effectively reduces the rate of safety violations.
arXiv Detail & Related papers (2023-04-21T16:19:54Z)
- Automata Learning meets Shielding [1.1417805445492082]
Safety is still one of the major research challenges in reinforcement learning (RL).
In this paper, we address the problem of how to avoid safety violations of RL agents during exploration in probabilistic and partially unknown environments.
Our approach combines automata learning for Markov Decision Processes (MDPs) and shield synthesis in an iterative approach.
arXiv Detail & Related papers (2022-12-04T14:58:12Z)
- Provable Safe Reinforcement Learning with Binary Feedback [62.257383728544006]
We consider the problem of provably safe RL when given access to an offline oracle providing binary feedback on the safety of state-action pairs.
We provide a novel meta-algorithm, SABRE, which can be applied to any MDP setting given access to a black-box PAC RL algorithm for that setting.
arXiv Detail & Related papers (2022-10-26T05:37:51Z)
- Safe Reinforcement Learning by Imagining the Near Future [37.0376099401243]
In this work, we focus on the setting where unsafe states can be avoided by planning ahead a short time into the future.
We devise a model-based algorithm that heavily penalizes unsafe trajectories, and derive guarantees that our algorithm can avoid unsafe states under certain assumptions.
Experiments demonstrate that our algorithm can achieve competitive rewards with fewer safety violations in several continuous control tasks.
arXiv Detail & Related papers (2022-02-15T23:28:24Z)
- SAUTE RL: Almost Surely Safe Reinforcement Learning Using State Augmentation [63.25418599322092]
Satisfying safety constraints almost surely (or with probability one) can be critical for deployment of Reinforcement Learning (RL) in real-life applications.
We address the problem by introducing Safety Augmented Markov Decision Processes (MDPs).
We show that Saute MDPs allow viewing the safety-augmentation problem from a different perspective, enabling new features.
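The following is a minimal sketch of the state-augmentation idea, assuming a gym-style environment whose `step` returns an `info` dict with a per-step safety cost under the key `"cost"` (a hypothetical convention); the remaining safety budget is appended to the observation and the reward is replaced by a large penalty once the budget is exhausted. The normalization and the penalty value are illustrative choices, not the paper's exact construction.

```python
import numpy as np

class SafetyAugmentedWrapper:
    """Illustrative safety-state augmentation (not the paper's exact construction).

    Assumes `env.step(action)` returns (obs, reward, done, info) and that
    `info["cost"]` holds the per-step safety cost (a hypothetical convention).
    """

    def __init__(self, env, budget, unsafe_penalty=-1000.0):
        self.env = env
        self.budget = float(budget)          # total allowed safety cost per episode
        self.unsafe_penalty = unsafe_penalty  # illustrative stand-in for "maximally undesirable"
        self.remaining = self.budget

    def reset(self):
        self.remaining = self.budget
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.remaining -= info.get("cost", 0.0)
        if self.remaining < 0.0:
            # Budget exhausted: replace the reward by a large penalty.
            reward = self.unsafe_penalty
        return self._augment(obs), reward, done, info

    def _augment(self, obs):
        # Append the normalized remaining budget to the observation.
        z = np.array([max(self.remaining, 0.0) / self.budget], dtype=np.float32)
        return np.concatenate([np.asarray(obs, dtype=np.float32).ravel(), z])
```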
arXiv Detail & Related papers (2022-02-14T08:57:01Z)
- Safe Reinforcement Learning with Linear Function Approximation [48.75026009895308]
We introduce safety as an unknown linear cost function of states and actions, which must always fall below a certain threshold.
We then present algorithms, termed SLUCB-QVI and RSLUCB-QVI, for episodic Markov decision processes (MDPs) with linear function approximation.
We show that SLUCB-QVI and RSLUCB-QVI, while incurring no safety violation, achieve a $\tilde{\mathcal{O}}\left(\kappa\sqrt{d^3H^3T}\right)$ regret, nearly matching
arXiv Detail & Related papers (2021-06-11T08:46:57Z)
- Learning to Act Safely with Limited Exposure and Almost Sure Certainty [1.0323063834827415]
This paper aims to put forward the concept that learning to take safe actions in unknown environments, even with probability one guarantees, can be achieved without the need for exploratory trials.
We first focus on the canonical multi-armed bandit problem and seek to study the intrinsic trade-offs of learning safety in the presence of uncertainty.
arXiv Detail & Related papers (2021-05-18T18:05:12Z)