Safe POMDP Online Planning via Shielding
- URL: http://arxiv.org/abs/2309.10216v2
- Date: Sat, 2 Mar 2024 15:49:21 GMT
- Title: Safe POMDP Online Planning via Shielding
- Authors: Shili Sheng, David Parker and Lu Feng
- Abstract summary: Partially observable Markov decision processes (POMDPs) have been widely used in many robotic applications for sequential decision-making under uncertainty.
POMDP online planning algorithms such as Partially Observable Monte-Carlo Planning (POMCP) can solve very large POMDPs with the goal of maximizing the expected return.
But the resulting policies cannot provide safety guarantees, which are imperative for real-world safety-critical tasks.
- Score: 6.234405592444883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Partially observable Markov decision processes (POMDPs) have been widely used
in many robotic applications for sequential decision-making under uncertainty.
POMDP online planning algorithms such as Partially Observable Monte-Carlo
Planning (POMCP) can solve very large POMDPs with the goal of maximizing the
expected return. But the resulting policies cannot provide safety guarantees,
which are imperative for real-world safety-critical tasks (e.g., autonomous
driving). In this work, we consider safety requirements represented as
almost-sure reach-avoid specifications (i.e., the probability to reach a set of
goal states is one and the probability to reach a set of unsafe states is
zero). We compute shields that restrict unsafe actions which would violate the
almost-sure reach-avoid specifications. We then integrate these shields into
the POMCP algorithm for safe POMDP online planning. We propose four distinct
shielding methods, differing in how the shields are computed and integrated,
including factored variants designed to improve scalability. Experimental
results on a set of benchmark domains demonstrate that the proposed shielding
methods successfully guarantee safety (unlike the baseline POMCP without
shielding) on large POMDPs, with negligible impact on the runtime for online
planning.
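To make the mechanism concrete, here is a minimal Python sketch of shield-restricted action selection in the spirit of the paper: a precomputed shield maps each belief support to the actions under which the almost-sure reach-avoid specification remains satisfiable, and the planner's UCB1 step only considers those actions. The `Node` and `Shield` structures and all names are illustrative assumptions, not the authors' implementation.

```python
import math

class Node:
    """POMCP search node, holding the belief support and UCB statistics."""
    def __init__(self, belief_support):
        self.belief_support = frozenset(belief_support)
        self.visits = 0
        self.stats = {}  # action -> (visit_count, mean_return)

class Shield:
    """Maps belief supports to the actions that keep the almost-sure
    reach-avoid specification satisfiable (assumed precomputed)."""
    def __init__(self, safe_actions):
        self.safe_actions = safe_actions  # dict[frozenset, set]

    def allowed(self, belief_support, all_actions):
        return self.safe_actions.get(belief_support, set(all_actions))

def shielded_ucb(node, shield, all_actions, c=1.4):
    """UCB1 action selection restricted to shield-approved actions."""
    best, best_val = None, -math.inf
    for a in shield.allowed(node.belief_support, all_actions):
        n, q = node.stats.get(a, (0, 0.0))
        val = math.inf if n == 0 else q + c * math.sqrt(math.log(node.visits) / n)
        if val > best_val:
            best, best_val = a, val
    return best

# Example: at support {s0, s1} only "wait" and "left" are provably safe,
# so the unsafe "right" is never even considered by the tree search.
shield = Shield({frozenset({"s0", "s1"}): {"wait", "left"}})
root = Node({"s0", "s1"})
root.visits = 10
root.stats = {"wait": (5, 1.0), "left": (5, 2.0), "right": (0, 0.0)}
print(shielded_ucb(root, shield, ["wait", "left", "right"]))  # "left"
```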
Related papers
- ConstrainedZero: Chance-Constrained POMDP Planning using Learned Probabilistic Failure Surrogates and Adaptive Safety Constraints [34.9739641898452]
This work introduces the ConstrainedZero policy algorithm that solves CC-POMDPs in belief space by learning neural network approximations of the optimal value and policy.
Results show that by separating safety constraints from the objective we can achieve a target level of safety without optimizing the balance between rewards and costs.
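As a hedged illustration of chance-constrained action selection (not the ConstrainedZero code), the sketch below picks the highest-value action whose predicted failure probability, given by a stand-in for the learned surrogate, stays under a target level:

```python
def select_action(belief, actions, value_fn, failure_fn, target=0.05):
    """Pick the highest-value action whose predicted failure probability
    stays below the target; fall back to the least-unsafe action."""
    feasible = [a for a in actions if failure_fn(belief, a) <= target]
    if not feasible:
        return min(actions, key=lambda a: failure_fn(belief, a))
    return max(feasible, key=lambda a: value_fn(belief, a))

# Toy surrogates standing in for the learned value and failure networks:
value = lambda b, a: {"fast": 10.0, "slow": 6.0, "stop": 0.0}[a]
fail = lambda b, a: {"fast": 0.20, "slow": 0.03, "stop": 0.00}[a]
print(select_action(None, ["fast", "slow", "stop"], value, fail))  # "slow"
```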
arXiv Detail & Related papers (2024-05-01T17:17:22Z) - Learning Logic Specifications for Policy Guidance in POMDPs: an Inductive Logic Programming Approach [57.788675205519986]
We collect high-quality traces from POMDP executions generated by any solver.
We exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications.
We show that learned specifications expressed in Answer Set Programming (ASP) yield performance superior to neural networks, and similar to optimal handcrafted task-specific heuristics, at lower computational cost.
arXiv Detail & Related papers (2024-02-29T15:36:01Z) - Safety Margins for Reinforcement Learning [74.13100479426424]
We show how to leverage proxy criticality metrics to generate safety margins.
We evaluate our approach on learned policies from APE-X and A3C within an Atari environment.
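A rough sketch of the idea, with an assumed proxy: treat the gap between the best and the average Q-value as criticality, and convert it into a margin on how many random actions could be absorbed before losing more than a tolerance. The mapping below is an illustrative stand-in, not the paper's calibrated metric.

```python
import numpy as np

def proxy_criticality(q_values):
    """Gap between best and average action value: a large gap means a
    wrong action is costly, i.e., the state is critical."""
    q = np.asarray(q_values, dtype=float)
    return q.max() - q.mean()

def safety_margin(q_values, per_step_loss_tolerance=1.0):
    """Number of random actions tolerable before exceeding the tolerance,
    assuming (purely for illustration) each random step loses about the
    proxy criticality in expected return."""
    c = proxy_criticality(q_values)
    if c <= 0:
        return float("inf")  # all actions equally good: no risk
    return int(per_step_loss_tolerance / c)

print(safety_margin([10.0, 2.0, 2.0], per_step_loss_tolerance=12.0))  # 2
```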
arXiv Detail & Related papers (2023-07-25T16:49:54Z) - Model-based Dynamic Shielding for Safe and Efficient Multi-Agent Reinforcement Learning [7.103977648997475]
Multi-Agent Reinforcement Learning (MARL) discovers policies that maximize reward but do not have safety guarantees during the learning and deployment phases.
We propose Model-based Dynamic Shielding (MBDS) to support MARL algorithm design.
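One way to picture dynamic shielding with a model (an illustrative sketch, not the MBDS construction): before committing to an action, check with the model whether any behavior within a short horizon can reach an unsafe state, and block the action if so.

```python
def unsafe_reachable(successors, state, action, unsafe, horizon=2):
    """successors[(s, a)] -> set of possible next states. After taking
    `action` in `state`, explore every behavior for `horizon` further
    steps (worst case) and report whether an unsafe state is reachable."""
    frontier = set(successors[(state, action)])
    for _ in range(horizon):
        if frontier & unsafe:
            return True
        frontier = {s2 for s in frontier
                    for (s1, _), nxt in successors.items() if s1 == s
                    for s2 in nxt}
    return bool(frontier & unsafe)

# Toy two-agent fragment: "near" can reach "collision" in one step,
# so "go" from "far" is blocked under a horizon-2 worst-case check.
succ = {("far", "go"): {"near"}, ("far", "wait"): {"far"},
        ("near", "go"): {"collision"}, ("near", "wait"): {"near"},
        ("collision", "go"): {"collision"}, ("collision", "wait"): {"collision"}}
print(unsafe_reachable(succ, "far", "go", {"collision"}))  # True -> block "go"
```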
arXiv Detail & Related papers (2023-04-13T06:08:10Z) - Online Shielding for Reinforcement Learning [59.86192283565134]
We propose an approach for online safety shielding of RL agents.
During runtime, the shield estimates, for each available action, the probability of violating safety.
Based on this probability and a given threshold, the shield decides whether to block the action from the agent.
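A minimal sketch of that runtime check, under assumed names (a real shield would derive the violation probabilities from a model of the environment):

```python
def shield_filter(actions, violation_prob, threshold=0.1):
    """Keep actions whose estimated probability of reaching an unsafe
    state is at most `threshold`; if all exceed it, keep the least risky
    action rather than leaving the agent with nothing."""
    allowed = [a for a in actions if violation_prob(a) <= threshold]
    return allowed or [min(actions, key=violation_prob)]

risk = {"brake": 0.01, "coast": 0.08, "accelerate": 0.35}
print(shield_filter(list(risk), risk.get))  # ['brake', 'coast']
```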
arXiv Detail & Related papers (2022-12-04T16:00:29Z) - Risk-Averse Decision Making Under Uncertainty [18.467950783426947]
A large class of decision-making problems under uncertainty can be described via Markov decision processes (MDPs) or partially observable MDPs (POMDPs).
In this paper, we consider the problem of designing policies for MDPs and POMDPs with objectives and constraints in terms of dynamic coherent risk measures.
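For a worked example of one coherent risk measure that such objectives build on, the snippet below computes Conditional Value-at-Risk (CVaR), the mean of the worst alpha-fraction of costs; the policies and numbers are made up for illustration:

```python
import numpy as np

def cvar(costs, alpha=0.1):
    """Mean of the worst alpha-fraction of sampled costs."""
    c = np.sort(np.asarray(costs, dtype=float))[::-1]  # descending
    k = max(1, int(np.ceil(alpha * len(c))))
    return c[:k].mean()

# Two hypothetical policies with equal mean cost but different tails:
rng = np.random.default_rng(0)
safe_policy = rng.normal(10.0, 1.0, 10000)
risky_policy = np.where(rng.random(10000) < 0.05, 50.0, 7.9)
print(cvar(safe_policy), cvar(risky_policy))  # the rare-disaster tail shows up in CVaR
```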
arXiv Detail & Related papers (2021-09-09T07:52:35Z) - Safe RAN control: A Symbolic Reinforcement Learning Approach [62.997667081978825]
We present a Symbolic Reinforcement Learning (SRL) based architecture for safety control of Radio Access Network (RAN) applications.
We provide a purely automated procedure in which a user can specify high-level logical safety specifications for a given cellular network topology.
We introduce a user interface (UI) developed to help a user set intent specifications for the system and inspect differences in the agent's proposed actions.
arXiv Detail & Related papers (2021-06-03T16:45:40Z) - Rule-based Shielding for Partially Observable Monte-Carlo Planning [78.05638156687343]
We propose two contributions to Partially Observable Monte-Carlo Planning (POMCP).
The first is a method for identifying unexpected actions selected by POMCP with respect to expert prior knowledge of the task.
The second is a shielding approach that prevents POMCP from selecting unexpected actions.
We evaluate our approach on Tiger, a standard benchmark for POMDPs, and a real-world problem related to velocity regulation in mobile robot navigation.
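A toy sketch of the second contribution on Tiger, with an assumed rule format (the paper expresses such rules in a logical language; this hard-coded version is only illustrative): the shield walks down POMCP's ranking and returns the first action the rule does not flag as unexpected.

```python
def tiger_rule(belief, action):
    """Expert-style rule (illustrative): only open a door when the belief
    that the tiger is behind the *other* door is at least 0.9."""
    p_left = belief["tiger-left"]
    if action == "open-left":
        return p_left <= 0.1
    if action == "open-right":
        return p_left >= 0.9
    return True  # listening is always acceptable

def shielded_choice(belief, ranked_actions, rule=tiger_rule):
    """ranked_actions: actions ordered by POMCP's estimated value.
    Return the best-ranked action that the rule does not flag."""
    for a in ranked_actions:
        if rule(belief, a):
            return a
    return "listen"  # conservative fallback

print(shielded_choice({"tiger-left": 0.5}, ["open-left", "listen"]))  # listen
```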
arXiv Detail & Related papers (2021-04-28T14:23:38Z) - Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between the LDGBA (limit-deterministic generalized Büchi automaton) and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
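A minimal sketch of the product idea (omitting the tracking-frontier machinery that distinguishes the EP-MDP, and with assumed interfaces): pair each MDP state with an automaton state and step both together, so acceptance information is available for reward shaping.

```python
def product_step(mdp_step, automaton_delta, label, state, q, action):
    """One transition of the product: the MDP moves, and the automaton
    reads the label of the successor MDP state."""
    s_next = mdp_step(state, action)            # move in the MDP
    q_next = automaton_delta(q, label(s_next))  # automaton reads the new label
    return s_next, q_next

# Tiny demo: a two-state automaton that latches once label "g" (goal) is seen.
delta = {("q0", "g"): "q1", ("q0", "-"): "q0",
         ("q1", "g"): "q1", ("q1", "-"): "q1"}
label = lambda s: "g" if s == "goal" else "-"
step = lambda s, a: "goal" if a == "go" else s
print(product_step(step, lambda q, l: delta[(q, l)], label, "start", "q0", "go"))
# ('goal', 'q1')
```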
arXiv Detail & Related papers (2021-02-24T01:11:25Z) - Safe reinforcement learning for probabilistic reachability and safety specifications: A Lyapunov-based approach [2.741266294612776]
We propose a model-free safety specification method that learns the maximal probability of safe operation.
Our approach constructs a Lyapunov function with respect to a safe policy to restrain each policy improvement stage.
It yields a sequence of safe policies that determine the range of safe operation, called the safe set.
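An illustrative sketch of the constraint pattern, with assumed tabular estimates: greedy improvement is restricted to actions that keep the estimated probability of becoming unsafe at or below the current Lyapunov value, so each stage stays within the safe set.

```python
def constrained_improvement(q_unsafe, q_return, w, states, actions):
    """q_unsafe[(s, a)]: estimated unsafety probability after taking a in s.
    q_return[(s, a)]: estimated return. w[s]: current Lyapunov value.
    Greedy improvement restricted to actions with unsafety <= w[s]."""
    policy = {}
    for s in states:
        ok = [a for a in actions if q_unsafe[(s, a)] <= w[s]]
        ok = ok or [min(actions, key=lambda a: q_unsafe[(s, a)])]
        policy[s] = max(ok, key=lambda a: q_return[(s, a)])
    return policy

# Toy example: in state s1 the high-return action is too unsafe to keep.
states, actions = ["s1"], ["risky", "careful"]
q_unsafe = {("s1", "risky"): 0.30, ("s1", "careful"): 0.05}
q_return = {("s1", "risky"): 9.0, ("s1", "careful"): 5.0}
print(constrained_improvement(q_unsafe, q_return, {"s1": 0.10}, states, actions))
# {'s1': 'careful'}
```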
arXiv Detail & Related papers (2020-02-24T09:20:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.