Learning to Act Safely with Limited Exposure and Almost Sure Certainty
- URL: http://arxiv.org/abs/2105.08748v1
- Date: Tue, 18 May 2021 18:05:12 GMT
- Title: Learning to Act Safely with Limited Exposure and Almost Sure Certainty
- Authors: Agustin Castellano, Hancheng Min, Juan Bazerque, Enrique Mallada
- Abstract summary: This paper aims to put forward the concept that learning to take safe actions in unknown environments, even with probability one guarantees, can be achieved without the need for an unbounded number of exploratory trials.
We first focus on the canonical multi-armed bandit problem and seek to study the intrinsic trade-offs of learning safety in the presence of uncertainty.
- Score: 1.0323063834827415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper aims to put forward the concept that learning to take safe actions
in unknown environments, even with probability one guarantees, can be achieved
without the need for an unbounded number of exploratory trials, provided that
one is willing to navigate trade-offs between optimality, level of exposure to
unsafe events, and the maximum detection time of unsafe actions. We illustrate
this concept in two complementary settings. We first focus on the canonical
multi-armed bandit problem and seek to study the intrinsic trade-offs of
learning safety in the presence of uncertainty. Under mild assumptions on
sufficient exploration, we provide an algorithm that provably detects all
unsafe machines in an (expected) finite number of rounds. The analysis also
unveils a trade-off between the number of rounds needed to secure the
environment and the probability of discarding safe machines. We then consider
the problem of finding optimal policies for a Markov Decision Process (MDP)
with almost sure constraints. We show that the (action) value function
satisfies a barrier-based decomposition which allows for the identification of
feasible policies independently of the reward process. Using this
decomposition, we develop a Barrier-learning algorithm that identifies
unsafe state-action pairs in a finite expected number of steps. Our analysis
further highlights a trade-off between the time lag for the underlying MDP
necessary to detect unsafe actions, and the level of exposure to unsafe events.
Simulations corroborate our theoretical findings, further illustrating the
aforementioned trade-offs, and suggesting that safety constraints can further
speed up the learning process.
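To make the bandit mechanism above concrete, the following sketch implements the simplest detection rule consistent with the abstract: each pull of a machine returns a reward together with a binary safety signal, and a machine is discarded once its observed violations reach a tolerance. The function name, the round-robin schedule, and the max_violations parameter are illustrative assumptions rather than the paper's actual algorithm; they only serve to make tangible the trade-off between detection time and the chance of discarding a safe machine.
```python
import random

def secure_environment(arms, rounds=1000, max_violations=1):
    """Illustrative sketch (not the paper's algorithm).

    Pull the machines still considered safe in round-robin fashion and
    discard any machine whose number of observed unsafe outcomes reaches
    `max_violations`. `arms` is a list of callables; each call simulates one
    pull and returns (reward, unsafe), with `unsafe` a binary safety signal.
    A larger `max_violations` tolerates more exposure to unsafe events but
    makes it less likely that a safe machine is discarded on a noisy signal;
    a smaller value detects unsafe machines sooner at the cost of more false
    discards -- the trade-off highlighted in the abstract.
    """
    active = set(range(len(arms)))
    violations = [0] * len(arms)
    for _ in range(rounds):
        for i in list(active):
            _reward, unsafe = arms[i]()
            violations[i] += int(unsafe)
            if violations[i] >= max_violations:
                active.discard(i)          # machine declared unsafe
    return active                          # machines still deemed safe

# Toy example: machine 0 never signals damage, machine 1 does 10% of the time.
arms = [lambda: (random.random(), False),
        lambda: (random.random(), random.random() < 0.1)]
print(secure_environment(arms))            # almost surely prints {0}
```
In this simplified sketch, a machine that emits damage signals with positive probability is discarded after a geometrically distributed number of pulls, which is why a finite expected number of rounds suffices to secure the environment.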
Related papers
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
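The definition of true criticality quoted above (the expected drop in reward after n consecutive random actions) admits a direct Monte Carlo estimate. The sketch below only illustrates that definition; the environment interface (reset_to, step, action_space.sample), the horizon, and the episode count are assumptions, and the cited paper's actual estimator may differ.
```python
def true_criticality(env, policy, state, n, episodes=100, horizon=200):
    """Monte Carlo sketch of 'true criticality' at `state`: the expected drop
    in return when the agent takes n consecutive random actions before handing
    control back to its policy.  The env interface (reset_to, step returning
    (obs, reward, done), action_space.sample) is an illustrative assumption.
    """
    def rollout(num_random):
        total, obs = 0.0, env.reset_to(state)
        for t in range(horizon):
            if t < num_random:
                action = env.action_space.sample()   # forced random deviation
            else:
                action = policy(obs)                 # nominal policy
            obs, reward, done = env.step(action)
            total += reward
            if done:
                break
        return total

    on_policy = sum(rollout(0) for _ in range(episodes)) / episodes
    deviated  = sum(rollout(n) for _ in range(episodes)) / episodes
    return on_policy - deviated      # expected reward drop = true criticality
```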
- Physics-informed RL for Maximal Safety Probability Estimation [0.8287206589886881]
We study how to estimate the long-term safety probability of maximally safe actions without sufficient coverage of samples from risky states and long-term trajectories.
The proposed method can also estimate long-term risk using short-term samples and deduce the risk of unsampled states.
arXiv Detail & Related papers (2024-03-25T03:13:56Z)
- Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis [63.532413807686524]
This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL).
We propose a new architecture that handles the trade-off between efficient progress and safety during exploration.
arXiv Detail & Related papers (2023-12-18T16:09:43Z) - Safety Margins for Reinforcement Learning [53.10194953873209]
We show how to leverage proxy criticality metrics to generate safety margins.
We evaluate our approach on learned policies from APE-X and A3C within an Atari environment.
arXiv Detail & Related papers (2023-07-25T16:49:54Z)
- Provable Safe Reinforcement Learning with Binary Feedback [62.257383728544006]
We consider the problem of provable safe RL when given access to an offline oracle providing binary feedback on the safety of state-action pairs.
We provide a novel meta algorithm, SABRE, which can be applied to any MDP setting given access to a blackbox PAC RL algorithm for that setting.
arXiv Detail & Related papers (2022-10-26T05:37:51Z)
- Safe Exploration Method for Reinforcement Learning under Existence of Disturbance [1.1470070927586016]
We deal with a safe exploration problem in reinforcement learning under the existence of disturbance.
We propose a safe exploration method that uses partial prior knowledge of a controlled object and disturbance.
We illustrate the validity and effectiveness of the proposed method through numerical simulations of an inverted pendulum and a four-bar parallel link robot manipulator.
arXiv Detail & Related papers (2022-09-30T13:00:33Z)
- Recursively Feasible Probabilistic Safe Online Learning with Control Barrier Functions [60.26921219698514]
We introduce a model-uncertainty-aware reformulation of CBF-based safety-critical controllers.
We then present the pointwise feasibility conditions of the resulting safety controller.
We use these conditions to devise an event-triggered online data collection strategy.
arXiv Detail & Related papers (2022-08-23T05:02:09Z)
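For context on the controllers reformulated in the entry above: a standard control barrier function (CBF) filter solves a small quadratic program at every state, and the entry's contribution is a model-uncertainty-aware variant together with its pointwise feasibility analysis, which is not reproduced here. The block below is only the textbook, nominal CBF-QP for control-affine dynamics dx/dt = f(x) + g(x)u and safe set {x : h(x) >= 0}.
```latex
% Textbook CBF quadratic program (nominal dynamics, no model uncertainty):
\[
\begin{aligned}
u^{*}(x) \;=\; \arg\min_{u \in \mathcal{U}} \;
  & \tfrac{1}{2}\,\lVert u - u_{\mathrm{nom}}(x) \rVert^{2} \\
\text{s.t.} \;
  & \nabla h(x)^{\top}\bigl( f(x) + g(x)\,u \bigr) \;\ge\; -\alpha\bigl( h(x) \bigr),
\end{aligned}
\]
```
Here alpha is an extended class-K function, and whether this constraint admits a solution at every state is exactly the kind of pointwise feasibility condition the entry studies, there under model uncertainty and event-triggered data collection.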
- Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach called LBSGD is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing constraint violation in policy tasks in safe reinforcement learning.
arXiv Detail & Related papers (2022-07-21T11:14:47Z)
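As background for the log-barrier idea named in the entry above, the standard logarithmic-barrier surrogate for a safe learning problem (minimize f(x) subject to g_i(x) <= 0) is shown below. LBSGD's stochastic gradient estimator and its specific step-size rule are not reproduced, so treat this as a sketch of the generic construction the method builds on.
```latex
% Generic log-barrier surrogate with barrier parameter \eta > 0:
\[
B_{\eta}(x) \;=\; f(x) \;-\; \eta \sum_{i} \log\bigl( -g_{i}(x) \bigr),
\qquad
x_{k+1} \;=\; x_{k} \;-\; \gamma_{k} \, \nabla B_{\eta}(x_{k}).
\]
```
The step size gamma_k must be small enough to keep every iterate strictly feasible (g_i(x_k) < 0), since the barrier diverges at the constraint boundary; this is the role of the "carefully chosen step size" mentioned in the summary.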
- Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-free RL [43.672794342894946]
Reward-free reinforcement learning (RF-RL) relies on random action-taking to explore the unknown environment without any reward feedback information.
It remains unclear how such a safe exploration requirement would affect the sample complexity needed to achieve the desired optimality of the obtained policy in planning.
We propose a unified Safe reWard-frEe ExploraTion (SWEET) framework, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively.
arXiv Detail & Related papers (2022-06-28T15:00:45Z)
- Fail-Safe Adversarial Generative Imitation Learning [9.594432031144716]
We propose a safety layer that enables a closed-form probability density/gradient of the safe generative continuous policy, end-to-end generative adversarial training, and worst-case safety guarantees.
The safety layer maps all actions into a set of safe actions, and uses the change-of-variables formula plus additivity of measures for the density.
In an experiment on real-world driver interaction data, we empirically demonstrate tractability, safety and imitation performance of our approach.
arXiv Detail & Related papers (2022-03-03T13:03:06Z)
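To unpack the density computation mentioned in the entry above: if the safety layer is a map T that sends an arbitrary action a with density p_a into the safe action set, its output density follows the standard change-of-variables formula on regions where T is invertible and differentiable, and the probabilities of disjoint regions mapped onto the same safe actions simply add. The expressions below state these two generic facts under those assumptions; they are not the paper's exact safety-layer construction.
```latex
% Change of variables on regions where the safety layer T is a diffeomorphism:
\[
p_{\bar a}(\bar a) \;=\; p_{a}\bigl( T^{-1}(\bar a) \bigr)\,
                         \bigl| \det \nabla T^{-1}(\bar a) \bigr|,
\qquad
% additivity of measures over the branches T_{j}^{-1} reaching a safe set S:
\Pr\bigl[ \bar a \in S \bigr] \;=\; \sum_{j} \Pr\bigl[ a \in T_{j}^{-1}(S) \bigr].
\]
```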
- Learning to be safe, in finite time [4.189643331553922]
This paper aims to put forward the concept that learning to take safe actions in unknown environments, even with probability one guarantees, can be achieved without the need for an unbounded number of exploratory trials.
We focus on the canonical multi-armed bandit problem and seek to study the exploration-preservation trade-off intrinsic to safe learning.
arXiv Detail & Related papers (2020-10-01T14:03:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.