Safe Policy Improvement Approaches and their Limitations
- URL: http://arxiv.org/abs/2208.00724v1
- Date: Mon, 1 Aug 2022 10:13:03 GMT
- Title: Safe Policy Improvement Approaches and their Limitations
- Authors: Philipp Scholl, Felix Dietrich, Clemens Otte, Steffen Udluft
- Abstract summary: We classify various Safe Policy Improvement (SPI) approaches from the literature into two groups, based on how they utilize the uncertainty of state-action pairs.
We show that their claim of being provably safe does not hold.
We develop adaptations, the Adv-Soft-SPIBB algorithms, and show that they are provably safe.
- Score: 2.596059386610301
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Safe Policy Improvement (SPI) is an important technique for offline
reinforcement learning in safety critical applications as it improves the
behavior policy with a high probability. We classify various SPI approaches
from the literature into two groups, based on how they utilize the uncertainty
of state-action pairs. Focusing on the Soft-SPIBB (Safe Policy Improvement with
Soft Baseline Bootstrapping) algorithms, we show that their claim of being
provably safe does not hold. Based on this finding, we develop adaptations, the
Adv-Soft-SPIBB algorithms, and show that they are provably safe. A heuristic
adaptation, Lower-Approx-Soft-SPIBB, yields the best performance among all
SPIBB algorithms in extensive experiments on two benchmarks. We also check the
safety guarantees of the provably safe algorithms and show that huge amounts of
data are necessary such that the safety bounds become useful in practice.
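To make the role of uncertainty concrete: Soft-SPIBB-style algorithms bound how far the learned policy may deviate from the behavior policy in each state, weighting every deviation by the estimated uncertainty of the corresponding state-action pair. The following is a minimal, simplified sketch of that idea for a single state; the function and variable names (soft_spibb_style_step, e_q, eps) are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

def soft_spibb_style_step(q, pi_b, e_q, eps):
    """One constrained greedy improvement step for a single state (sketch).

    q    : estimated action values for this state, shape (n_actions,)
    pi_b : behavior-policy probabilities, shape (n_actions,)
    e_q  : uncertainty of each state-action pair, e.g. large for pairs
           that appear rarely in the offline data, shape (n_actions,)
    eps  : budget for the uncertainty-weighted deviation from pi_b,
           i.e. sum_a e_q[a] * |pi[a] - pi_b[a]| <= eps
    """
    pi = pi_b.astype(float).copy()
    budget = float(eps)
    order = np.argsort(q)
    # Shift probability mass from low-value to high-value actions while the
    # uncertainty-weighted deviation budget allows it.
    for src in order:                    # lowest estimated value first
        for dst in order[::-1]:          # highest estimated value first
            if q[dst] <= q[src] or budget <= 0 or pi[src] <= 0:
                break
            cost = e_q[src] + e_q[dst]   # budget consumed per unit of mass
            move = pi[src] if cost <= 0 else min(pi[src], budget / cost)
            pi[src] -= move
            pi[dst] += move
            budget -= move * cost
    return pi

# Toy usage for a single state with three actions.
q    = np.array([0.2, 1.0, 0.5])
pi_b = np.array([0.6, 0.2, 0.2])
e_q  = np.array([0.5, 0.5, 0.1])   # action 1 is poorly covered by the data
pi   = soft_spibb_style_step(q, pi_b, e_q, eps=0.3)
```

In a full implementation, e_q would be derived from state-action visit counts in the offline data set and the step would be applied to every state during policy iteration on the estimated MDP.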
Related papers
- SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization [1.3597551064547502]
This study introduces a novel safe reinforcement learning algorithm, Safety Critic Policy Optimization.
In this study, we define the safety critic, a mechanism that nullifies rewards obtained through violating safety constraints.
Our theoretical analysis indicates that the proposed algorithm can automatically balance the trade-off between adhering to safety constraints and maximizing rewards.
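As a rough illustration of the reward-nullification mechanism described above, a safety-critic-style gate can zero out the task reward whenever the safety cost exceeds a tolerated limit; the exact SCPO formulation may differ, and cost_limit is an illustrative placeholder.

```python
def gated_reward(reward, cost, cost_limit=0.0):
    """Sketch of a safety-critic-style gate: the task reward collected on a
    step is nullified whenever the accumulated safety cost exceeds the
    tolerated limit, removing any incentive to profit from unsafe behavior."""
    return 0.0 if cost > cost_limit else reward
```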
arXiv Detail & Related papers (2023-11-01T22:12:50Z)
- More for Less: Safe Policy Improvement With Stronger Performance Guarantees [7.507789621505201]
The safe policy improvement (SPI) problem aims to improve, with high-confidence guarantees, upon the behavior policy from which the sample data was generated.
We present a novel approach to the SPI problem that requires less data to obtain such guarantees.
arXiv Detail & Related papers (2023-05-13T16:22:21Z)
- Safety Correction from Baseline: Towards the Risk-aware Policy in Robotics via Dual-agent Reinforcement Learning [64.11013095004786]
We propose a dual-agent safe reinforcement learning strategy consisting of a baseline and a safe agent.
Such a decoupled framework enables high flexibility, data efficiency and risk-awareness for RL-based control.
The proposed method outperforms the state-of-the-art safe RL algorithms on difficult robot locomotion and manipulation tasks.
arXiv Detail & Related papers (2022-12-14T03:11:25Z)
- Evaluating Model-free Reinforcement Learning toward Safety-critical Tasks [70.76757529955577]
This paper revisits prior work in this scope from the perspective of state-wise safe RL.
We propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection.
To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit.
arXiv Detail & Related papers (2022-12-12T06:30:17Z)
- Provable Safe Reinforcement Learning with Binary Feedback [62.257383728544006]
We consider the problem of provably safe RL when given access to an offline oracle providing binary feedback on the safety of state-action pairs.
We provide a novel meta algorithm, SABRE, which can be applied to any MDP setting given access to a blackbox PAC RL algorithm for that setting.
arXiv Detail & Related papers (2022-10-26T05:37:51Z)
- Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing constraint violations in policy optimization tasks in safe reinforcement learning.
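A minimal sketch of the log-barrier idea, assuming a constrained problem of the form minimize f(x) subject to g_i(x) <= 0; the carefully chosen step-size rule that LBSGD relies on is not reproduced here.

```python
import numpy as np

def log_barrier_objective(f, constraints, x, eta):
    """Barrier-augmented objective B_eta(x) = f(x) - eta * sum_i log(-g_i(x)).
    It is finite only while every g_i(x) < 0 and blows up as any constraint
    approaches its boundary, so gradient steps on B_eta stay strictly
    inside the feasible region; eta is typically decreased over time."""
    return f(x) - eta * sum(np.log(-g(x)) for g in constraints)

# Toy usage: minimize x^2 subject to x >= 1, i.e. g(x) = 1 - x <= 0.
f = lambda x: x ** 2
g = lambda x: 1.0 - x
value = log_barrier_objective(f, [g], x=1.5, eta=0.1)
```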
arXiv Detail & Related papers (2022-07-21T11:14:47Z)
- Safe Policy Improvement Approaches on Discrete Markov Decision Processes [2.596059386610301]
Safe Policy Improvement (SPI) aims at provable guarantees that a learned policy is at least approximately as good as a given baseline policy.
We derive a new algorithm that is provably safe on finite Markov Decision Processes (MDPs).
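For reference, the guarantee usually meant by "provably safe" in this literature can be stated as follows; this is the standard (zeta, delta) formulation, and the exact constants in the paper may differ.

```latex
% With probability at least 1 - \delta over the draw of the data set D,
% the learned policy \pi loses at most \zeta performance relative to the
% baseline (behavior) policy \pi_b in the true MDP M^*:
\Pr_{D}\!\left( \rho(\pi, M^{*}) \,\ge\, \rho(\pi_b, M^{*}) - \zeta \right) \,\ge\, 1 - \delta
```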
arXiv Detail & Related papers (2022-01-28T15:16:54Z)
- Safe Policy Optimization with Local Generalized Linear Function Approximations [17.84511819022308]
Existing safe exploration methods guarantee safety under the assumption of regularity.
We propose a novel algorithm, SPO-LF, that optimizes an agent's policy while learning the relation between locally available features obtained by sensors and environmental reward/safety.
We experimentally show that our algorithm is 1) more efficient in terms of sample complexity and computational cost and 2) more applicable to large-scale problems than previous safe RL methods with theoretical guarantees.
arXiv Detail & Related papers (2021-11-09T00:47:50Z)
- Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs [71.47895794305883]
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning setting.
We present an SPI algorithm for this RL setting that takes into account the preferences of the algorithm's user for handling the trade-offs between different reward signals.
arXiv Detail & Related papers (2021-05-31T21:04:21Z)
- Safe Reinforcement Learning in Constrained Markov Decision Processes [20.175139766171277]
We propose an algorithm, SNO-MDP, that explores and optimizes Markov decision processes under unknown safety constraints.
We provide theoretical guarantees on both the satisfaction of the safety constraint and the near-optimality of the cumulative reward.
arXiv Detail & Related papers (2020-08-15T02:20:23Z)
- Provably Safe PAC-MDP Exploration Using Analogies [87.41775218021044]
A key challenge in applying reinforcement learning to safety-critical domains is understanding how to balance exploration and safety.
We propose Analogous Safe-state Exploration (ASE), an algorithm for provably safe exploration in MDPs with unknown dynamics.
Our method exploits analogies between state-action pairs to safely learn a near-optimal policy in a PAC-MDP sense.
arXiv Detail & Related papers (2020-07-07T15:50:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.