Safe Reinforcement Learning in Constrained Markov Decision Processes
- URL: http://arxiv.org/abs/2008.06626v1
- Date: Sat, 15 Aug 2020 02:20:23 GMT
- Title: Safe Reinforcement Learning in Constrained Markov Decision Processes
- Authors: Akifumi Wachi and Yanan Sui
- Abstract summary: We propose an algorithm, SNO-MDP, that explores and optimize Markov decision processes under unknown safety constraints.
We provide theoretical guarantees on both the satisfaction of the safety constraint and the near-optimality of the cumulative reward.
- Score: 20.175139766171277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safe reinforcement learning has been a promising approach for optimizing the
policy of an agent that operates in safety-critical applications. In this
paper, we propose an algorithm, SNO-MDP, that explores and optimizes Markov
decision processes under unknown safety constraints. Specifically, we take a
stepwise approach for optimizing safety and cumulative reward. In our method,
the agent first learns safety constraints by expanding the safe region, and
then optimizes the cumulative reward in the certified safe region. We provide
theoretical guarantees on both the satisfaction of the safety constraint and
the near-optimality of the cumulative reward under proper regularity
assumptions. In our experiments, we demonstrate the effectiveness of SNO-MDP
through two experiments: one uses a synthetic data in a new, openly-available
environment named GP-SAFETY-GYM, and the other simulates Mars surface
exploration by using real observation data.
Related papers
- Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation [26.244121960815907]
Managing the trade-off between reward and safety during exploration presents a significant challenge.
In this study, we aim to address this conflicting relation by leveraging the theory of gradient manipulation.
Experimental results demonstrate that our algorithms outperform several state-of-the-art baselines in terms of balancing reward and safety optimization.
arXiv Detail & Related papers (2024-05-02T19:07:14Z) - SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization [1.3597551064547502]
This study introduces a novel safe reinforcement learning algorithm, Safety Critic Policy Optimization.
In this study, we define the safety critic, a mechanism that nullifies rewards obtained through violating safety constraints.
Our theoretical analysis indicates that the proposed algorithm can automatically balance the trade-off between adhering to safety constraints and maximizing rewards.
arXiv Detail & Related papers (2023-11-01T22:12:50Z) - Safe Exploration in Reinforcement Learning: A Generalized Formulation
and Algorithms [8.789204441461678]
We present a solution of the safe exploration (GSE) problem in the form of a meta-algorithm for safe exploration, MASE.
Our proposed algorithm achieves better performance than state-of-the-art algorithms on grid-world and Safety Gym benchmarks.
arXiv Detail & Related papers (2023-10-05T00:47:09Z) - Safe Reinforcement Learning From Pixels Using a Stochastic Latent
Representation [3.5884936187733394]
We address the problem of safe reinforcement learning from pixel observations.
We formalize the problem in a constrained, partially observable Markov decision process framework.
We employ a novel safety critic using the latent actor-critic (SLAC) approach.
arXiv Detail & Related papers (2022-10-02T19:55:42Z) - Log Barriers for Safe Black-box Optimization with Application to Safe
Reinforcement Learning [72.97229770329214]
We introduce a general approach for seeking high dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach called LBSGD is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing violation in policy tasks in safe reinforcement learning.
arXiv Detail & Related papers (2022-07-21T11:14:47Z) - Constrained Policy Optimization via Bayesian World Models [79.0077602277004]
LAMBDA is a model-based approach for policy optimization in safety critical tasks modeled via constrained Markov decision processes.
We demonstrate LAMBDA's state of the art performance on the Safety-Gym benchmark suite in terms of sample efficiency and constraint violation.
arXiv Detail & Related papers (2022-01-24T17:02:22Z) - Safe Policy Optimization with Local Generalized Linear Function
Approximations [17.84511819022308]
Existing safe exploration methods guaranteed safety under the assumption of regularity.
We propose a novel algorithm, SPO-LF, that optimize an agent's policy while learning the relation between a locally available feature obtained by sensors and environmental reward/safety.
We experimentally show that our algorithm is 1) more efficient in terms of sample complexity and computational cost and 2) more applicable to large-scale problems than previous safe RL methods with theoretical guarantees.
arXiv Detail & Related papers (2021-11-09T00:47:50Z) - Provably Safe PAC-MDP Exploration Using Analogies [87.41775218021044]
Key challenge in applying reinforcement learning to safety-critical domains is understanding how to balance exploration and safety.
We propose Analogous Safe-state Exploration (ASE), an algorithm for provably safe exploration in MDPs with unknown, dynamics.
Our method exploits analogies between state-action pairs to safely learn a near-optimal policy in a PAC-MDP sense.
arXiv Detail & Related papers (2020-07-07T15:50:50Z) - SAMBA: Safe Model-Based & Active Reinforcement Learning [59.01424351231993]
SAMBA is a framework for safe reinforcement learning that combines aspects from probabilistic modelling, information theory, and statistics.
We evaluate our algorithm on a variety of safe dynamical system benchmarks involving both low and high-dimensional state representations.
We provide intuition as to the effectiveness of the framework by a detailed analysis of our active metrics and safety constraints.
arXiv Detail & Related papers (2020-06-12T10:40:46Z) - Chance-Constrained Trajectory Optimization for Safe Exploration and
Learning of Nonlinear Systems [81.7983463275447]
Learning-based control algorithms require data collection with abundant supervision for training.
We present a new approach for optimal motion planning with safe exploration that integrates chance-constrained optimal control with dynamics learning and feedback control.
arXiv Detail & Related papers (2020-05-09T05:57:43Z) - Cautious Reinforcement Learning with Logical Constraints [78.96597639789279]
An adaptive safe padding forces Reinforcement Learning (RL) to synthesise optimal control policies while ensuring safety during the learning process.
Theoretical guarantees are available on the optimality of the synthesised policies and on the convergence of the learning algorithm.
arXiv Detail & Related papers (2020-02-26T00:01:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.