Model-based Safe Deep Reinforcement Learning via a Constrained Proximal
Policy Optimization Algorithm
- URL: http://arxiv.org/abs/2210.07573v1
- Date: Fri, 14 Oct 2022 06:53:02 GMT
- Title: Model-based Safe Deep Reinforcement Learning via a Constrained Proximal
Policy Optimization Algorithm
- Authors: Ashish Kumar Jayant, Shalabh Bhatnagar
- Abstract summary: We propose an On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner.
We show that our algorithm is more sample efficient and results in lower cumulative hazard violations as compared to constrained model-free approaches.
- Score: 4.128216503196621
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: During initial iterations of training in most Reinforcement Learning (RL)
algorithms, agents perform a significant number of random exploratory steps. In
the real world, this can limit the practicality of these algorithms as it can
lead to potentially dangerous behavior. Hence safe exploration is a critical
issue in applying RL algorithms in the real world. This problem has been
recently well studied under the Constrained Markov Decision Process (CMDP)
framework, where, in addition to single-stage rewards, an agent also receives
single-stage costs or penalties that depend on the state transitions. The
prescribed cost functions are responsible for mapping undesirable behavior at
any given time-step to a scalar value. The goal then is to find a feasible
policy that maximizes reward returns while constraining the cost returns to be
below a prescribed threshold during training as well as deployment.
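In symbols, the CMDP objective sketched above can be written (our notation; not quoted from the paper) as

  \max_{\pi} \; J_R(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\Big]
  \quad \text{subject to} \quad
  J_C(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\Big] \le d,

where r and c are the single-stage reward and cost, \gamma is the discount factor, and d is the prescribed cost threshold. The Lagrangian relaxation mentioned below turns this into the unconstrained saddle-point problem \min_{\lambda \ge 0} \max_{\pi} \; J_R(\pi) - \lambda \, (J_C(\pi) - d).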
We propose an On-policy Model-based Safe Deep RL algorithm in which we learn
the transition dynamics of the environment in an online manner as well as find
a feasible optimal policy using the Lagrangian Relaxation-based Proximal Policy
Optimization. We use an ensemble of neural networks with different
initializations to tackle epistemic and aleatoric uncertainty issues faced
during environment model learning. We compare our approach with relevant
model-free and model-based approaches in Constrained RL using the challenging
Safe Reinforcement Learning benchmark, the OpenAI Safety Gym. We demonstrate
that our algorithm is more sample efficient and results in lower cumulative
hazard violations as compared to constrained model-free approaches. Further,
our approach shows better reward performance than other constrained model-based
approaches in the literature.
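The algorithm description above suggests three concrete ingredients: an ensemble of differently initialized dynamics networks trained online, a PPO surrogate loss built on a Lagrangian advantage, and a dual update for the Lagrange multiplier. The sketch below only illustrates that pattern; the names (DynamicsEnsemble, lagrangian_ppo_loss, update_lagrange_multiplier) and all hyperparameter values are our own assumptions, not the authors' code.

import torch
import torch.nn as nn

class DynamicsEnsemble(nn.Module):
    """Ensemble of differently initialized MLPs predicting next-state deltas.
    Disagreement between members serves as a rough epistemic-uncertainty signal."""
    def __init__(self, obs_dim, act_dim, n_models=5, hidden=200):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, obs_dim),
            )
            for _ in range(n_models)
        )

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        preds = torch.stack([m(x) for m in self.members])  # (n_models, batch, obs_dim)
        return preds.mean(dim=0), preds.std(dim=0)          # prediction, disagreement

def lagrangian_ppo_loss(ratio, adv_reward, adv_cost, lam, clip_eps=0.2):
    """Clipped PPO surrogate on the Lagrangian advantage A_r - lam * A_c."""
    adv = adv_reward - lam * adv_cost
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()

def update_lagrange_multiplier(lam, mean_episode_cost, cost_limit, lr=0.05):
    """Dual ascent: raise lam when the cost return exceeds the limit, lower it otherwise."""
    return max(0.0, lam + lr * (mean_episode_cost - cost_limit))

In such a loop the ensemble would generate short synthetic rollouts for the PPO update while real environment interactions keep refining the models, which is what underlies the sample-efficiency comparison with model-free baselines.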
Related papers
- Sublinear Regret for a Class of Continuous-Time Linear--Quadratic Reinforcement Learning Problems [10.404992912881601]
We study reinforcement learning for a class of continuous-time linear-quadratic (LQ) control problems for diffusions.
We apply a model-free approach that relies neither on knowledge of model parameters nor on their estimations, and devise an actor-critic algorithm to learn the optimal policy parameter directly.
arXiv Detail & Related papers (2024-07-24T12:26:21Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, convergent (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Constrained Reinforcement Learning Under Model Mismatch [18.05296241839688]
Existing studies on constrained reinforcement learning (RL) may obtain a well-performing policy in the training environment.
However, when deployed in a real environment, the learned policy may easily violate constraints that were satisfied during training, because there may be a model mismatch between the training and real environments.
We develop a Robust Constrained Policy Optimization (RCPO) algorithm, the first algorithm that applies to large/continuous state spaces and has theoretical guarantees on worst-case reward improvement and constraint violation at each iteration during training.
arXiv Detail & Related papers (2024-05-02T14:31:52Z) - Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
We show that an optimal synthesis algorithm can provide more than a four-fold increase in the number of certifiable states.
The algorithm is able to provide more than a three-fold increase in the average guaranteed reach-avoid probability.
arXiv Detail & Related papers (2023-10-03T10:52:21Z) - Log Barriers for Safe Black-box Optimization with Application to Safe
Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing constraint violations in policy search tasks in safe reinforcement learning.
arXiv Detail & Related papers (2022-07-21T11:14:47Z) - Safe Continuous Control with Constrained Model-Based Policy Optimization [0.0]
We introduce a model-based safe exploration algorithm for constrained high-dimensional control.
We also introduce a practical algorithm that accelerates policy search with model-generated data.
arXiv Detail & Related papers (2021-04-14T15:20:55Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Constrained Markov Decision Processes via Backward Value Functions [43.649330976089004]
We model the problem of learning with constraints as a Constrained Markov Decision Process.
A key contribution of our approach is to translate cumulative cost constraints into state-based constraints.
We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training.
arXiv Detail & Related papers (2020-08-26T20:56:16Z) - Robust Reinforcement Learning using Least Squares Policy Iteration with
Provable Performance Guarantees [3.8073142980733]
This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Processes (RMDPs) with large state spaces.
We first propose the Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation.
We then propose Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy.
arXiv Detail & Related papers (2020-06-20T16:26:50Z) - MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics (a minimal sketch of this penalty appears at the end of this list).
arXiv Detail & Related papers (2020-05-27T08:46:41Z) - Guided Constrained Policy Optimization for Dynamic Quadrupedal Robot
Locomotion [78.46388769788405]
We introduce guided constrained policy optimization (GCPO), an RL framework based upon our implementation of constrained proximal policy optimization (CPPO).
We show that guided constrained RL offers faster convergence close to the desired optimum, resulting in optimal yet physically feasible robotic control behavior without the need for precise reward function tuning.
arXiv Detail & Related papers (2020-02-22T10:15:53Z)
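As referenced in the MOPO entry above, the uncertainty-penalized reward amounts to a one-line transformation of the model-predicted reward. The snippet below is our own illustrative reading, with penalty_coef a hypothetical hyperparameter rather than anything taken from MOPO's implementation.

def uncertainty_penalized_reward(predicted_reward, model_uncertainty, penalty_coef=1.0):
    """Pessimistic reward: discount model-predicted reward where the dynamics model is uncertain."""
    return predicted_reward - penalty_coef * model_uncertainty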