CUP: A Conservative Update Policy Algorithm for Safe Reinforcement
Learning
- URL: http://arxiv.org/abs/2202.07565v1
- Date: Tue, 15 Feb 2022 16:49:28 GMT
- Title: CUP: A Conservative Update Policy Algorithm for Safe Reinforcement
Learning
- Authors: Long Yang, Jiaming Ji, Juntao Dai, Yu Zhang, Pengfei Li, Gang Pan
- Abstract summary: We propose a Conservative Update Policy with a theoretical safety guarantee.
We provide a rigorous theoretical analysis to extend the surrogate functions to the generalized advantage estimator (GAE).
Experiments show the effectiveness of CUP in satisfying safety constraints.
- Score: 14.999515900425305
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Safe reinforcement learning (RL) is still very challenging since it requires
the agent to consider both return maximization and safe exploration. In this
paper, we propose CUP, a Conservative Update Policy algorithm with a
theoretical safety guarantee. We derive CUP based on the newly proposed
performance bounds and surrogate functions. Although using bounds as surrogate
functions to design safe RL algorithms has appeared in some existing works, we
develop them in at least three aspects: (i) We provide a rigorous theoretical
analysis that extends the surrogate functions to the generalized advantage
estimator (GAE). GAE significantly reduces variance empirically while
maintaining a tolerable level of bias, which is an efficient step for us in
designing CUP; (ii) The proposed bounds are tighter than those in existing
works, i.e., using the proposed bounds as surrogate functions yields better
local approximations to the objective and the safety constraints; (iii) The
proposed CUP admits a non-convex implementation via first-order optimizers and
does not depend on any convex approximation. Finally, extensive experiments
show the effectiveness of CUP, with the agent satisfying the safety
constraints. The source code of CUP is available at
https://github.com/RL-boxes/Safe-RL.
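To make the ingredients above concrete, the sketch below shows one way a GAE-based, first-order conservative update could look under a plain reading of the abstract: advantages are estimated with GAE, a clipped reward surrogate is improved, and a second first-order step reduces the cost surrogate whenever the measured episodic cost exceeds the limit. The `policy.log_prob` interface, the clipped surrogate, the two-phase structure, and all hyperparameters are illustrative assumptions rather than the paper's exact formulation; see the released code for the authors' implementation.

```python
# Minimal sketch of a GAE-based, first-order "conservative update" in the spirit
# of the abstract. The two-phase structure, clipped surrogate, policy interface,
# and hyperparameters are assumptions for illustration, not CUP's exact algorithm.
import torch


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    `values` holds one extra entry: the bootstrap value of the final state.
    """
    advantages = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def conservative_update(policy, optimizer, obs, actions, old_logp,
                        adv_reward, adv_cost, ep_cost, cost_limit, clip=0.2):
    # Phase 1 (improvement): ascend a clipped local approximation of the return.
    logp = policy.log_prob(obs, actions)            # assumed policy interface
    ratio = torch.exp(logp - old_logp)
    surr_reward = torch.min(
        ratio * adv_reward,
        torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv_reward,
    ).mean()
    optimizer.zero_grad()
    (-surr_reward).backward()
    optimizer.step()

    # Phase 2 (projection-like correction): if the measured cost violates the
    # constraint, take a first-order step that decreases the cost surrogate.
    if ep_cost > cost_limit:
        logp = policy.log_prob(obs, actions)
        ratio = torch.exp(logp - old_logp)
        cost_surr = (ratio * adv_cost).mean()
        optimizer.zero_grad()
        cost_surr.backward()
        optimizer.step()
```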
Related papers
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based settings.
arXiv Detail & Related papers (2024-05-29T22:12:52Z) - Safety Optimized Reinforcement Learning via Multi-Objective Policy
Optimization [3.425378723819911]
Safe reinforcement learning (Safe RL) refers to a class of techniques that aim to prevent RL algorithms from violating constraints.
In this paper, a novel model-free Safe RL algorithm, formulated based on the multi-objective policy optimization framework, is introduced.
arXiv Detail & Related papers (2024-02-23T08:58:38Z) - SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization [1.3597551064547502]
This study introduces a novel safe reinforcement learning algorithm, Safety Critic Policy Optimization.
In this study, we define the safety critic, a mechanism that nullifies rewards obtained through violating safety constraints.
Our theoretical analysis indicates that the proposed algorithm can automatically balance the trade-off between adhering to safety constraints and maximizing rewards.
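As a rough illustration of the reward-nullification idea, the snippet below zeroes out rewards on steps where a safety critic predicts a constraint violation before computing discounted returns; the critic interface, the probability threshold, and the function name are assumptions, not the paper's definition.

```python
# Illustrative sketch of "nullifying" rewards obtained through constraint
# violations: a safety critic's violation probability masks the reward before
# discounted returns are computed. Threshold and interface are assumptions.
import numpy as np


def nullified_returns(rewards, violation_probs, gamma=0.99, threshold=0.5):
    masked = np.where(np.asarray(violation_probs) > threshold, 0.0, rewards)
    returns = np.zeros_like(masked, dtype=float)
    running = 0.0
    for t in reversed(range(len(masked))):
        running = masked[t] + gamma * running
        returns[t] = running
    return returns
```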
arXiv Detail & Related papers (2023-11-01T22:12:50Z) - A Multiplicative Value Function for Safe and Efficient Reinforcement
Learning [131.96501469927733]
We propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic.
The safety critic predicts the probability of constraint violation and discounts the reward critic that only estimates constraint-free returns.
We evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations.
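A bare-bones reading of the multiplicative composition might look like the sketch below, where the value used for policy improvement is the reward critic's constraint-free estimate scaled by the predicted probability of remaining constraint-free; the names and the exact composition are assumptions.

```python
# Sketch of a multiplicative value: the safety critic's violation probability
# discounts the reward critic's constraint-free return estimate. Names and
# shapes are illustrative assumptions.
import torch


def multiplicative_value(reward_value: torch.Tensor,
                         violation_prob: torch.Tensor) -> torch.Tensor:
    # violation_prob is assumed to lie in [0, 1]
    return (1.0 - violation_prob) * reward_value
```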
arXiv Detail & Related papers (2023-03-07T18:29:15Z) - Constrained Update Projection Approach to Safe Policy Optimization [13.679149984354403]
We propose CUP, a novel policy optimization method based on the Constrained Update Projection framework.
CUP unifies performance bounds, providing a better understanding and interpretability for some existing algorithms.
Experiments show the effectiveness of CUP in terms of both reward and safety satisfaction.
arXiv Detail & Related papers (2022-09-15T07:01:42Z) - Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
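The sketch below writes out one plausible form of such a penalized, clipped objective: a PPO-style clipped reward surrogate plus a ReLU penalty on a first-order estimate of the constrained quantity. The penalty coefficient, the first-order cost estimate, and all names are assumptions, not the paper's exact objective.

```python
# Sketch of a penalized, clipped objective in the spirit of the summary above:
# the cost constraint becomes a ReLU penalty added to a clipped reward surrogate.
# The coefficient `kappa` and all names are illustrative assumptions.
import torch


def penalized_clipped_loss(ratio, adv_reward, adv_cost, prev_ep_cost,
                           cost_limit, clip=0.2, kappa=20.0):
    surr_reward = torch.min(
        ratio * adv_reward,
        torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv_reward,
    ).mean()
    # First-order estimate of the constrained quantity around the old policy.
    cost_estimate = prev_ep_cost + (ratio * adv_cost).mean()
    penalty = torch.relu(cost_estimate - cost_limit)
    return -surr_reward + kappa * penalty
```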
arXiv Detail & Related papers (2022-05-24T06:15:51Z) - Safe Policy Optimization with Local Generalized Linear Function
Approximations [17.84511819022308]
Existing safe exploration methods guarantee safety under the assumption of regularity.
We propose a novel algorithm, SPO-LF, that optimizes an agent's policy while learning the relation between locally available features obtained by sensors and the environmental reward/safety.
We experimentally show that our algorithm is 1) more efficient in terms of sample complexity and computational cost and 2) more applicable to large-scale problems than previous safe RL methods with theoretical guarantees.
arXiv Detail & Related papers (2021-11-09T00:47:50Z) - Model-Based Actor-Critic with Chance Constraint for Stochastic System [6.600423613245076]
We propose a model-based chance constrained actor-critic (CCAC) algorithm which can efficiently learn a safe and non-conservative policy.
CCAC directly solves the original chance constrained problem, where the objective function and the safe probability are simultaneously optimized with adaptive weights.
arXiv Detail & Related papers (2020-12-19T15:46:50Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward while avoiding violations of certain constraints.
This is the first analysis of SRL algorithms with global optimality guarantees.
arXiv Detail & Related papers (2020-11-11T16:05:14Z) - Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation back-ups, making the update more conservative, can yield much stronger guarantees.
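One way to picture such a conservative backup is to exclude poorly supported actions from the bootstrap, as in the sketch below; the count threshold and the pessimistic floor are illustrative assumptions rather than the paper's construction.

```python
# Sketch of a conservative Bellman optimality backup: actions rarely seen in the
# batch are replaced with a pessimistic floor before taking the max. Threshold
# and floor value are illustrative assumptions.
import numpy as np


def conservative_backup(reward, q_next, counts_next, gamma=0.99,
                        min_count=10, v_floor=0.0):
    supported = np.asarray(counts_next) >= min_count
    backed_up = np.where(supported, q_next, v_floor)  # pessimism on rare actions
    return reward + gamma * backed_up.max()
```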
arXiv Detail & Related papers (2020-07-16T09:25:54Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
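For intuition, the generic primal-dual pattern behind such methods alternates a policy (primal) step on reward minus a multiplier-weighted cost with a dual step that grows the multiplier when the cost estimate exceeds the budget. The tiny sketch below shows only the generic dual step, with the step size as an assumption; the paper's specific algorithm adds exploration bonuses and function approximation on top of this pattern.

```python
# Generic dual (Lagrange multiplier) update used in primal-dual safe RL: the
# multiplier rises when the estimated cost exceeds the budget and is clipped
# at zero. Step size `lr` is an illustrative assumption.
def dual_update(lmbda: float, est_cost: float, cost_limit: float,
                lr: float = 0.05) -> float:
    return max(0.0, lmbda + lr * (est_cost - cost_limit))
```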
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.