Constrained Update Projection Approach to Safe Policy Optimization
- URL: http://arxiv.org/abs/2209.07089v1
- Date: Thu, 15 Sep 2022 07:01:42 GMT
- Title: Constrained Update Projection Approach to Safe Policy Optimization
- Authors: Long Yang, Jiaming Ji, Juntao Dai, Linrui Zhang, Binbin Zhou, Pengfei
Li, Yaodong Yang, Gang Pan
- Abstract summary: We propose CUP, a novel policy optimization method based on the Constrained Update Projection framework.
CUP unifies performance bounds, providing better understanding and interpretability of some existing algorithms.
Experiments show the effectiveness of CUP in terms of both reward and safety constraint satisfaction.
- Score: 13.679149984354403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Safe reinforcement learning (RL) studies problems where an intelligent agent
must not only maximize reward but also avoid exploring unsafe regions. In this
study, we propose CUP, a novel policy optimization method based on the
Constrained Update Projection framework, which enjoys a rigorous safety
guarantee. Central to our development of CUP are the newly proposed surrogate
functions along with their performance bounds. Compared to previous safe RL
methods, CUP offers three benefits: 1) it generalizes the surrogate functions to
the generalized advantage estimator (GAE), leading to strong empirical
performance; 2) it unifies performance bounds, providing better understanding
and interpretability of some existing algorithms; 3) it admits a non-convex
implementation using only first-order optimizers, which does not require any
strong approximation of the convexity of the objectives. To validate CUP, we
compared it against a comprehensive list of safe RL baselines on a wide range
of tasks. Experiments show the effectiveness of CUP in terms of both reward and
safety constraint satisfaction. We have open-sourced CUP at
https://github.com/RL-boxes/Safe-RL/tree/main/CUP.
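The update described above has two stages: a policy improvement step on a GAE-based
reward surrogate, followed by a projection step that pulls the intermediate policy back
toward the cost-feasible region using only first-order optimization. The sketch below
illustrates that structure; the policy interface (log_prob, kl), the penalty form of the
projection, and all hyperparameters are illustrative assumptions, not the released
implementation (see the repository linked above for the authors' code).

```python
# Minimal sketch of a CUP-style two-step update, under assumed interfaces.
import copy
import torch


def gae(deltas, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory of TD residuals."""
    adv = torch.zeros_like(deltas)
    running = torch.zeros(())
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv


def cup_update(policy, old_policy, obs, act, r_adv, c_adv,
               ep_cost, cost_limit, kl_coef=1.0, lr=3e-4, iters=(80, 80)):
    """policy/old_policy are assumed to expose log_prob(obs, act) and kl(other, obs)."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)

    # Step 1: policy improvement on the KL-regularized reward surrogate.
    for _ in range(iters[0]):
        ratio = torch.exp(policy.log_prob(obs, act) - old_policy.log_prob(obs, act).detach())
        loss = -(ratio * r_adv).mean() + kl_coef * policy.kl(old_policy, obs).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    intermediate = copy.deepcopy(policy)  # improved but possibly infeasible policy

    # Step 2: projection -- stay close (in KL) to the intermediate policy while
    # pushing the cost surrogate down whenever the cost constraint is violated.
    penalty = max(0.0, ep_cost - cost_limit)
    for _ in range(iters[1]):
        ratio = torch.exp(policy.log_prob(obs, act) - old_policy.log_prob(obs, act).detach())
        loss = policy.kl(intermediate, obs).mean() + penalty * (ratio * c_adv).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```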
Related papers
- Practical and Robust Safety Guarantees for Advanced Counterfactual Learning to Rank [64.44255178199846]
We generalize the existing safe CLTR approach to make it applicable to state-of-the-art doubly robust CLTR.
We also propose a novel approach, proximal ranking policy optimization (PRPO), which provides safety in deployment without assumptions about user behavior.
PRPO is the first method with unconditional safety in deployment that translates to robust safety for real-world applications.
arXiv Detail & Related papers (2024-07-29T12:23:59Z) - One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based settings.
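A minimal sketch of the dualization idea in assumed notation (a generic reward R, a safety score S with threshold b; not the paper's exact formulation): the constrained problem and its Lagrangian dual read
\[
\max_{\theta}\; R(\theta)\ \ \text{s.t.}\ \ S(\theta) \ge b
\qquad\Longrightarrow\qquad
g(\lambda) \;=\; \max_{\theta}\; R(\theta) + \lambda\bigl(S(\theta) - b\bigr),\quad \lambda \ge 0 .
\]
If g is smooth, convex, and available in closed form, the optimal multiplier \(\lambda^* = \arg\min_{\lambda \ge 0} g(\lambda)\) can be pre-computed, and a single unconstrained alignment run with the mixed objective \(R + \lambda^* S\) then recovers the constrained solution, which is the "one-shot" reduction summarized above.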
arXiv Detail & Related papers (2024-05-29T22:12:52Z) - Safety Optimized Reinforcement Learning via Multi-Objective Policy
Optimization [3.425378723819911]
Safe reinforcement learning (Safe RL) refers to a class of techniques that aim to prevent RL algorithms from violating constraints.
In this paper, a novel model-free Safe RL algorithm, formulated within the multi-objective policy optimization framework, is introduced.
arXiv Detail & Related papers (2024-02-23T08:58:38Z) - Adversarially Trained Weighted Actor-Critic for Safe Offline Reinforcement Learning [9.94248417157713]
We propose WSAC, a novel algorithm for Safe Offline Reinforcement Learning (RL) under functional approximation.
WSAC is designed as a two-player Stackelberg game to optimize a refined objective function.
arXiv Detail & Related papers (2024-01-01T01:44:58Z) - Approximate Model-Based Shielding for Safe Reinforcement Learning [83.55437924143615]
We propose a principled look-ahead shielding algorithm for verifying the performance of learned RL policies.
Our algorithm differs from other shielding approaches in that it does not require prior knowledge of the safety-relevant dynamics of the system.
We demonstrate superior performance to other safety-aware approaches on a set of Atari games with state-dependent safety-labels.
arXiv Detail & Related papers (2023-07-27T15:19:45Z) - A Multiplicative Value Function for Safe and Efficient Reinforcement
Learning [131.96501469927733]
We propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic.
The safety critic predicts the probability of constraint violation and discounts the reward critic that only estimates constraint-free returns.
We evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations.
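The summary above describes a value function that is the product of a safety term and a reward term. A minimal sketch of such a critic (layer sizes and the sigmoid head are illustrative assumptions, not the paper's architecture):

```python
# Hedged sketch of a multiplicative value function: a safety critic estimating
# the probability of remaining constraint-free scales a reward critic that only
# predicts constraint-free returns.
import torch
import torch.nn as nn


class MultiplicativeCritic(nn.Module):
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.reward_critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.safety_critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, obs):
        v_reward = self.reward_critic(obs)                # constraint-free return estimate
        p_safe = torch.sigmoid(self.safety_critic(obs))   # prob. of no future violation
        return p_safe * v_reward                          # safety-discounted value
```

Training targets for the two heads (a discounted constraint-free return for the reward critic, a violation indicator for the safety critic) would be supplied by the surrounding RL loop.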
arXiv Detail & Related papers (2023-03-07T18:29:15Z) - CUP: A Conservative Update Policy Algorithm for Safe Reinforcement
Learning [14.999515900425305]
We propose a Conservative Update Policy with a theoretical safety guarantee.
We provide rigorous theoretical analysis to extend the surrogate functions to generalized advantage estimation (GAE).
Experiments show the effectiveness of CUP in designing safe constraints.
arXiv Detail & Related papers (2022-02-15T16:49:28Z) - Safe Policy Optimization with Local Generalized Linear Function
Approximations [17.84511819022308]
Existing safe exploration methods guarantee safety under regularity assumptions.
We propose a novel algorithm, SPO-LF, that optimizes an agent's policy while learning the relation between locally available features obtained by sensors and the environmental reward/safety.
We experimentally show that our algorithm is 1) more efficient in terms of sample complexity and computational cost and 2) more applicable to large-scale problems than previous safe RL methods with theoretical guarantees.
arXiv Detail & Related papers (2021-11-09T00:47:50Z) - Provably Efficient Algorithms for Multi-Objective Competitive RL [54.22598924633369]
We study multi-objective reinforcement learning (RL) where an agent's reward is represented as a vector.
In settings where an agent competes against opponents, its performance is measured by the distance of its average return vector to a target set.
We develop statistically and computationally efficient algorithms to approach the associated target set.
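In assumed notation, with vector-valued per-step reward \(\mathbf{r}_t\) and target set \(C\), the performance measure described above is
\[
\operatorname{dist}\Bigl(\mathbb{E}_{\pi}\Bigl[\sum_{t} \mathbf{r}_t\Bigr],\; C\Bigr)
= \min_{c \in C}\, \Bigl\| \mathbb{E}_{\pi}\Bigl[\sum_{t} \mathbf{r}_t\Bigr] - c \Bigr\|,
\]
which the proposed algorithms drive toward zero while competing against opponents.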
arXiv Detail & Related papers (2021-02-05T14:26:00Z) - Model-Based Actor-Critic with Chance Constraint for Stochastic System [6.600423613245076]
We propose a model-based chance constrained actor-critic (CCAC) algorithm which can efficiently learn a safe and non-conservative policy.
CCAC directly solves the original chance constrained problem, where the objective function and the safe probability are simultaneously optimized with adaptive weights.
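A sketch of a chance-constrained formulation in assumed notation (return \(J(\theta)\), safe set \(\mathcal{S}_{\text{safe}}\), horizon \(T\), tolerance \(\delta\); details differ from the paper):
\[
\max_{\theta}\; J(\theta)\quad \text{s.t.}\quad
\Pr\bigl(s_t \in \mathcal{S}_{\text{safe}},\ \forall\, t \le T\bigr) \ge 1 - \delta ,
\]
which could be optimized directly through a weighted objective \(J(\theta) + w_k\bigl(\Pr_{\text{safe}}(\theta) - (1-\delta)\bigr)\), with the weight \(w_k\) adapted during training rather than fixed in advance.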
arXiv Detail & Related papers (2020-12-19T15:46:50Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward while avoiding violations of certain constraints.
This is the first analysis of SRL algorithms with globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)