Constrained Variational Policy Optimization for Safe Reinforcement
Learning
- URL: http://arxiv.org/abs/2201.11927v1
- Date: Fri, 28 Jan 2022 04:24:09 GMT
- Title: Constrained Variational Policy Optimization for Safe Reinforcement
Learning
- Authors: Zuxin Liu, Zhepeng Cen, Vladislav Isenbaev, Wei Liu, Zhiwei Steven Wu,
Bo Li, Ding Zhao
- Abstract summary: Safe reinforcement learning aims to learn policies that satisfy certain constraints before deploying to safety-critical applications.
Primal-dual, as a prevalent constrained optimization framework, suffers from instability issues and lacks optimality guarantees.
This paper overcomes these issues from a novel probabilistic inference perspective and proposes an Expectation-Maximization style approach to learn a safe policy.
- Score: 40.38842532850959
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safe reinforcement learning (RL) aims to learn policies that satisfy certain
constraints before deploying to safety-critical applications. Primal-dual as a
prevalent constrained optimization framework suffers from instability issues
and lacks optimality guarantees. This paper overcomes the issues from a novel
probabilistic inference perspective and proposes an Expectation-Maximization
style approach to learn a safe policy. We show that the safe RL problem can be
decomposed into 1) a convex optimization phase with a non-parametric variational
distribution and 2) a supervised learning phase. We show the unique advantages
of constrained variational policy optimization by proving its optimality and
policy improvement stability. A wide range of experiments on continuous robotic
tasks show that the proposed method achieves significantly better performance
in terms of constraint satisfaction and sample efficiency than primal-dual
baselines.
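The two-phase decomposition described in the abstract can be illustrated with a minimal single-state sketch. The abstract gives no formulas, so everything specific here is an assumption made for illustration: the KL trust region around the old policy, the exponentially tilted form of the variational distribution, the temperature `eta` and multiplier `lam`, the bisection solver, and all toy numbers. It is not the authors' implementation.

```python
import math

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def tilted(prior, q_r, q_c, eta, lam):
    """q_i proportional to prior_i * exp((q_r_i - lam * q_c_i) / eta) (assumed form)."""
    return softmax([math.log(p) + (r - lam * c) / eta
                    for p, r, c in zip(prior, q_r, q_c)])

def kl(q, prior):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior) if qi > 0.0)

def expect(q, values):
    return sum(qi * v for qi, v in zip(q, values))

def solve_eta(prior, q_r, q_c, lam, kl_limit, lo=1e-3, hi=1e4, iters=80):
    """Bisection on the temperature: KL(q || prior) shrinks as eta grows."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(tilted(prior, q_r, q_c, mid, lam), prior) <= kl_limit:
            hi = mid  # inside the trust region, try a sharper distribution
        else:
            lo = mid
    return hi

def e_step(prior, q_r, q_c, cost_limit, kl_limit, lam_max=100.0, iters=60):
    """Phase 1: non-parametric distribution over sampled actions.

    Bisection on the multiplier keeps the side that meets the cost limit.
    Assumes the problem is feasible for some lam <= lam_max.
    """
    def q_at(lam):
        eta = solve_eta(prior, q_r, q_c, lam, kl_limit)
        return tilted(prior, q_r, q_c, eta, lam)

    q = q_at(0.0)
    if expect(q, q_c) <= cost_limit:
        return q, 0.0  # cost constraint inactive
    lo, hi, q_hi = 0.0, lam_max, q_at(lam_max)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        q_mid = q_at(mid)
        if expect(q_mid, q_c) <= cost_limit:
            hi, q_hi = mid, q_mid
        else:
            lo = mid
    return q_hi, hi

def m_step(target, steps=500, lr=1.0):
    """Phase 2: supervised fit of softmax logits to the target by cross-entropy."""
    logits = [0.0] * len(target)
    for _ in range(steps):
        probs = softmax(logits)
        # Gradient of cross-entropy w.r.t. the logits is (probs - target).
        logits = [t - lr * (p - q) for t, p, q in zip(logits, probs, target)]
    return softmax(logits)

# Toy problem (hypothetical values): four sampled actions, uniform old policy.
prior = [0.25, 0.25, 0.25, 0.25]
q_r = [1.0, 2.0, 3.0, 4.0]   # reward value estimates
q_c = [0.0, 0.1, 0.5, 2.0]   # cost value estimates
cost_limit, kl_limit = 0.5, 0.1

q, lam = e_step(prior, q_r, q_c, cost_limit, kl_limit)
pi_new = m_step(q)
```

In this toy setting the old policy exceeds the cost limit, so the first phase returns a positive multiplier and a distribution that stays within both the cost limit and the KL bound; the second phase then fits a parametric policy to that distribution, which is the sense in which it is a supervised learning problem.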
Related papers
- Optimization Solution Functions as Deterministic Policies for Offline Reinforcement Learning [7.07623669995408]
We propose an implicit actor-critic (iAC) framework that employs optimization solution functions as a deterministic policy (actor) and a monotone function over the optimal value of optimization as a critic.
We show that the learned policies are robust to the suboptimality of the learned actor parameters via the exponentially decaying sensitivity (EDS) property.
We validate the proposed framework on two real-world applications and show a significant improvement over state-of-the-art (SOTA) offline RL methods.
arXiv Detail & Related papers (2024-08-27T19:04:32Z)
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based settings.
arXiv Detail & Related papers (2024-05-29T22:12:52Z)
- SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization [1.3597551064547502]
This study introduces a novel safe reinforcement learning algorithm, Safety Critic Policy Optimization.
In this study, we define the safety critic, a mechanism that nullifies rewards obtained through violating safety constraints.
Our theoretical analysis indicates that the proposed algorithm can automatically balance the trade-off between adhering to safety constraints and maximizing rewards.
arXiv Detail & Related papers (2023-11-01T22:12:50Z)
- Constraint-Conditioned Policy Optimization for Versatile Safe Reinforcement Learning [33.988698754176646]
We introduce the Conditioned Constrained Policy Optimization (CCPO) framework, consisting of two key modules.
Our experiments demonstrate that CCPO outperforms the baselines in terms of safety and task performance.
This makes our approach suitable for real-world dynamic applications.
arXiv Detail & Related papers (2023-10-05T17:39:02Z)
- Iterative Reachability Estimation for Safe Reinforcement Learning [23.942701020636882]
We propose a new framework, Reachability Estimation for Safe Policy Optimization (RESPO), for safety-constrained reinforcement learning (RL) environments.
In the feasible set where there exist violation-free policies, we optimize for rewards while maintaining persistent safety.
We evaluate the proposed methods on a diverse suite of safe RL environments from Safety Gym, PyBullet, and MuJoCo.
arXiv Detail & Related papers (2023-09-24T02:36:42Z)
- Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is free of any explicit trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z)
- Penalized Proximal Policy Optimization for Safe Reinforcement Learning [68.86485583981866]
We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
arXiv Detail & Related papers (2022-05-24T06:15:51Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This is the first analysis of SRL algorithms with globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
- Chance Constrained Policy Optimization for Process Control and Optimization [1.4908563154226955]
Chemical process optimization and control are affected by 1) plant-model mismatch, 2) process disturbances, and 3) constraints for safe operation.
We propose a chance constrained policy optimization algorithm which guarantees the satisfaction of joint chance constraints with a high probability.
arXiv Detail & Related papers (2020-07-30T14:20:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.