Penalized Proximal Policy Optimization for Safe Reinforcement Learning
- URL: http://arxiv.org/abs/2205.11814v1
- Date: Tue, 24 May 2022 06:15:51 GMT
- Title: Penalized Proximal Policy Optimization for Safe Reinforcement Learning
- Authors: Linrui zhang, Li Shen, Long Yang, Shixiang Chen, Bo Yuan, Xueqian
Wang, Dacheng Tao
- Abstract summary: We propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem.
P3O utilizes a simple-yet-effective penalty function to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective.
We show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
- Score: 68.86485583981866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Safe reinforcement learning aims to learn the optimal policy while satisfying
safety constraints, which is essential in real-world applications. However,
current algorithms still struggle for efficient policy updates with hard
constraint satisfaction. In this paper, we propose Penalized Proximal Policy
Optimization (P3O), which solves the cumbersome constrained policy iteration
via a single minimization of an equivalent unconstrained problem. Specifically,
P3O utilizes a simple-yet-effective penalty function to eliminate cost
constraints and removes the trust-region constraint by the clipped surrogate
objective. We theoretically prove the exactness of the proposed method with a
finite penalty factor and provide a worst-case analysis for approximate error
when evaluated on sample trajectories. Moreover, we extend P3O to more
challenging multi-constraint and multi-agent scenarios which are less studied
in previous work. Extensive experiments show that P3O outperforms
state-of-the-art algorithms with respect to both reward improvement and
constraint satisfaction on a set of constrained locomotive tasks.
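To make the construction concrete, below is a minimal sketch of such a penalized clipped objective, assuming PyTorch and per-sample importance ratios with advantage estimates; the cost-side clipping direction, the default penalty factor, and all names are illustrative assumptions, not the paper's reference implementation.

    import torch

    def p3o_loss(ratio, adv_r, adv_c, jc, d, kappa=20.0, eps=0.2):
        """Penalized clipped objective: a single unconstrained minimization.

        ratio : pi_theta(a|s) / pi_old(a|s) on sampled state-actions, shape (B,)
        adv_r : reward advantage estimates, shape (B,)
        adv_c : cost advantage estimates, shape (B,)
        jc    : estimated cost return J_C of the current policy (float)
        d     : cost budget (float)
        """
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        # Clipped surrogate for the reward (the standard PPO term);
        # it is maximized, hence negated in the returned loss.
        surr_r = torch.min(ratio * adv_r, clipped * adv_r).mean()
        # Clipped surrogate for the cost, taken pessimistically (elementwise max).
        surr_c = torch.max(ratio * adv_c, clipped * adv_c).mean()
        # ReLU penalty replacing the cost constraint: active only when the
        # predicted cost return exceeds the budget.
        penalty = torch.relu(surr_c + (jc - d))
        return -surr_r + kappa * penalty

With a large enough (finite) kappa, the minimizer of this loss coincides with the constrained solution, which is the exactness property the abstract refers to.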
Related papers
- Adversarial Constrained Policy Optimization: Improving Constrained Reinforcement Learning by Adapting Budgets [6.5472155063246085]
Constrained reinforcement learning has achieved promising progress in safety-critical fields where both rewards and constraints are considered.
We propose Adversarial Constrained Policy Optimization (ACPO), which enables simultaneous optimization of the reward and adaptation of cost budgets during training.
arXiv Detail & Related papers (2024-10-28T07:04:32Z)
- Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints [52.37099916582462]
In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints.
We propose a theoretically guaranteed penalty function method, Exterior Penalty Policy Optimization (EPO), with adaptive penalties generated by a Penalty Metric Network (PMN).
PMN responds appropriately to varying degrees of constraint violations, enabling efficient constraint satisfaction and safe exploration.
arXiv Detail & Related papers (2024-07-22T10:57:32Z)
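The PMN itself is not described in this summary, but the exterior-penalty construction it modulates is classical: the penalty term is zero on the feasible set and grows with the violation outside it. A minimal sketch, with a quadratic penalty chosen purely for illustration:

    def exterior_penalty_objective(reward_obj, cost_value, budget, rho):
        """Generic exterior penalty for a single cost constraint J_C(pi) <= d.

        reward_obj : scalar surrogate objective to be maximized
        cost_value : estimated cost return J_C(pi)
        budget     : cost limit d
        rho        : penalty coefficient; in EPO this is produced adaptively
                     by the Penalty Metric Network, here it is just a number
        """
        violation = max(0.0, cost_value - budget)  # zero inside the feasible set
        return -reward_obj + rho * violation ** 2  # minimize this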
- Resilient Constrained Reinforcement Learning [87.4374430686956]
We study a class of constrained reinforcement learning (RL) problems in which multiple constraint specifications are not identified before training.
Identifying appropriate constraint specifications is challenging because the trade-off between the reward objective and constraint satisfaction is not known in advance.
We propose a new constrained RL approach that searches for policy and constraint specifications together.
arXiv Detail & Related papers (2023-12-28T18:28:23Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We study the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
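As background for "optimizing in the space of stationary distributions": DICE-style methods start from a linear program over the discounted occupancy measure d(s,a), sketched below in assumed standard notation (initial distribution p_0, transition kernel P, discount gamma, cost limit c-hat); this formulation is generic background, not quoted from the paper.

    \max_{d \ge 0} \; \sum_{s,a} d(s,a)\, r(s,a)
    \quad \text{s.t.} \quad \sum_{s,a} d(s,a)\, c(s,a) \le \hat{c},
    \qquad \sum_{a} d(s,a) = (1-\gamma)\, p_0(s)
        + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s.

COptiDICE works with correction ratios w(s,a) = d(s,a) / d_D(s,a) relative to the dataset occupancy d_D, which is what lets the method remain entirely offline.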
- Shortest-Path Constrained Reinforcement Learning for Sparse Reward Tasks [59.419152768018506]
We show that any optimal policy necessarily satisfies the k-shortest-path (k-SP) constraint.
We propose a novel cost function that penalizes a policy for violating the SP constraint instead of excluding it outright.
Our experiments on MiniGrid, DeepMind Lab, Atari, and Fetch show that the proposed method significantly improves proximal policy optimization (PPO).
arXiv Detail & Related papers (2021-07-13T21:39:21Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning in the average-reward setting with variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
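The summary does not spell out the variance-specific Fenchel dual update, so below is only the generic primal-dual skeleton such constrained actor-critic methods share: ascend in the policy parameters, descend in the multiplier, keep the multiplier nonnegative. All names and step sizes are illustrative.

    def primal_dual_step(theta, lam, grad_r, grad_c, cost_est, budget,
                         lr_theta=3e-4, lr_lam=1e-2):
        """One update on the Lagrangian
        L(theta, lam) = J_R(theta) - lam * (J_C(theta) - budget).

        theta          : policy parameters (array or float)
        lam            : Lagrange multiplier, kept >= 0
        grad_r, grad_c : gradient estimates of J_R and J_C at theta
        cost_est       : current estimate of J_C(theta)
        """
        # Primal step: gradient ascent on the Lagrangian in theta.
        theta = theta + lr_theta * (grad_r - lam * grad_c)
        # Dual step: gradient descent in lam, projected onto lam >= 0;
        # lam grows while the cost estimate exceeds the budget.
        lam = max(0.0, lam + lr_lam * (cost_est - budget))
        return theta, lam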
- Projection-Based Constrained Policy Optimization [34.555500347840805]
We propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO).
PCPO achieves more than 3.5 times fewer constraint violations and around 15% higher reward than state-of-the-art methods.
arXiv Detail & Related papers (2020-10-07T04:22:45Z)
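PCPO's two-step scheme can be sketched with its L2-metric variant: take an unconstrained reward-improvement step, then project back onto the linearized constraint set {theta : a @ (theta - theta_k) + b <= 0}. PCPO also has a KL-metric projection; all names below are illustrative.

    import numpy as np

    def pcpo_style_update(theta_k, step, a, b):
        """Reward-improvement step followed by an L2 projection onto the
        halfspace defined by the linearized cost constraint.

        theta_k : current policy parameters, shape (n,)
        step    : unconstrained reward-improvement step (e.g., a TRPO/PPO step)
        a       : gradient of the cost constraint at theta_k, shape (n,)
        b       : constraint slack at theta_k, i.e. J_C(theta_k) - d
        """
        theta = theta_k + step                   # 1) improve the reward
        violation = a @ (theta - theta_k) + b    # linearized constraint value
        if violation > 0.0:                      # 2) project back if infeasible
            theta = theta - (violation / (a @ a)) * a
        return theta

    # Toy usage: the projected step lands exactly on the constraint boundary.
    theta = pcpo_style_update(np.zeros(3), np.array([0.1, 0.0, 0.0]),
                              np.array([1.0, 0.0, 0.0]), b=0.05)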
- FISAR: Forward Invariant Safe Reinforcement Learning with a Deep Neural Network-Based Optimizer [44.65622657676026]
We take constraints as Lyapunov functions and impose new linear constraints on the policy parameters' updating dynamics.
Because the new guaranteed-feasible constraints are imposed on the updating dynamics instead of the original policy parameters, classic optimization algorithms are no longer applicable.
arXiv Detail & Related papers (2020-06-19T21:58:42Z)