Constrained Proximal Policy Optimization
- URL: http://arxiv.org/abs/2305.14216v1
- Date: Tue, 23 May 2023 16:33:55 GMT
- Title: Constrained Proximal Policy Optimization
- Authors: Chengbin Xuan, Feng Zhang, Faliang Yin, Hak-Keung Lam
- Abstract summary: We propose a novel first-order feasible method named Constrained Proximal Policy Optimization (CPPO).
Our approach integrates the Expectation-Maximization framework to solve the constrained RL problem through two steps: 1) calculating the optimal policy distribution within the feasible region (E-step), and 2) conducting a first-order update to adjust the current policy towards the optimal policy obtained in the E-step (M-step).
Empirical evaluations conducted in complex and uncertain environments validate the effectiveness of our proposed method.
- Score: 36.20839673950677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The problem of constrained reinforcement learning (CRL) holds significant
importance as it provides a framework for addressing critical safety
satisfaction concerns in the field of reinforcement learning (RL). However,
with the introduction of constraint satisfaction, the current CRL methods
necessitate the utilization of second-order optimization or primal-dual
frameworks with additional Lagrangian multipliers, resulting in increased
complexity and inefficiency during implementation. To address these issues, we
propose a novel first-order feasible method named Constrained Proximal Policy
Optimization (CPPO). By treating the CRL problem as a probabilistic inference
problem, our approach integrates the Expectation-Maximization framework to
solve it through two steps: 1) calculating the optimal policy distribution
within the feasible region (E-step), and 2) conducting a first-order update to
adjust the current policy towards the optimal policy obtained in the E-step
(M-step). We establish the relationship between the probability ratios and KL
divergence to convert the E-step into a convex optimization problem.
Furthermore, we develop an iterative heuristic algorithm from a geometric
perspective to solve this problem. Additionally, we introduce a conservative
update mechanism to overcome the constraint violation issue that occurs in the
existing feasible region method. Empirical evaluations conducted in complex and
uncertain environments validate the effectiveness of our proposed method, as it
performs at least as well as other baselines.
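The E-step/M-step loop described in the abstract can be illustrated on a toy discrete policy. This is a minimal sketch, not the paper's algorithm: the hard feasibility cutoff, the exponentiated-advantage reweighting, and the convex-combination step are all simplifying assumptions standing in for the convex E-step and the first-order M-step.

```python
import math

def em_policy_update(pi, reward_adv, cost_adv, cost_limit,
                     temperature=1.0, step_size=0.5):
    """One EM-style update on a discrete policy over a fixed action set."""
    # E-step: reweight the current policy by the exponentiated reward
    # advantage, keeping only actions whose cost advantage respects the
    # limit (a crude stand-in for "optimal distribution within the
    # feasible region").
    weights = [p * math.exp(r / temperature) if c <= cost_limit else 0.0
               for p, r, c in zip(pi, reward_adv, cost_adv)]
    total = sum(weights)
    q = [w / total for w in weights] if total > 0 else list(pi)

    # M-step: a first-order move of the policy toward the E-step target
    # (a convex combination stands in for a gradient step on KL(pi, q)).
    new_pi = [(1 - step_size) * p + step_size * qi for p, qi in zip(pi, q)]
    z = sum(new_pi)
    return [p / z for p in new_pi]
```

On a three-action example, mass shifts away from an infeasible action and toward the feasible action with the higher reward advantage, while the policy stays a valid distribution.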
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning [132.7040981721302]
We study the constrained convex Markov decision process (MDP), where the goal is to minimize a convex functional of the visitation measure.
Designing algorithms for a constrained convex MDP faces several challenges, including handling the large state space.
arXiv Detail & Related papers (2024-02-16T16:35:18Z)
- Fast Policy Learning for Linear Quadratic Control with Entropy Regularization [10.771650397337366]
This paper proposes and analyzes two new policy learning methods: regularized policy gradient (RPG) and iterative policy optimization (IPO), for a class of discounted linear-quadratic control (LQC) problems.
Assuming access to exact policy evaluation, both proposed approaches are proven to converge linearly in finding optimal policies of the regularized LQC.
arXiv Detail & Related papers (2023-11-23T19:08:39Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
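A single-time-scale primal-dual update of the kind this summary describes can be sketched in a few lines. This is a generic textbook-style Lagrangian step, not the cited paper's algorithm; the scalar parameter `theta` and the hand-supplied gradients are illustrative assumptions.

```python
def primal_dual_step(theta, lam, reward_grad, cost_grad,
                     cost_value, cost_limit, lr_theta=0.1, lr_lam=0.1):
    """One joint update of a policy parameter and a Lagrange multiplier."""
    # Primal ascent on the Lagrangian
    #   L(theta, lam) = J_r(theta) - lam * (J_c(theta) - cost_limit).
    theta_next = theta + lr_theta * (reward_grad - lam * cost_grad)
    # Dual ascent on lam, projected onto lam >= 0. "Single time scale"
    # means both steps run together; neither waits for the other to converge.
    lam_next = max(0.0, lam + lr_lam * (cost_value - cost_limit))
    return theta_next, lam_next
```

The multiplier rises while the constraint is violated (penalizing cost in the primal step) and decays back toward zero once the cost falls below the limit.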
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of the stationary distribution.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
- Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process [56.55075925645864]
The problem of the constrained Markov decision process (CMDP) is investigated, where an agent aims to maximize the expected accumulated discounted reward subject to multiple constraints.
A new utilities-dual convex approach is proposed with novel integration of three ingredients: regularized policy, dual regularizer, and Nesterov's gradient descent dual.
This is the first demonstration that nonconcave CMDP problems can attain the complexity lower bound of $\mathcal{O}(1/\epsilon)$ for optimization subject to convex constraints.
arXiv Detail & Related papers (2021-10-20T02:57:21Z)
- Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning [12.523496806744946]
We propose a successive convex approximation based off-policy optimization (SCAOPO) algorithm to solve the general constrained reinforcement learning problem.
In spite of the time-varying state distribution and the bias incurred by the off-policy learning, the SCAOPO with a feasible initial point can still provably converge to a Karush-Kuhn-Tucker point.
arXiv Detail & Related papers (2021-05-26T13:52:39Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
To our knowledge, this is the first analysis of SRL algorithms with global optimality guarantees.
arXiv Detail & Related papers (2020-11-11T16:05:14Z)
- Chance Constrained Policy Optimization for Process Control and Optimization [1.4908563154226955]
Chemical process optimization and control are affected by 1) plant-model mismatch, 2) process disturbances, and 3) constraints for safe operation.
We propose a chance constrained policy optimization algorithm which guarantees the satisfaction of joint chance constraints with a high probability.
arXiv Detail & Related papers (2020-07-30T14:20:35Z)
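A joint chance constraint of the kind in the last entry ("all safety constraints hold simultaneously with probability at least p") can be checked empirically by Monte Carlo rollout. The sketch below is a generic verification helper, not the cited paper's optimization algorithm; `simulate_costs` is a hypothetical user-supplied rollout function returning one sampled cost vector under process disturbances.

```python
import random

def joint_chance_satisfied(simulate_costs, limits, target_prob=0.95,
                           n_samples=5000, seed=0):
    """Estimate P(all cost_i <= limit_i) by sampling and compare to target."""
    rng = random.Random(seed)
    # Count rollouts in which every cost stays within its limit jointly,
    # not just marginally per constraint.
    hits = sum(
        all(c <= l for c, l in zip(simulate_costs(rng), limits))
        for _ in range(n_samples)
    )
    p_hat = hits / n_samples
    return p_hat >= target_prob, p_hat
```

With a single uniform(0, 1) cost, a limit of 0.99 passes a 95% joint chance constraint while a limit of 0.5 fails it, matching the exact probabilities 0.99 and 0.5.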
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.