CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee
- URL: http://arxiv.org/abs/2011.05869v3
- Date: Mon, 31 May 2021 04:41:09 GMT
- Title: CRPO: A New Approach for Safe Reinforcement Learning with Convergence
Guarantee
- Authors: Tengyu Xu, Yingbin Liang, Guanghui Lan
- Abstract summary: In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward while avoiding violation of certain constraints.
This is the first finite-time analysis of primal SRL algorithms with a global optimality guarantee.
- Score: 61.176159046544946
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In safe reinforcement learning (SRL) problems, an agent explores the
environment to maximize an expected total reward and meanwhile avoids violation
of certain constraints on a number of expected total costs. In general, such
SRL problems have nonconvex objective functions subject to multiple nonconvex
constraints, and hence are very challenging to solve, particularly to provide a
globally optimal policy. Many popular SRL algorithms adopt a primal-dual
structure which utilizes the updating of dual variables for satisfying the
constraints. In contrast, we propose a primal approach, called
constraint-rectified policy optimization (CRPO), which updates the policy
alternately between objective improvement and constraint satisfaction. CRPO
provides a primal-type algorithmic framework to solve SRL problems, where each
policy update can take any variant of policy optimization step. To demonstrate
the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for
each policy update step and show that CRPO achieves an
$\mathcal{O}(1/\sqrt{T})$ convergence rate to the global optimal policy in the
constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on
constraint satisfaction. This is the first finite-time analysis of primal SRL
algorithms with a global optimality guarantee. Our empirical results demonstrate
that CRPO can significantly outperform existing primal-dual baseline algorithms.
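To make the alternating update concrete, below is a minimal Python sketch of a single CRPO-style iteration as described in the abstract. The names `estimate_reward`, `estimate_costs`, `policy_update`, `limits`, and `tolerance` are placeholders for whichever estimators, constraint thresholds, and policy-optimization step (e.g., an NPG step, as in the paper's analysis) one plugs in; this is not the authors' implementation.

```python
def crpo_iteration(policy, estimate_reward, estimate_costs, limits,
                   tolerance, policy_update):
    """One CRPO-style iteration (sketch).

    If every estimated cost is within its limit (up to a tolerance),
    take a step that improves the reward objective; otherwise take a
    step that reduces the cost of one violated constraint.
    `policy_update` can be any policy-optimization step, e.g. NPG.
    """
    costs = estimate_costs(policy)  # one estimated expected total cost per constraint
    violated = [i for i, (c, d) in enumerate(zip(costs, limits))
                if c > d + tolerance]
    if not violated:
        # Objective-improvement step: ascend on the estimated expected reward.
        return policy_update(policy, objective=estimate_reward, maximize=True)
    # Constraint-rectification step: descend on the cost of one violated constraint.
    i = violated[0]
    return policy_update(policy, objective=lambda p: estimate_costs(p)[i],
                         maximize=False)
```

Running this step for T iterations with an NPG update corresponds to the setting in which the paper establishes its $\mathcal{O}(1/\sqrt{T})$ convergence rate and constraint-violation bound.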
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Fast Policy Learning for Linear Quadratic Control with Entropy
Regularization [10.771650397337366]
This paper proposes and analyzes two new policy learning methods: regularized policy gradient (RPG) and iterative policy optimization (IPO), for a class of discounted linear-quadratic control (LQC) problems.
Assuming access to exact policy evaluation, both proposed approaches are proven to converge linearly to the optimal policies of the regularized LQC problems.
arXiv Detail & Related papers (2023-11-23T19:08:39Z) - Constrained Proximal Policy Optimization [36.20839673950677]
We propose a novel first-order feasible method named Constrained Proximal Policy Optimization (CPPO)
Our approach integrates the Expectation-Maximization framework to solve it through two steps: 1) calculating the optimal policy distribution within the feasible region (E-step), and 2) conducting a first-order update to adjust the current policy towards the optimal policy obtained in the E-step (M-step)
Empirical evaluations conducted in complex and uncertain environments validate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-23T16:33:55Z) - Faster Last-iterate Convergence of Policy Optimization in Zero-Sum
Markov Games [63.60117916422867]
This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games.
We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method.
Our convergence results improve upon the best known complexities, and lead to a better understanding of policy optimization in competitive Markov games.
arXiv Detail & Related papers (2022-10-03T16:05:43Z) - COptiDICE: Offline Constrained Reinforcement Learning via Stationary
Distribution Correction Estimation [73.17078343706909]
We study the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes the expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Projection-Based Constrained Policy Optimization [34.555500347840805]
We propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO)
PCPO achieves more than 3.5 times less constraint violation and around 15% higher reward compared to state-of-the-art methods.
arXiv Detail & Related papers (2020-10-07T04:22:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.