Provably Convergent Policy Optimization via Metric-aware Trust Region Methods
- URL: http://arxiv.org/abs/2306.14133v1
- Date: Sun, 25 Jun 2023 05:41:38 GMT
- Title: Provably Convergent Policy Optimization via Metric-aware Trust Region Methods
- Authors: Jun Song, Niao He, Lijun Ding and Chaoyue Zhao
- Abstract summary: Trust-region methods are pervasively used to stabilize policy optimization in reinforcement learning.
We exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions.
We show that WPO guarantees a monotonic performance improvement, and SPO provably converges to WPO as the entropic regularizer diminishes.
- Score: 21.950484108431944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Trust-region methods based on Kullback-Leibler divergence are pervasively
used to stabilize policy optimization in reinforcement learning. In this paper,
we exploit more flexible metrics and examine two natural extensions of policy
optimization with Wasserstein and Sinkhorn trust regions, namely Wasserstein
policy optimization (WPO) and Sinkhorn policy optimization (SPO). Instead of
restricting the policy to a parametric distribution class, we directly optimize
the policy distribution and derive their closed-form policy updates based on
the Lagrangian duality. Theoretically, we show that WPO guarantees a monotonic
performance improvement, and SPO provably converges to WPO as the entropic
regularizer diminishes. Moreover, we prove that with a decaying Lagrangian
multiplier on the trust region constraint, both methods converge to global
optimality. Experiments across tabular domains, robotic locomotion, and
continuous control tasks further demonstrate the performance improvement of
both approaches, the greater robustness of WPO to sample insufficiency, and the
faster convergence of SPO, over state-of-the-art policy gradient methods.
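As a concrete illustration of the Sinkhorn trust region described above, the
following minimal sketch (in Python, assuming NumPy) computes the
entropic-regularized optimal-transport cost between the action distributions of
two consecutive policies at a single state, i.e. the quantity such a trust
region would constrain. It is not the paper's closed-form WPO/SPO update; the
function name, ground cost, regularization strength, and iteration count are
illustrative assumptions.

    import numpy as np

    def sinkhorn_cost(p, q, cost, eps=0.1, n_iters=200):
        """Entropic OT cost <P, C> between simplex vectors p and q under the
        ground cost matrix `cost`, via standard Sinkhorn scaling iterations."""
        K = np.exp(-cost / eps)          # Gibbs kernel
        u = np.ones_like(p)
        v = np.ones_like(q)
        for _ in range(n_iters):
            v = q / (K.T @ u)            # scale columns to match marginal q
            u = p / (K @ v)              # scale rows to match marginal p
        P = np.diag(u) @ K @ np.diag(v)  # entropic-regularized transport plan
        return float(np.sum(P * cost))

    # Toy check: old and new action distributions over 4 actions,
    # with |i - j| as the ground cost between actions i and j.
    pi_old = np.array([0.4, 0.3, 0.2, 0.1])
    pi_new = np.array([0.1, 0.2, 0.3, 0.4])
    C = np.abs(np.subtract.outer(np.arange(4), np.arange(4))).astype(float)
    print(sinkhorn_cost(pi_old, pi_new, C, eps=0.05))

As eps shrinks toward zero this cost approaches the Wasserstein distance under
the same ground cost, mirroring the abstract's statement that SPO converges to
WPO as the entropic regularizer diminishes.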
Related papers
- Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR), which performs trust region policy optimization with the policy constrained within the support of the behavior policy.
We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
arXiv Detail & Related papers (2023-11-15T13:16:16Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this is the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is free of any explicit trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z)
- Trust Region Policy Optimization with Optimal Transport Discrepancies: Duality and Algorithm for Continuous Actions [5.820284464296154]
Trust Region Policy Optimization is a popular approach for stabilizing policy updates.
We propose a novel algorithm - Optimal Transport Trust Region Policy Optimization (OT-TRPO) - for continuous state-action spaces.
Our results show that optimal transport discrepancies can offer an advantage over state-of-the-art approaches.
arXiv Detail & Related papers (2022-10-20T10:04:35Z)
- Memory-Constrained Policy Optimization [59.63021433336966]
We introduce a new constrained optimization method for policy gradient reinforcement learning.
We form a second trust region through the construction of another virtual policy that represents a wide range of past policies.
We then constrain the new policy to stay close to the virtual policy, which is beneficial when the old policy performs poorly.
arXiv Detail & Related papers (2022-04-20T08:50:23Z)
- Optimistic Distributionally Robust Policy Optimization [2.345728642535161]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are prone to converge to a sub-optimal solution as they limit policy representation to a particular parametric distribution class.
We develop an innovative Optimistic Distributionally Robust Policy Optimization (ODRO) algorithm to solve the trust region constrained optimization problem without parameterizing the policies.
Our algorithm improves on TRPO and PPO with higher sample efficiency and better final-policy performance while maintaining learning stability.
arXiv Detail & Related papers (2020-06-14T06:36:18Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
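The last related paper above regularizes the discounted state-action visitation
distributions induced by consecutive policies. As a minimal tabular sketch of
that object (in Python, assuming NumPy), the following computes the visitation
distribution in closed form and a total-variation gap between two policies; the
toy MDP, the function name, and the choice of total variation are illustrative
assumptions, not details of that paper.

    import numpy as np

    def discounted_visitation(P, pi, mu, gamma=0.99):
        """Discounted state-action visitation d(s, a) = (1 - gamma) *
        sum_t gamma^t Pr(s_t = s, a_t = a), for transition tensor P[s, a, s'],
        policy pi[s, a], and initial state distribution mu[s]."""
        n_states = P.shape[0]
        P_pi = np.einsum('sa,sat->st', pi, P)       # state-to-state kernel under pi
        d_state = (1 - gamma) * np.linalg.solve(
            np.eye(n_states) - gamma * P_pi.T, mu)  # discounted state visitation
        return d_state[:, None] * pi                # spread state mass over actions

    # Toy 2-state, 2-action MDP and two nearby policies.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.7, 0.3], [0.1, 0.9]]])
    mu = np.array([1.0, 0.0])
    pi_old = np.array([[0.6, 0.4], [0.5, 0.5]])
    pi_new = np.array([[0.5, 0.5], [0.4, 0.6]])
    d_old = discounted_visitation(P, pi_old, mu)
    d_new = discounted_visitation(P, pi_new, mu)
    print(0.5 * np.abs(d_old - d_new).sum())        # gap a proximity term would penalize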