Trust Region Policy Optimization with Optimal Transport Discrepancies:
Duality and Algorithm for Continuous Actions
- URL: http://arxiv.org/abs/2210.11137v1
- Date: Thu, 20 Oct 2022 10:04:35 GMT
- Title: Trust Region Policy Optimization with Optimal Transport Discrepancies:
Duality and Algorithm for Continuous Actions
- Authors: Antonio Terpin, Nicolas Lanzetti, Batuhan Yardim, Florian Dörfler,
Giorgia Ramponi
- Abstract summary: Trust Region Policy Optimization is a popular approach to stabilize the policy updates.
We propose a novel algorithm - Optimal Transport Trust Region Policy Optimization (OT-TRPO) - for continuous state-action spaces.
Our results show that optimal transport discrepancies can offer an advantage over state-of-the-art approaches.
- Score: 5.820284464296154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy Optimization (PO) algorithms have been proven particularly suited to
handle the high-dimensionality of real-world continuous control tasks. In this
context, Trust Region Policy Optimization methods represent a popular approach
to stabilize the policy updates. These usually rely on the Kullback-Leibler
(KL) divergence to limit the change in the policy. The Wasserstein distance
represents a natural alternative, in place of the KL divergence, to define
trust regions or to regularize the objective function. However,
state-of-the-art works either resort to its approximations or do not provide an
algorithm for continuous state-action spaces, reducing the applicability of the
method. In this paper, we explore optimal transport discrepancies (which
include the Wasserstein distance) to define trust regions, and we propose a
novel algorithm - Optimal Transport Trust Region Policy Optimization (OT-TRPO)
- for continuous state-action spaces. We circumvent the infinite-dimensional
optimization problem for PO by providing a one-dimensional dual reformulation
for which strong duality holds. We then analytically derive the optimal policy
update given the solution of the dual problem. This way, we bypass the
computation of optimal transport costs and of optimal transport maps, which we
implicitly characterize by solving the dual formulation. Finally, we provide an
experimental evaluation of our approach across various control tasks. Our
results show that optimal transport discrepancies can offer an advantage over
state-of-the-art approaches.
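For readers who want the shape of the construction, here is a minimal per-state sketch of an optimal-transport trust-region update and its one-dimensional dual, based on standard Wasserstein-ball duality rather than the paper's exact statement; the symbols A^{\pi_k} (advantage), c (ground transport cost), and \varepsilon (trust-region radius) are illustrative assumptions.
\[
\max_{\pi(\cdot\mid s)}\ \mathbb{E}_{a\sim\pi(\cdot\mid s)}\big[A^{\pi_k}(s,a)\big]
\quad \text{s.t.}\quad \mathcal{T}_c\big(\pi(\cdot\mid s),\ \pi_k(\cdot\mid s)\big)\le \varepsilon,
\]
where \mathcal{T}_c is an optimal transport discrepancy with cost c. Duality for such constraints reduces the update to a one-dimensional problem over a multiplier \lambda\ge 0,
\[
\inf_{\lambda\ge 0}\ \lambda\varepsilon+\mathbb{E}_{a\sim\pi_k(\cdot\mid s)}\Big[\sup_{a'}\ A^{\pi_k}(s,a')-\lambda\,c(a,a')\Big],
\]
and once \lambda^\star is found, the updated policy moves the mass of each action a to a maximizer of the inner supremum, which is why no explicit transport cost or transport map has to be computed.
A toy numerical sketch of this dual solve on a discretized action grid (a hypothetical helper, not the authors' implementation):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def ot_trpo_update(adv, cost, pi_old, eps):
        # Toy single-state update. adv[j]: advantage of action j; cost[i, j]:
        # transport cost from action i to action j; pi_old: current policy
        # weights on the grid; eps: trust-region radius. Assumes the standard
        # Wasserstein-ball dual sketched above, not the paper's exact statement.
        def dual(lam):
            # inner c-transform: best target value for each source action
            inner = (adv[None, :] - lam * cost).max(axis=1)
            return lam * eps + pi_old @ inner
        # the dual is convex in lam; a bounded scalar search suffices for a toy
        lam = minimize_scalar(dual, bounds=(0.0, 1e3), method="bounded").x
        # optimal update: each source action's mass moves to its maximizer
        targets = (adv[None, :] - lam * cost).argmax(axis=1)
        pi_new = np.zeros_like(pi_old)
        np.add.at(pi_new, targets, pi_old)
        return pi_new, lam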
Related papers
- Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
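For context (a standard primal-dual construction, not a claim about D-PGPD's exact formulation): for a constrained MDP with reward value V_r, constraint value V_g, and threshold b, such methods typically alternate updates on the Lagrangian
\[
L(\pi,\lambda)=V_r(\pi)+\lambda\big(V_g(\pi)-b\big),\qquad \lambda\ge 0,
\]
ascending in the policy (primal) variable and descending in the multiplier (dual) variable.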
arXiv Detail & Related papers (2024-08-19T14:11:04Z)
- Supported Trust Region Optimization for Offline Reinforcement Learning [59.43508325943592]
We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy.
We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset.
arXiv Detail & Related papers (2023-11-15T13:16:16Z)
- Provably Convergent Policy Optimization via Metric-aware Trust Region Methods [21.950484108431944]
Trust-region methods are pervasively used to stabilize policy optimization in reinforcement learning.
We exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions.
We show that WPO guarantees a monotonic performance improvement, and SPO provably converges to WPO as the entropic regularizer diminishes.
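As background on the Sinkhorn-to-Wasserstein limit mentioned above (standard definitions, not taken from that paper), the entropy-regularized optimal transport cost is
\[
W_{c,\epsilon}(\mu,\nu)=\min_{\gamma\in\Pi(\mu,\nu)}\int c\,\mathrm{d}\gamma+\epsilon\,\mathrm{KL}\big(\gamma\,\|\,\mu\otimes\nu\big),
\]
which recovers the unregularized optimal transport cost as \epsilon\to 0.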
arXiv Detail & Related papers (2023-06-25T05:41:38Z)
- Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z)
- Constrained Proximal Policy Optimization [36.20839673950677]
We propose a novel first-order feasible method named Constrained Proximal Policy Optimization (CPPO).
Our approach integrates the Expectation-Maximization framework and solves the problem in two steps: 1) calculating the optimal policy distribution within the feasible region (E-step), and 2) conducting a first-order update to adjust the current policy towards the optimal policy obtained in the E-step (M-step).
Empirical evaluations conducted in complex and uncertain environments validate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-23T16:33:55Z)
- Policy Gradient Algorithms Implicitly Optimize by Continuation [7.351769270728942]
We argue that exploration in policy-gradient algorithms consists in a continuation of the return of the policy at hand, and that policies should be history-dependent rather than optimized only to maximize the return.
arXiv Detail & Related papers (2023-05-11T14:50:20Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is explicitly free of any trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z)
- Distributionally-Constrained Policy Optimization via Unbalanced Optimal Transport [15.294456568539148]
We formulate policy optimization as unbalanced optimal transport over the space of occupancy measures.
We propose a general purpose RL objective based on Bregman divergence and optimize it using Dykstra's algorithm.
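For context on the Bregman-divergence objective mentioned above (a standard definition, not notation from that paper): for a strictly convex, differentiable generator \varphi,
\[
D_\varphi(x,y)=\varphi(x)-\varphi(y)-\langle\nabla\varphi(y),\,x-y\rangle,
\]
which recovers the squared Euclidean distance for \varphi(x)=\|x\|^2 and the (generalized) KL divergence for the negative-entropy generator.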
arXiv Detail & Related papers (2021-02-15T23:04:37Z)
- Optimistic Distributionally Robust Policy Optimization [2.345728642535161]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are prone to converge to a sub-optimal solution as they limit policy representation to a particular parametric distribution class.
We develop an innovative Optimistic Distributionally Robust Policy Optimization (ODRO) algorithm to solve the trust region constrained optimization problem without parameterizing the policies.
Our algorithm improves on TRPO and PPO with higher sample efficiency and better final-policy performance while maintaining learning stability.
arXiv Detail & Related papers (2020-06-14T06:36:18Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
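As background, the discounted state-action visitation distribution referenced above is standardly defined (not notation from that paper) as
\[
d_\pi(s,a)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\Pr\big(s_t=s,\,a_t=a\mid\pi\big),
\]
so the proximity term keeps the long-run state-action occupancies of consecutive policies close to one another.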
arXiv Detail & Related papers (2020-03-09T13:05:47Z)