Mirror Descent Policy Optimization
- URL: http://arxiv.org/abs/2005.09814v5
- Date: Mon, 7 Jun 2021 13:44:15 GMT
- Title: Mirror Descent Policy Optimization
- Authors: Manan Tomar, Lior Shani, Yonathan Efroni, Mohammad Ghavamzadeh
- Abstract summary: We propose an efficient RL algorithm, called mirror descent policy optimization (MDPO).
MDPO iteratively updates the policy by approximately solving a trust-region problem.
We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms: TRPO and PPO, and show that explicitly enforcing the trust-region constraint is in fact not a necessity for high performance gains in TRPO.
- Score: 41.46894905097985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mirror descent (MD), a well-known first-order method in constrained convex
optimization, has recently been shown as an important tool to analyze
trust-region algorithms in reinforcement learning (RL). However, there remains
a considerable gap between such theoretically analyzed algorithms and the ones
used in practice. Inspired by this, we propose an efficient RL algorithm,
called {\em mirror descent policy optimization} (MDPO). MDPO iteratively
updates the policy by {\em approximately} solving a trust-region problem, whose
objective function consists of two terms: a linearization of the standard RL
objective and a proximity term that restricts two consecutive policies to be
close to each other. Each update performs this approximation by taking multiple
gradient steps on this objective function. We derive {\em on-policy} and {\em
off-policy} variants of MDPO, while emphasizing important design choices
motivated by the existing theory of MD in RL. We highlight the connections
between on-policy MDPO and two popular trust-region RL algorithms: TRPO and
PPO, and show that explicitly enforcing the trust-region constraint is in fact
{\em not} a necessity for high performance gains in TRPO. We then show how the
popular soft actor-critic (SAC) algorithm can be derived by slight
modifications of off-policy MDPO. Overall, MDPO is derived from the MD
principles, offers a unified approach to viewing a number of popular RL
algorithms, and performs better than or on-par with TRPO, PPO, and SAC in a
number of continuous control tasks. Code is available at
\url{https://github.com/manantomar/Mirror-Descent-Policy-Optimization}.
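To make the update described in the abstract concrete, here is a minimal on-policy sketch, assuming a PyTorch-style setup in which `policy(states)` and `old_policy(states)` return `torch.distributions` objects; the function name, interfaces, and the fixed proximity weight (standing in for the annealed step size used in the paper) are illustrative assumptions, not the authors' reference implementation. Each iteration freezes a snapshot of the current policy and takes several gradient steps on a surrogate that combines an importance-weighted advantage term (the linearized RL objective) with a KL proximity term.
```python
import torch

def mdpo_update(policy, old_policy, optimizer, states, actions, advantages,
                kl_coef=1.0, num_sgd_steps=10):
    """Sketch of one on-policy MDPO iteration (hypothetical interface):
    several SGD steps on a linearized objective plus a KL proximity term."""
    with torch.no_grad():
        old_dist = old_policy(states)               # snapshot of pi_k, held fixed
        old_log_prob = old_dist.log_prob(actions)

    for _ in range(num_sgd_steps):
        dist = policy(states)                       # current iterate pi
        log_prob = dist.log_prob(actions)

        # Linearized RL objective: importance-weighted advantages from pi_k samples.
        ratio = torch.exp(log_prob - old_log_prob)
        policy_gain = (ratio * advantages).mean()

        # Proximity term: KL(pi || pi_k), averaged over visited states.
        kl = torch.distributions.kl_divergence(dist, old_dist).mean()

        loss = -(policy_gain - kl_coef * kl)        # maximize gain, penalize drift
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
The number of inner gradient steps and the schedule of the proximity weight are the design choices the paper emphasizes; this sketch fixes both for brevity.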
Related papers
- A Theoretical Analysis of Optimistic Proximal Policy Optimization in
Linear Markov Decision Processes [13.466249082564213]
We propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback.
Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both linear MDPs and adversarial linear MDPs with full information.
arXiv Detail & Related papers (2023-05-15T17:55:24Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm.
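As a rough illustration of the double-sampling issue mentioned above, here is the standard Fenchel-dual identity for the square of an expectation (a generic identity offered as context; the paper's exact formulation may differ):
```latex
\mathrm{Var}[X] \;=\; \mathbb{E}[X^2] - \big(\mathbb{E}[X]\big)^2,
\qquad
\big(\mathbb{E}[X]\big)^2 \;=\; \max_{\nu \in \mathbb{R}} \Big( 2\,\nu\,\mathbb{E}[X] \;-\; \nu^2 \Big),
\quad \text{with maximizer } \nu^{\star} = \mathbb{E}[X].
```
A naive gradient estimate of the squared expectation needs two independent samples of X (double sampling), whereas the dual form is linear in the expectation for a fixed dual variable, so a single sample suffices and the dual variable can be optimized jointly with the policy.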
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Towards Global Optimality in Cooperative MARL with the Transformation And Distillation Framework [26.612749327414335]
Decentralized execution is a core requirement in cooperative multi-agent reinforcement learning (MARL).
In this paper, we theoretically analyze two common classes of algorithms with decentralized policies -- multi-agent policy gradient methods and value-decomposition methods.
We show that TAD-PPO achieves provably optimal policy learning in finite multi-agent MDPs and delivers significantly better performance on a large set of cooperative multi-agent tasks.
arXiv Detail & Related papers (2022-07-12T06:59:13Z)
- Robustness and risk management via distributional dynamic programming [13.173307471333619]
We introduce a new class of distributional operators, together with a practical DP algorithm for policy evaluation.
Our approach reformulates the problem through an augmented state space in which each state is split into a worst-case substate and a best-case substate.
We derive distributional operators and DP algorithms solving a new control task.
arXiv Detail & Related papers (2021-12-28T12:12:57Z)
- Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well.
This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
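A minimal sketch of the one-step recipe summarized above, under assumed interfaces (a PyTorch-style setup with a reparameterizable continuous-action policy; `q_behavior`, `behavior_policy`, and the KL weight are hypothetical placeholders rather than the paper's exact method): fit an on-policy estimate of the behavior policy's Q-function from the logged data, then perform a single regularized policy-improvement step against it, with no further off-policy evaluation.
```python
import torch

def one_step_improvement(policy, behavior_policy, q_behavior, optimizer,
                         states, kl_coef=0.1, num_grad_steps=1000):
    """Sketch: one regularized improvement step against a fixed estimate of
    the behavior policy's Q-function (no iterative off-policy evaluation).
    `policy(states)` / `behavior_policy(states)` return torch.distributions
    objects with rsample(); `q_behavior(s, a)` estimates Q^beta."""
    for _ in range(num_grad_steps):
        dist = policy(states)
        actions = dist.rsample()                     # reparameterized actions
        q_values = q_behavior(states, actions)

        with torch.no_grad():
            behavior_dist = behavior_policy(states)  # fitted behavior policy
        kl = torch.distributions.kl_divergence(dist, behavior_dist).mean()

        loss = -(q_values.mean() - kl_coef * kl)     # regularized improvement
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
Although many gradient steps are taken, this still counts as a single improvement step in the paper's sense, because the Q-estimate is never re-evaluated for the updated policy.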
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
- Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a generalized policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
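For context, a schematic form of a regularized policy mirror descent update (a common way to write such updates, not necessarily the exact operator analyzed in this paper): with convex regularizer h, regularization weight tau, step size eta, and current value estimate Q^{(k)}, each state's policy is updated by
```latex
\pi^{(k+1)}(\cdot \mid s) \;=\;
\arg\max_{p \,\in\, \Delta(\mathcal{A})}
\Big\{ \big\langle Q^{(k)}(s,\cdot),\, p \big\rangle
\;-\; \tau\, h(p)
\;-\; \tfrac{1}{\eta}\, D_h\!\big(p,\ \pi^{(k)}(\cdot \mid s)\big) \Big\},
```
where D_h is the Bregman divergence generated by h; taking h to be the negative entropy makes D_h the KL divergence and gives the update a closed-form, softmax-like solution.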
arXiv Detail & Related papers (2021-05-24T02:21:34Z)
- Queueing Network Controls via Deep Reinforcement Learning [0.0]
We develop a Proximal policy optimization algorithm for queueing networks.
The algorithm consistently generates control policies that outperform the state of the art in the literature.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function.
arXiv Detail & Related papers (2020-07-31T01:02:57Z)
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms.
Specifically, we investigate the consequences of "code-level optimizations": algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm.
Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)