MDPGT: Momentum-based Decentralized Policy Gradient Tracking
- URL: http://arxiv.org/abs/2112.02813v1
- Date: Mon, 6 Dec 2021 06:55:51 GMT
- Title: MDPGT: Momentum-based Decentralized Policy Gradient Tracking
- Authors: Zhanhong Jiang, Xian Yeow Lee, Sin Yong Tan, Kai Liang Tan, Aditya
Balu, Young M. Lee, Chinmay Hegde, Soumik Sarkar
- Abstract summary: We propose a momentum-based decentralized policy gradient tracking (MDPGT) for multi-agent reinforcement learning.
MDPGT achieves the best available sample complexity of $\mathcal{O}(N^{-1}\epsilon^{-3})$ for converging to an $\epsilon$-stationary point of the global average of $N$ local performance functions.
This outperforms the state-of-the-art sample complexity in decentralized model-free reinforcement learning.
- Score: 29.22173174168708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel policy gradient method for multi-agent reinforcement
learning, which leverages two different variance-reduction techniques and does
not require large batches over iterations. Specifically, we propose a
momentum-based decentralized policy gradient tracking (MDPGT) where a new
momentum-based variance reduction technique is used to approximate the local
policy gradient surrogate with importance sampling, and an intermediate
parameter is adopted to track two consecutive policy gradient surrogates.
Moreover, MDPGT provably achieves the best available sample complexity of
$\mathcal{O}(N^{-1}\epsilon^{-3})$ for converging to an $\epsilon$-stationary
point of the global average of $N$ local performance functions (possibly
nonconcave). This outperforms the state-of-the-art sample complexity in
decentralized model-free reinforcement learning, and when initialized with a
single trajectory, the sample complexity matches those obtained by the existing
decentralized policy gradient methods. We further validate the theoretical
claim for the Gaussian policy function. When the required error tolerance
$\epsilon$ is small enough, MDPGT leads to a linear speed up, which has been
previously established in decentralized stochastic optimization, but not for
reinforcement learning. Lastly, we provide empirical results on a multi-agent
reinforcement learning benchmark environment to support our theoretical
findings.
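To make the two ingredients described above concrete, the following is a minimal, schematic sketch of a decentralized update in the spirit of MDPGT: each agent forms a momentum-based, importance-sampling-corrected surrogate of its local policy gradient, and an auxiliary tracking variable is mixed over the communication network so that every agent follows the average of the $N$ local surrogates. This is not the authors' implementation; the toy local gradient stoch_grad, the scalar proxy is_weight, the ring mixing matrix W, and the constants beta and eta are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d, T = 4, 5, 200       # agents, policy-parameter dimension, iterations
beta, eta = 0.2, 0.05     # momentum coefficient and step size (illustrative)

# Doubly stochastic mixing matrix for a ring communication network.
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = 0.5
    W[i, (i - 1) % N] = 0.25
    W[i, (i + 1) % N] = 0.25

def stoch_grad(theta):
    """Toy stochastic gradient of a local performance function.

    Stands in for a sampled policy-gradient estimate (e.g. one REINFORCE
    trajectory); here it is the noisy gradient of a concave quadratic
    peaked at 1.
    """
    return -(theta - 1.0) + 0.1 * rng.standard_normal(theta.shape)

def is_weight(theta_new, theta_old):
    """Crude scalar proxy for an importance-sampling ratio between policies.

    MDPGT-style methods reweight the previous iterate's gradient, evaluated
    on trajectories drawn from the current policy, by such a ratio.
    """
    return float(np.exp(-0.5 * np.sum((theta_new - theta_old) ** 2)))

theta = rng.standard_normal((N, d))                        # local parameters
u = np.array([stoch_grad(theta[i]) for i in range(N)])     # momentum surrogates
v = u.copy()                                               # gradient trackers

for t in range(T):
    theta_next = W @ theta + eta * v          # consensus step + tracked ascent
    u_next = np.empty_like(u)
    for i in range(N):
        g_new = stoch_grad(theta_next[i])
        g_old = stoch_grad(theta[i])          # real method: same trajectory, IS-reweighted
        w = is_weight(theta_next[i], theta[i])
        # Momentum-based variance reduction (STORM-style recursion).
        u_next[i] = beta * g_new + (1 - beta) * (u[i] + g_new - w * g_old)
    v = W @ v + (u_next - u)                  # gradient tracking over the network
    theta, u = theta_next, u_next

print("consensus error:", np.linalg.norm(theta - theta.mean(axis=0)))
print("average parameter (should drift toward 1):", theta.mean(axis=0))
```

In the real method the new and old gradients are evaluated on the same sampled trajectory and the importance weight is the corresponding likelihood ratio between the two policies; the toy stoch_grad above redraws noise instead, so the sketch only mirrors the structure of the momentum recursion and of the gradient-tracking step v = W @ v + (u_next - u).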
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Finite-Time Convergence and Sample Complexity of Actor-Critic Multi-Objective Reinforcement Learning [20.491176017183044]
This paper tackles the multi-objective reinforcement learning (MORL) problem.
It introduces an innovative actor-critic algorithm named MOAC which finds a policy by iteratively making trade-offs among conflicting reward signals.
arXiv Detail & Related papers (2024-05-05T23:52:57Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Adaptive Policy Learning to Additional Tasks [3.43814540650436]
This paper develops a policy learning method for tuning a pre-trained policy to adapt to additional tasks without altering the original task.
A method named Adaptive Policy Gradient (APG) is proposed in this paper, which combines Bellman's principle of optimality with the policy gradient approach to improve the convergence rate.
arXiv Detail & Related papers (2023-05-24T14:31:11Z) - Policy Mirror Descent Inherently Explores Action Space [10.772560347950053]
We establish for the first time an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity for online policy gradient methods without any exploration strategies.
New policy gradient methods can prevent repeatedly committing to potentially high-risk actions when searching for optimal policies.
arXiv Detail & Related papers (2023-03-08T05:19:08Z)
This paper studies a distributed policy gradient in collaborative multi-agent reinforcement learning (MARL).
Agents over a communication network aim to find the optimal policy to maximize the average of all agents' local returns.
arXiv Detail & Related papers (2021-11-25T08:07:30Z)
- Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of the area under the precision-recall curve (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z)
- Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order (ZO) algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization, instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
- Momentum-Based Policy Gradient Methods [133.53164856723782]
We propose a class of efficient momentum-based policy gradient methods for model-free reinforcement learning.
In particular, we present a non-adaptive version of the IS-MBPG method, which also reaches the best known sample complexity of $O(\epsilon^{-3})$ without any large batches (a schematic single-agent sketch of such an estimator appears after this list).
arXiv Detail & Related papers (2020-07-13T20:44:15Z)
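For the momentum-based estimators referenced in the last entry (and used, per the abstract above, as the local surrogate in MDPGT), here is a minimal single-agent sketch of the underlying recursion on a toy one-step problem. The Gaussian policy, the reward function, and the constants sigma, beta and eta are invented for illustration; the point is the update of u, in which the previous policy's gradient is evaluated on the same sampled action and reweighted by an importance-sampling ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0            # fixed exploration std of the Gaussian policy (assumed)
beta, eta = 0.5, 0.02  # momentum coefficient and step size (assumed)
T = 2000

def sample_action(theta):
    """Draw an action from the 1-D Gaussian policy N(theta, sigma^2)."""
    return theta + sigma * rng.standard_normal()

def log_prob_grad(theta, a):
    """d/dtheta log N(a; theta, sigma^2) = (a - theta) / sigma^2."""
    return (a - theta) / sigma**2

def reward(a):
    """Toy bounded one-step reward, peaked at a = 2 (invented for illustration)."""
    return np.exp(-0.5 * (a - 2.0) ** 2)

def is_ratio(theta_old, theta_new, a):
    """Likelihood ratio pi_{theta_old}(a) / pi_{theta_new}(a) at the sampled action."""
    r = np.exp(((a - theta_new) ** 2 - (a - theta_old) ** 2) / (2 * sigma**2))
    return min(r, 2.0)   # clipped only as a numerical safeguard for this toy

theta = 0.0
a = sample_action(theta)
u = reward(a) * log_prob_grad(theta, a)   # plain REINFORCE estimate at t = 0

for t in range(T):
    theta_next = theta + eta * u          # gradient ascent on the momentum surrogate
    a = sample_action(theta_next)         # one fresh action from the updated policy
    g_new = reward(a) * log_prob_grad(theta_next, a)
    # Previous policy's gradient on the SAME sample, importance-sampling reweighted.
    g_old = is_ratio(theta, theta_next, a) * reward(a) * log_prob_grad(theta, a)
    # STORM-style momentum recursion used by IS-MBPG-type estimators.
    u = beta * g_new + (1 - beta) * (u + g_new - g_old)
    theta = theta_next

print("learned policy mean (should drift toward the reward peak at 2):", theta)
```

Because the old gradient is computed on the action drawn from the new policy and multiplied by the likelihood ratio, its expectation matches the old policy's gradient, which is what keeps the bias of the recursion under control without requiring large batches; the clip inside is_ratio is only a numerical safeguard for this toy, not part of the analyzed methods.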
This list is automatically generated from the titles and abstracts of the papers on this site.