Distributed Policy Gradient with Variance Reduction in Multi-Agent
Reinforcement Learning
- URL: http://arxiv.org/abs/2111.12961v1
- Date: Thu, 25 Nov 2021 08:07:30 GMT
- Title: Distributed Policy Gradient with Variance Reduction in Multi-Agent
Reinforcement Learning
- Authors: Xiaoxiao Zhao, Jinlong Lei, Li Li
- Abstract summary: This paper studies a distributed policy gradient in collaborative multi-agent reinforcement learning (MARL)
Agents over a communication network aim to find the optimal policy to maximize the average of all agents' local returns.
- Score: 7.4447396913959185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies a distributed policy gradient in collaborative multi-agent
reinforcement learning (MARL), where agents over a communication network aim to
find the optimal policy to maximize the average of all agents' local returns.
Because the performance function in policy gradient methods is non-concave, existing
distributed stochastic optimization methods for convex problems cannot be
directly applied to policy gradient in MARL. This paper proposes a distributed
policy gradient method with variance reduction and gradient tracking to address the
high variance of policy gradient estimates, and utilizes importance weights to handle
the non-stationarity of the sampling process. We then provide an upper bound
on the mean-squared stationary gap, which depends on the number of iterations,
the mini-batch size, the epoch size, the problem parameters, and the network
topology. We further establish the sample and communication complexity to
obtain an $\epsilon$-approximate stationary point. Numerical experiments on the
control problem in MARL are performed to validate the effectiveness of the
proposed algorithm.
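The abstract names three ingredients: SVRG-style variance reduction, importance weights that correct for non-stationary sampling, and gradient tracking over the communication network. The following is a minimal sketch of how those pieces typically fit together, assuming REINFORCE-style trajectory gradients, a snapshot (reference) policy per epoch, and a doubly-stochastic mixing matrix `W`; the function names and the exact update order are illustrative and do not reproduce the paper's algorithm.

```python
import numpy as np

def importance_weight(log_prob_fn, theta_ref, theta, trajectory):
    """Likelihood ratio p(tau | theta_ref) / p(tau | theta) for a trajectory
    sampled from the current policy; it reweights the gradient evaluated at
    the reference (snapshot) parameters to compensate for non-stationarity."""
    lp_ref = sum(log_prob_fn(theta_ref, s, a) for s, a in trajectory)
    lp_cur = sum(log_prob_fn(theta, s, a) for s, a in trajectory)
    return np.exp(lp_ref - lp_cur)

def vr_gradient(grad_fn, log_prob_fn, theta, theta_ref, mu_ref, batch):
    """SVRG-style estimate over a mini-batch of trajectories:
    (1/B) * sum_tau [ g(theta; tau) - w(tau) * g(theta_ref; tau) ] + mu_ref,
    where mu_ref is a large-batch gradient computed once per epoch at theta_ref."""
    v = np.zeros_like(theta)
    for tau in batch:
        w = importance_weight(log_prob_fn, theta_ref, theta, tau)
        v += grad_fn(theta, tau) - w * grad_fn(theta_ref, tau)
    return v / len(batch) + mu_ref

def gradient_tracking_step(W, thetas, trackers, new_grads, old_grads, step):
    """One consensus round for all agents (rows index agents).
    W is a doubly-stochastic mixing matrix matching the communication graph.
    The trackers follow the network-average gradient; parameters ascend
    along the tracked direction after mixing with neighbours."""
    thetas = W @ thetas + step * trackers
    trackers = W @ trackers + new_grads - old_grads
    return thetas, trackers
```

In a full algorithm of this kind, each agent would draw trajectories from its own environment, refresh `theta_ref` and `mu_ref` at the start of every epoch, and feed the resulting `vr_gradient` outputs into `gradient_tracking_step`; the paper's bound relates the mini-batch size, epoch size, and network topology (entering through `W`) to the mean-squared stationary gap.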
Related papers
- Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action [10.219627570276689]
We develop a framework for a class of Markov Decision Processes with general state and action spaces.
We show that gradient methods converge to the globally optimal policy, with non-asymptotic guarantees.
Our result establishes the first complexity result for multi-period inventory systems.
arXiv Detail & Related papers (2024-09-25T17:56:02Z) - Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems [1.747623282473278]
We introduce a policy-gradient method for reinforcement learning (RL) that exploits a type of stationary distribution commonly obtained from Markov decision processes (MDPs) in stochastic networks.
Specifically, when the stationary distribution of the MDP is parametrized by policy parameters, we can improve existing policy-gradient methods for average-reward estimation.
arXiv Detail & Related papers (2023-12-05T14:44:58Z) - PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addresses these concerns by explicitly parameterizing the distribution of the upper-level alignment objective (reward design) by the optimal variable of the lower level.
Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z) - Monte Carlo Policy Gradient Method for Binary Optimization [3.742634130733923]
We develop a novel probabilistic model to sample the binary solution according to a parameterized policy distribution.
For coherent exploration in discrete spaces, parallel Markov Chain Monte Carlo (MCMC) methods are employed.
Convergence of the policy gradient method to stationary points in expectation is established.
arXiv Detail & Related papers (2023-07-03T07:01:42Z) - High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm.
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained via a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Stochastic first-order methods for average-reward Markov decision processes [10.023632561462712]
We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation.
By combining the policy evaluation and policy optimization parts, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models.
arXiv Detail & Related papers (2022-05-11T23:02:46Z) - MDPGT: Momentum-based Decentralized Policy Gradient Tracking [29.22173174168708]
We propose a momentum-based decentralized policy gradient tracking (MDPGT) method for multi-agent reinforcement learning; a generic sketch of this kind of momentum gradient surrogate appears after this list.
MDPGT achieves the best available sample complexity of $\mathcal{O}(N^{-1}\epsilon^{-3})$ for converging to an $\epsilon$-stationary point of the global average of $N$ local performance functions.
This outperforms the state-of-the-art sample complexity in decentralized model-free reinforcement learning.
arXiv Detail & Related papers (2021-12-06T06:55:51Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
The proposed algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose an implicit distributional actor-critic (IDAC) built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)
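For the MDPGT entry above, "momentum-based" typically refers to a STORM-style recursive gradient surrogate with an importance-weight correction, which each agent then mixes with its neighbours through gradient tracking. Below is a minimal sketch of such a surrogate; the names and signature are illustrative, not the paper's exact recursion.

```python
import numpy as np

def momentum_surrogate(grad_fn, is_weight_fn, theta, theta_prev, v_prev, tau, beta):
    """STORM-style estimator for a trajectory tau drawn from the current policy:
        v = beta * g(theta; tau)
            + (1 - beta) * (v_prev + g(theta; tau) - w(tau) * g(theta_prev; tau)),
    where w(tau) = p(tau | theta_prev) / p(tau | theta) reweights the gradient
    at the previous iterate. Setting beta = 1 recovers plain stochastic policy gradient."""
    g_cur = grad_fn(theta, tau)
    g_prev = grad_fn(theta_prev, tau)
    w = is_weight_fn(theta_prev, theta, tau)
    return beta * g_cur + (1.0 - beta) * (v_prev + g_cur - w * g_prev)
```

Unlike the epoch-based snapshot estimator sketched earlier, this surrogate needs no periodic large-batch gradient, which is the usual appeal of momentum-based variance reduction in decentralized settings.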