Distributed Policy Gradient with Variance Reduction in Multi-Agent
Reinforcement Learning
- URL: http://arxiv.org/abs/2111.12961v1
- Date: Thu, 25 Nov 2021 08:07:30 GMT
- Title: Distributed Policy Gradient with Variance Reduction in Multi-Agent
Reinforcement Learning
- Authors: Xiaoxiao Zhao, Jinlong Lei, Li Li
- Abstract summary: This paper studies a distributed policy gradient in collaborative multi-agent reinforcement learning (MARL)
Agents over a communication network aim to find the optimal policy to maximize the average of all agents' local returns.
- Score: 7.4447396913959185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies a distributed policy gradient in collaborative multi-agent
reinforcement learning (MARL), where agents over a communication network aim to
find the optimal policy to maximize the average of all agents' local returns.
Because the performance function in policy gradient methods is non-concave, existing
distributed stochastic optimization methods for convex problems cannot be
directly applied to policy gradient in MARL. This paper proposes a distributed
policy gradient method with variance reduction and gradient tracking to address the
high variance of policy gradient estimates, and utilizes importance weights to handle
the non-stationarity of the sampling process. We then provide an upper bound
on the mean-squared stationary gap, which depends on the number of iterations,
the mini-batch size, the epoch size, the problem parameters, and the network
topology. We further establish the sample and communication complexity to
obtain an $\epsilon$-approximate stationary point. Numerical experiments on the
control problem in MARL are performed to validate the effectiveness of the
proposed algorithm.
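The abstract names three ingredients: SVRG-style variance reduction, importance weights that correct for non-stationary sampling, and gradient tracking over the communication network. The following is a minimal sketch of how those pieces typically fit together, assuming REINFORCE-style trajectory gradients, a snapshot (reference) policy per epoch, and a doubly-stochastic mixing matrix `W`; the function names and the exact update order are illustrative and do not reproduce the paper's algorithm.

```python
import numpy as np

def importance_weight(log_prob_fn, theta_ref, theta, trajectory):
    """Likelihood ratio p(tau | theta_ref) / p(tau | theta) for a trajectory
    sampled from the current policy; it reweights the gradient evaluated at
    the reference (snapshot) parameters to compensate for non-stationarity."""
    lp_ref = sum(log_prob_fn(theta_ref, s, a) for s, a in trajectory)
    lp_cur = sum(log_prob_fn(theta, s, a) for s, a in trajectory)
    return np.exp(lp_ref - lp_cur)

def vr_gradient(grad_fn, log_prob_fn, theta, theta_ref, mu_ref, batch):
    """SVRG-style estimate over a mini-batch of trajectories:
    (1/B) * sum_tau [ g(theta; tau) - w(tau) * g(theta_ref; tau) ] + mu_ref,
    where mu_ref is a large-batch gradient computed once per epoch at theta_ref."""
    v = np.zeros_like(theta)
    for tau in batch:
        w = importance_weight(log_prob_fn, theta_ref, theta, tau)
        v += grad_fn(theta, tau) - w * grad_fn(theta_ref, tau)
    return v / len(batch) + mu_ref

def gradient_tracking_step(W, thetas, trackers, new_grads, old_grads, step):
    """One consensus round for all agents (rows index agents).
    W is a doubly-stochastic mixing matrix matching the communication graph.
    The trackers follow the network-average gradient; parameters ascend
    along the tracked direction after mixing with neighbours."""
    thetas = W @ thetas + step * trackers
    trackers = W @ trackers + new_grads - old_grads
    return thetas, trackers
```

In a full algorithm of this kind, each agent would draw trajectories from its own environment, refresh `theta_ref` and `mu_ref` at the start of every epoch, and feed the resulting `vr_gradient` outputs into `gradient_tracking_step`; the paper's bound relates the mini-batch size, epoch size, and network topology (entering through `W`) to the mean-squared stationary gap.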
Related papers
- Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action [10.219627570276689]
We develop a framework for a class of Markov Decision Processes with general state and action spaces.
We show that gradient methods converge to the globally optimal policy, with non-asymptotic guarantees.
Our result establishes the first complexity result for multi-period inventory systems.
arXiv Detail & Related papers (2024-09-25T17:56:02Z) - Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems [1.747623282473278]
We introduce a policy-gradient method for reinforcement learning (RL) that exploits a type of stationary distribution commonly obtained from Markov decision processes (MDPs) in stochastic networks.
Specifically, when the stationary distribution of the MDP is parametrized by policy parameters, we can improve existing policy-gradient methods for average-reward estimation.
arXiv Detail & Related papers (2023-12-05T14:44:58Z) - PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addresses these concerns by explicitly parameterizing the distribution of the upper-level alignment objective (reward design) by the optimal variable of the lower level.
Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z) - Monte Carlo Policy Gradient Method for Binary Optimization [3.742634130733923]
We develop a novel probabilistic model to sample the binary solution according to a parameterized policy distribution.
For coherent exploration in discrete spaces, parallel Markov Chain Monte Carlo (MCMC) methods are employed.
Convergence of the policy gradient method to stationary points in expectation is established.
arXiv Detail & Related papers (2023-07-03T07:01:42Z) - High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithm.
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs, obtained via a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Stochastic first-order methods for average-reward Markov decision processes [10.023632561462712]
We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation.
By combining the policy evaluation and policy optimization parts, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models.
arXiv Detail & Related papers (2022-05-11T23:02:46Z) - MDPGT: Momentum-based Decentralized Policy Gradient Tracking [29.22173174168708]
We propose a momentum-based decentralized policy gradient tracking (MDPGT) method for multi-agent reinforcement learning; a generic sketch of this kind of momentum gradient surrogate appears after this list.
MDPGT achieves the best available sample complexity of $\mathcal{O}(N^{-1}\epsilon^{-3})$ for converging to an $\epsilon$-stationary point of the global average of $N$ local performance functions.
This outperforms the state-of-the-art sample complexity in decentralized model-free reinforcement learning.
arXiv Detail & Related papers (2021-12-06T06:55:51Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
The proposed algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose an implicit distributional actor-critic (IDAC) built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)
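For the MDPGT entry above, "momentum-based" typically refers to a STORM-style recursive gradient surrogate with an importance-weight correction, which each agent then mixes with its neighbours through gradient tracking. Below is a minimal sketch of such a surrogate; the names and signature are illustrative, not the paper's exact recursion.

```python
import numpy as np

def momentum_surrogate(grad_fn, is_weight_fn, theta, theta_prev, v_prev, tau, beta):
    """STORM-style estimator for a trajectory tau drawn from the current policy:
        v = beta * g(theta; tau)
            + (1 - beta) * (v_prev + g(theta; tau) - w(tau) * g(theta_prev; tau)),
    where w(tau) = p(tau | theta_prev) / p(tau | theta) reweights the gradient
    at the previous iterate. Setting beta = 1 recovers plain stochastic policy gradient."""
    g_cur = grad_fn(theta, tau)
    g_prev = grad_fn(theta_prev, tau)
    w = is_weight_fn(theta_prev, theta, tau)
    return beta * g_cur + (1.0 - beta) * (v_prev + g_cur - w * g_prev)
```

Unlike the epoch-based snapshot estimator sketched earlier, this surrogate needs no periodic large-batch gradient, which is the usual appeal of momentum-based variance reduction in decentralized settings.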