Global Convergence of Localized Policy Iteration in Networked
Multi-Agent Reinforcement Learning
- URL: http://arxiv.org/abs/2211.17116v1
- Date: Wed, 30 Nov 2022 15:58:00 GMT
- Title: Global Convergence of Localized Policy Iteration in Networked
Multi-Agent Reinforcement Learning
- Authors: Yizhou Zhang, Guannan Qu, Pan Xu, Yiheng Lin, Zaiwei Chen, Adam
Wierman
- Abstract summary: We study a multi-agent reinforcement learning (MARL) problem where the agents interact over a given network.
The goal of the agents is to cooperatively maximize the average of their entropy-regularized long-term rewards.
To overcome the curse of dimensionality and to reduce communication, we propose a Localized Policy Iteration (LPI) algorithm that provably learns a near-globally-optimal policy using only local information.
- Score: 25.747559058350557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study a multi-agent reinforcement learning (MARL) problem where the agents
interact over a given network. The goal of the agents is to cooperatively
maximize the average of their entropy-regularized long-term rewards. To
overcome the curse of dimensionality and to reduce communication, we propose a
Localized Policy Iteration (LPI) algorithm that provably learns a
near-globally-optimal policy using only local information. In particular, we
show that, despite restricting each agent's attention to only its $\kappa$-hop
neighborhood, the agents are able to learn a policy with an optimality gap that
decays polynomially in $\kappa$. In addition, we show the finite-sample
convergence of LPI to the global optimal policy, which explicitly captures the
trade-off between optimality and computational complexity in choosing $\kappa$.
Numerical simulations demonstrate the effectiveness of LPI.
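To make the locality idea concrete, here is a minimal Python sketch (not the paper's implementation; the graph representation, helper names, and data layout are assumptions) of how an agent's observation could be truncated to its $\kappa$-hop neighborhood on the interaction network:

# Hypothetical sketch (not the paper's algorithm): restrict each agent's
# policy input to the states of agents within its kappa-hop neighborhood.
from collections import deque
from typing import Dict, List, Set

def kappa_hop_neighborhood(adj: Dict[int, List[int]], agent: int, kappa: int) -> Set[int]:
    """Breadth-first search up to depth kappa; returns the agent itself plus
    all agents reachable within kappa hops on the interaction graph."""
    visited = {agent}
    frontier = deque([(agent, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == kappa:
            continue
        for nbr in adj[node]:
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, dist + 1))
    return visited

def local_observation(global_state: Dict[int, object], adj: Dict[int, List[int]], agent: int, kappa: int):
    """Truncate the global state to the kappa-hop neighborhood of `agent`.
    A localized policy would condition only on this slice."""
    hood = kappa_hop_neighborhood(adj, agent, kappa)
    return {i: global_state[i] for i in sorted(hood)}

# Example: a 6-agent line graph; with kappa = 2, agent 0 sees agents {0, 1, 2}.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
state = {i: f"s{i}" for i in adj}
print(local_observation(state, adj, agent=0, kappa=2))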
Related papers
- Federated Reinforcement Learning with Constraint Heterogeneity [22.79217297480751]
We study a Federated Reinforcement Learning (FedRL) problem with constraint heterogeneity.
We show that FedNPG achieves global convergence with an $\tilde{O}(1/\sqrt{T})$ rate, and FedPPO efficiently solves complicated learning tasks with the use of deep neural networks.
arXiv Detail & Related papers (2024-05-06T07:44:50Z)
- Efficient Reinforcement Learning for Global Decision Making in the Presence of Local Agents at Scale [5.3526997662068085]
We study reinforcement learning for global decision-making in the presence of local agents.
In this setting, scalability has been a long-standing challenge due to the size of the state space.
We show that this learned policy converges to the optimal policy at a rate of order $\tilde{O}(1/\sqrt{k}+\epsilon_{k,m})$ as the number $k$ of sub-sampled agents increases.
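As a rough illustration of the sub-sampling idea (the function and data layout below are hypothetical, not taken from the paper), a global decision-maker might observe only the states of $k$ randomly chosen local agents instead of all $n$, trading accuracy (the $\epsilon_{k,m}$ term) for a tractable state space:

# Hypothetical sketch: the global agent conditions on a sub-sample of k local states.
import random
from typing import Dict, List

def subsampled_state(local_states: Dict[int, int], k: int, rng: random.Random) -> List[int]:
    """Return the sorted states of k agents sampled without replacement,
    a compact stand-in for the full joint local state."""
    chosen = rng.sample(sorted(local_states), k)
    return sorted(local_states[i] for i in chosen)

rng = random.Random(0)
local_states = {i: rng.randint(0, 3) for i in range(100)}  # n = 100 local agents
print(subsampled_state(local_states, k=10, rng=rng))       # global agent sees only 10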
arXiv Detail & Related papers (2024-03-01T01:49:57Z)
- Federated Natural Policy Gradient and Actor Critic Methods for Multi-task Reinforcement Learning [46.28771270378047]
Federated reinforcement learning (RL) enables collaborative decision making of multiple distributed agents without sharing local data trajectories.
In this work, we consider a multi-task setting, in which each agent has its own private reward function corresponding to different tasks, while sharing the same transition kernel of the environment.
We learn a globally optimal policy that maximizes the sum of the discounted total rewards of all the agents in a decentralized manner.
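In symbols, and using standard discounted-return notation that is assumed rather than quoted from the paper, the shared multi-task objective could be written as:

% Multi-task federated objective (notation assumed): N agents share the
% transition kernel P but hold private reward functions r_i.
\[
  \max_{\pi} \; \sum_{i=1}^{N} V_i^{\pi}(\rho),
  \qquad
  V_i^{\pi}(\rho) \;=\; \mathbb{E}_{s_0 \sim \rho,\, a_t \sim \pi,\, s_{t+1} \sim P}
  \Bigl[ \sum_{t=0}^{\infty} \gamma^{t}\, r_i(s_t, a_t) \Bigr].
\]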
arXiv Detail & Related papers (2023-11-01T00:15:18Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
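Since the local updates are described as vanilla-PPO-style, a minimal sketch of the standard clipped surrogate that each agent would apply to its own local policy looks as follows (batch handling and networks omitted; names are illustrative):

# Hypothetical per-agent PPO-style objective: the standard clipped surrogate
# evaluated on one agent's own sampled probability ratios and advantages.
from typing import List

def clipped_surrogate(ratios: List[float], advantages: List[float], eps: float = 0.2) -> float:
    """Average of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(r, 1.0 + eps))
        total += min(r * adv, clipped * adv)
    return total / len(ratios)

# ratios = pi_new(a|s) / pi_old(a|s) for one agent's sampled actions
print(clipped_surrogate(ratios=[1.1, 0.8, 1.3], advantages=[0.5, -0.2, 1.0]))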
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Distributed-Training-and-Execution Multi-Agent Reinforcement Learning for Power Control in HetNet [48.96004919910818]
We propose a multi-agent deep reinforcement learning (MADRL) based power control scheme for the HetNet.
To promote cooperation among agents, we develop a penalty-based Q learning (PQL) algorithm for MADRL systems.
In this way, an agent's policy can be learned by other agents more easily, resulting in a more efficient collaboration process.
arXiv Detail & Related papers (2022-12-15T17:01:56Z)
- Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning [156.5667417159582]
This paper proposes a novel model of sequential information design, namely the Markov persuasion process (MPP).
Planning in MPPs faces the unique challenge of finding a signaling policy that is simultaneously persuasive to the myopic receivers and induces the optimal long-term cumulative utilities for the sender.
We design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of both optimism and pessimism principles.
arXiv Detail & Related papers (2022-02-22T05:41:43Z)
- Dimension-Free Rates for Natural Policy Gradient in Multi-Agent Reinforcement Learning [22.310861786709538]
We propose a scalable algorithm for cooperative multi-agent reinforcement learning.
We show that our algorithm converges to the globally optimal policy with a dimension-free statistical and computational complexity.
arXiv Detail & Related papers (2021-09-23T23:38:15Z)
- Locality Matters: A Scalable Value Decomposition Approach for Cooperative Multi-Agent Reinforcement Learning [52.7873574425376]
Cooperative multi-agent reinforcement learning (MARL) faces significant scalability issues due to state and action spaces that are exponentially large in the number of agents.
We propose a novel, value-based multi-agent algorithm called LOMAQ, which incorporates local rewards in the Centralized Training Decentralized Execution paradigm.
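As a hedged illustration of the value-decomposition idea (a simple additive decomposition is shown; LOMAQ's actual decomposition over local neighborhoods is more general), the global action value can be approximated by summing per-agent local values, each trainable from locally observed rewards:

# Hypothetical additive value decomposition: global Q = sum of local Q_i,
# where each Q_i depends only on agent i's local observation and action.
from typing import Dict, Tuple

def global_q(local_q: Dict[int, Dict[Tuple[int, int], float]],
             local_obs: Dict[int, int],
             joint_action: Dict[int, int]) -> float:
    """Sum each agent's local Q evaluated on its own observation and action."""
    return sum(q[(local_obs[i], joint_action[i])] for i, q in local_q.items())

# Two agents with tiny tabular local Q-functions over (obs, action) pairs.
local_q = {0: {(0, 0): 1.0, (0, 1): 0.2}, 1: {(1, 0): -0.5, (1, 1): 0.7}}
print(global_q(local_q, local_obs={0: 0, 1: 1}, joint_action={0: 0, 1: 1}))  # 1.7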
arXiv Detail & Related papers (2021-09-22T10:08:15Z)
- Online Sub-Sampling for Reinforcement Learning with General Function Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
arXiv Detail & Related papers (2021-06-14T07:36:25Z)
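One way to picture how only $\operatorname{polylog}(K)$ policy updates can suffice is a doubling rule: re-solve for a policy only when an accumulated information measure has doubled since the last update. The sketch below is a simplification under that assumption (the actual framework measures the information gain of collected data with respect to a function class), not the paper's sub-sampling procedure:

# Simplified lazy-update sketch: recompute the policy only when a running
# "information" counter doubles, giving O(log K) updates over K episodes.
def run(K: int, info_gain_per_episode: float = 1.0) -> int:
    updates, info, info_at_last_update = 0, 0.0, 1.0
    for _ in range(K):
        info += info_gain_per_episode          # stand-in for the measured information gain
        if info >= 2.0 * info_at_last_update:  # doubling test triggers a policy update
            updates += 1
            info_at_last_update = info
    return updates

print(run(K=10_000))  # roughly log2(10_000) ~ 13 policy updates, not 10_000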