Decentralized Multi-Agent Reinforcement Learning: An Off-Policy Method
- URL: http://arxiv.org/abs/2111.00438v1
- Date: Sun, 31 Oct 2021 09:08:46 GMT
- Title: Decentralized Multi-Agent Reinforcement Learning: An Off-Policy Method
- Authors: Kuo Li, Qing-Shan Jia
- Abstract summary: We discuss the problem of decentralized multi-agent reinforcement learning (MARL) in this work.
In our setting, the global state, action, and reward are assumed to be fully observable, while each agent keeps its local policy private, so it cannot be shared with others.
Policy evaluation and policy improvement algorithms are designed for discrete and continuous state-action-space Markov Decision Processes (MDPs), respectively.
- Score: 6.261762915564555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We discuss the problem of decentralized multi-agent reinforcement
learning (MARL) in this work. In our setting, the global state, action, and
reward are assumed to be fully observable, while each agent keeps its local
policy private, so it cannot be shared with others. There is a communication
graph over which the agents can exchange information with their neighbors.
The agents make individual decisions and cooperate to reach a higher
accumulated reward.
Towards this end, we first propose a decentralized actor-critic (AC) setting.
Then, policy evaluation and policy improvement algorithms are designed for
discrete and continuous state-action-space Markov Decision Processes (MDPs),
respectively. Furthermore, a convergence analysis is given for the
discrete-space case, which guarantees that the policy is improved by
alternating between the processes of policy evaluation and policy improvement.
To validate the effectiveness of the algorithms, we design experiments and
compare them with previous algorithms, e.g., Q-learning \cite{watkins1992q} and
MADDPG \cite{lowe2017multi}. The results show that our algorithms perform
better in terms of both learning speed and final performance. Moreover, the
algorithms can be executed in an off-policy manner, which greatly improves
data efficiency compared with on-policy algorithms.
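As a rough illustration of the setting described in the abstract (a hedged sketch, not the authors' code: the class and function names, the consensus averaging of critics, the toy reward, and the replay-based reuse of transitions are all assumptions made for illustration), each agent below keeps a private softmax policy, alternates a TD policy-evaluation step with a policy-improvement step on globally observed transitions, and exchanges only its local critic with its neighbors on a communication graph:

```python
import numpy as np

class DecentralizedACAgent:
    """One agent: a private softmax policy plus a local state-value critic."""

    def __init__(self, n_states, n_actions, lr_critic=0.1, lr_actor=0.05,
                 gamma=0.95, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_actions = n_actions
        self.gamma, self.lr_c, self.lr_a = gamma, lr_critic, lr_actor
        self.V = np.zeros(n_states)                                       # local critic
        self.logits = self.rng.normal(0.0, 0.01, (n_states, n_actions))   # private policy

    def _probs(self, s):
        z = np.exp(self.logits[s] - self.logits[s].max())
        return z / z.sum()

    def act(self, s):
        return int(self.rng.choice(self.n_actions, p=self._probs(s)))

    def update(self, s, a, r, s_next):
        # Policy evaluation: TD step on the local critic ...
        td_error = r + self.gamma * self.V[s_next] - self.V[s]
        self.V[s] += self.lr_c * td_error
        # ... followed by policy improvement: softmax policy-gradient step.
        grad_log = np.eye(self.n_actions)[a] - self._probs(s)
        self.logits[s] += self.lr_a * td_error * grad_log

def consensus_step(agents, neighbors):
    """Average each agent's critic with its neighbors'; policies stay private."""
    averaged = [np.mean([agents[j].V for j in neighbors[i] + [i]], axis=0)
                for i in range(len(agents))]
    for agent, V in zip(agents, averaged):
        agent.V = V

# Toy usage: 3 agents on a 5-state, 2-action problem with a ring communication
# graph. Transitions go into a replay buffer and are re-sampled, so old data is
# reused (the off-policy flavour); there is no real environment here.
n_states, n_actions = 5, 2
agents = [DecentralizedACAgent(n_states, n_actions, seed=i) for i in range(3)]
neighbors = {0: [1, 2], 1: [2, 0], 2: [0, 1]}
rng, replay = np.random.default_rng(42), []
for step in range(300):
    s = int(rng.integers(n_states))
    joint_a = [agent.act(s) for agent in agents]             # individual decisions
    r = float(len(set(joint_a)) == 1)                        # toy cooperative reward
    s_next = int(rng.integers(n_states))                     # dummy dynamics
    replay.append((s, joint_a, r, s_next))
    for s_b, a_b, r_b, s_n in (replay[i] for i in rng.integers(len(replay), size=8)):
        for agent, a_i in zip(agents, a_b):                  # each agent reuses old data
            agent.update(s_b, a_i, r_b, s_n)
    consensus_step(agents, neighbors)
```

A full implementation would use an actual environment and importance corrections for the off-policy actor update; the point here is only the separation between the private policies and the critic information shared over the graph.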
Related papers
- Approximate Linear Programming for Decentralized Policy Iteration in Cooperative Multi-agent Markov Decision Processes [5.842054972839244]
We consider a cooperative multi-agent Markov decision process involving m agents.
In the policy iteration process of the multi-agent setup, the number of joint actions grows exponentially with the number of agents.
We propose approximate decentralized policy iteration algorithms using approximate linear programming with function approximation.
arXiv Detail & Related papers (2023-11-20T14:14:13Z) - Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates; a minimal LCB sketch follows this entry.
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Distributed-Training-and-Execution Multi-Agent Reinforcement Learning
- Distributed-Training-and-Execution Multi-Agent Reinforcement Learning for Power Control in HetNet [48.96004919910818]
We propose a multi-agent deep reinforcement learning (MADRL) based power control scheme for the HetNet.
To promote cooperation among agents, we develop a penalty-based Q learning (PQL) algorithm for MADRL systems.
In this way, an agent's policy can be learned by other agents more easily, resulting in a more efficient collaboration process.
arXiv Detail & Related papers (2022-12-15T17:01:56Z) - Scalable and Sample Efficient Distributed Policy Gradient Algorithms in
Multi-Agent Networked Systems [12.327745531583277]
We name this setting REC-MARL, standing for REward-Coupled Multi-Agent Reinforcement Learning.
REC-MARL has a range of important applications such as real-time access control and distributed power control in wireless networks.
arXiv Detail & Related papers (2022-12-13T03:44:00Z) - Learning Optimal Antenna Tilt Control Policies: A Contextual Linear
Bandit Approach [65.27783264330711]
Controlling antenna tilts in cellular networks is imperative to reach an efficient trade-off between network coverage and capacity.
We devise algorithms learning optimal tilt control policies from existing data.
We show that they can produce an optimal tilt update policy using far fewer data samples than naive or existing rule-based learning algorithms.
arXiv Detail & Related papers (2022-01-06T18:24:30Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Multi-Agent Trust Region Policy Optimization [34.91180300856614]
We show that the policy update of TRPO can be transformed into a distributed consensus optimization problem for multi-agent cases.
We propose a decentralized MARL algorithm, which we call multi-agent TRPO (MATRPO).
arXiv Detail & Related papers (2020-10-15T17:49:47Z) - Implicit Distributional Reinforcement Learning [61.166030238490634]
We propose an implicit distributional actor-critic (IDAC) built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z) - Cooperative Multi-Agent Reinforcement Learning with Partial Observations [16.895704973433382]
We propose a distributed zeroth-order policy optimization method for Multi-Agent Reinforcement Learning (MARL).
It allows the agents to compute the local policy gradients needed to update their local policy functions using local estimates of the global accumulated rewards.
We show that the proposed distributed zeroth-order policy optimization method with constant stepsize converges to the neighborhood of a policy that is a stationary point of the global objective function; a minimal zeroth-order sketch follows this entry.
arXiv Detail & Related papers (2020-06-18T19:36:22Z)