Taming Multi-Agent Reinforcement Learning with Estimator Variance Reduction
- URL: http://arxiv.org/abs/2209.01054v2
- Date: Thu, 22 Jun 2023 14:19:41 GMT
- Title: Taming Multi-Agent Reinforcement Learning with Estimator Variance Reduction
- Authors: Taher Jafferjee, Juliusz Ziomek, Tianpei Yang, Zipeng Dai, Jianhong Wang, Matthew Taylor, Kun Shao, Jun Wang, David Mguni
- Abstract summary: Centralised training with decentralised execution (CT-DE) serves as the foundation of many leading multi-agent reinforcement learning (MARL) algorithms.
It suffers from a critical drawback due to its reliance on learning from a single sample of the joint-action at a given state.
We propose an enhancement tool that accommodates any actor-critic MARL method.
- Score: 12.94372063457462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Centralised training with decentralised execution (CT-DE) serves as the
foundation of many leading multi-agent reinforcement learning (MARL)
algorithms. Despite its popularity, it suffers from a critical drawback due to
its reliance on learning from a single sample of the joint-action at a given
state. As agents explore and update their policies during training, these
single samples may poorly represent the actual joint-policy of the system of
agents, leading to high-variance gradient estimates that hinder learning. To
address this problem, we propose an enhancement tool that accommodates any
actor-critic MARL method. Our framework, Performance Enhancing Reinforcement
Learning Apparatus (PERLA), introduces a sampling technique of the agents'
joint-policy into the critics while the agents train. This leads to TD updates
that closely approximate the true expected value under the current joint-policy
rather than estimates from a single sample of the joint-action at a given
state. This yields precise, low-variance estimates of expected returns, minimising the critic-estimator variance that typically hinders learning. Moreover, as we demonstrate, by eliminating much of the critic
variance from the single sampling of the joint policy, PERLA enables CT-DE
methods to scale more efficiently with the number of agents. Theoretically, we
prove that PERLA reduces the variance of value estimates to a level similar to that of
decentralised training while maintaining the benefits of centralised training.
Empirically, we demonstrate PERLA's superior performance and ability to reduce
estimator variance in a range of benchmarks including Multi-agent Mujoco and the StarCraft II Multi-agent Challenge.
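The mechanism the abstract describes, averaging the centralised critic over joint actions sampled from the agents' current policies rather than bootstrapping from the single executed joint action, can be illustrated with a short sketch. This is a minimal illustration inferred from the abstract alone, not the authors' released code; `critic`, `policies`, `next_obs`, and the sample count `n_samples` are assumed placeholders for a generic actor-critic setup.

```python
import torch

def variance_reduced_td_target(critic, policies, reward, next_state, next_obs,
                               gamma=0.99, n_samples=16):
    """Sketch of a variance-reduced TD target: bootstrap from the expected
    critic value under the agents' current joint policy instead of from the
    single joint action that happened to be executed."""
    with torch.no_grad():
        sampled_values = []
        for _ in range(n_samples):
            # Each agent samples an action from its own decentralised policy.
            joint_action = torch.stack(
                [pi(o).sample() for pi, o in zip(policies, next_obs)], dim=-1
            )
            # The centralised critic scores the sampled joint action.
            sampled_values.append(critic(next_state, joint_action))
        # Monte-Carlo estimate of E_{a ~ pi_joint}[Q(s', a)].
        expected_next_value = torch.stack(sampled_values).mean(dim=0)
    return reward + gamma * expected_next_value
```

Averaging over many sampled joint actions pushes the bootstrap term toward the true expectation under the joint policy, which is the source of the variance reduction claimed above; setting `n_samples=1` recovers the usual single-sample CT-DE target.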
Related papers
- From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process.
We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z)
- Imitation Learning by State-Only Distribution Matching [2.580765958706854]
Imitation Learning from observation describes policy learning in a similar way to human learning.
We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric.
arXiv Detail & Related papers (2022-02-09T08:38:50Z)
- Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO [66.5384483339413]
We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL).
We show that a trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training.
arXiv Detail & Related papers (2022-01-31T20:39:48Z)
- Learning Cooperative Multi-Agent Policies with Partial Reward Decoupling [13.915157044948364]
One of the preeminent obstacles to scaling multi-agent reinforcement learning is assigning credit to individual agents' actions.
In this paper, we address this credit assignment problem with an approach that we call partial reward decoupling (PRD).
PRD decomposes large cooperative multi-agent RL problems into decoupled subproblems involving subsets of agents, thereby simplifying credit assignment.
arXiv Detail & Related papers (2021-12-23T17:48:04Z)
- Evaluating Generalization and Transfer Capacity of Multi-Agent Reinforcement Learning Across Variable Number of Agents [0.0]
Multi-agent Reinforcement Learning (MARL) problems often require cooperation among agents in order to solve a task.
Centralization and decentralization are two approaches used for cooperation in MARL.
We adopt the centralized training with decentralized execution paradigm and investigate the generalization and transfer capacity of the trained models across a variable number of agents.
arXiv Detail & Related papers (2021-11-28T15:29:46Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Local Advantage Actor-Critic for Robust Multi-Agent Deep Reinforcement Learning [19.519440854957633]
We propose a new multi-agent policy gradient method called Robust Local Advantage (ROLA) Actor-Critic.
ROLA allows each agent to learn an individual action-value function as a local critic while ameliorating environment non-stationarity.
We show ROLA's robustness and effectiveness over a number of state-of-the-art multi-agent policy gradient algorithms.
arXiv Detail & Related papers (2021-10-16T19:03:34Z)
- Estimation Error Correction in Deep Reinforcement Learning for Deterministic Actor-Critic Methods [0.0]
In value-based deep reinforcement learning methods, approximation of value functions induces overestimation bias and leads to suboptimal policies.
We show that in deep actor-critic methods that aim to overcome the overestimation bias, if the reinforcement signals received by the agent have a high variance, a significant underestimation bias arises.
To minimize the underestimation, we introduce a parameter-free, novel deep Q-learning variant.
arXiv Detail & Related papers (2021-09-22T13:49:35Z)
- Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot [71.28884625011987]
Melting Pot is a MARL evaluation suite that uses reinforcement learning to reduce the human labor required to create novel test scenarios.
We have created over 80 unique test scenarios covering a broad range of research topics.
We apply these test scenarios to standard MARL training algorithms, and demonstrate how Melting Pot reveals weaknesses not apparent from training performance alone.
arXiv Detail & Related papers (2021-07-14T17:22:14Z)
- Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning [72.28520951105207]
Overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning.
We propose a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline.
We show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2021-03-22T14:18:39Z)
- Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge? [100.48692829396778]
Independent PPO (IPPO) is a form of independent learning in which each agent simply estimates its local value function.
IPPO's strong performance may be due to its robustness to some forms of environment non-stationarity.
arXiv Detail & Related papers (2020-11-18T20:29:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.