Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning
- URL: http://arxiv.org/abs/2302.03439v7
- Date: Thu, 06 Feb 2025 21:31:32 GMT
- Title: Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning
- Authors: Lukas Schäfer, Oliver Slumbers, Stephen McAleer, Yali Du, Stefano V. Albrecht, David Mguni
- Abstract summary: Multi-agent reinforcement learning (MARL) requires agents to explore within a vast joint action space.
EMAX is a framework to seamlessly extend value-based MARL algorithms.
- Score: 18.762198598488066
- Abstract: Multi-agent reinforcement learning (MARL) requires agents to explore within a vast joint action space to find joint actions that lead to coordination. Existing value-based MARL algorithms commonly rely on random exploration, such as $\epsilon$-greedy, to explore the environment, which is neither systematic nor efficient at identifying effective actions in multi-agent problems. Additionally, the concurrent training of the policies of multiple agents can render the optimisation non-stationary. This can lead to unstable value estimates and high-variance gradients, and ultimately hinder coordination between agents. To address these challenges, we propose ensemble value functions for multi-agent exploration (EMAX), a framework that seamlessly extends value-based MARL algorithms. EMAX leverages an ensemble of value functions for each agent to guide their exploration, reduce the variance of their optimisation, and make their policies more robust to miscoordination. EMAX achieves these benefits by: (1) systematically guiding the exploration of agents with a UCB policy towards parts of the environment that require multiple agents to coordinate; (2) computing average value estimates across the ensemble as target values to reduce the variance of gradients and make optimisation more stable; and (3) selecting actions during evaluation by a majority vote across the ensemble to reduce the likelihood of miscoordination. We first instantiate independent DQN with EMAX and evaluate it in 11 general-sum tasks with sparse rewards, showing that EMAX improves final evaluation returns by 185% across all tasks. We then evaluate EMAX on top of IDQN, VDN and QMIX in 21 common-reward tasks, and show that EMAX improves sample efficiency and final evaluation returns across all tasks over all three vanilla algorithms by 60%, 47%, and 538%, respectively.
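The three mechanisms listed in the abstract translate into a small amount of logic on top of an ensemble of per-agent Q-networks. The sketch below is a minimal, hedged illustration of that recipe, not the authors' implementation: the network class `QNet`, the ensemble size `K`, the UCB coefficient `c`, and the omission of target networks are assumptions made only for illustration. The intuition, per the abstract, is that ensemble disagreement is largest in under-explored regions requiring coordination, so the UCB bonus steers agents there.

```python
# Minimal sketch of EMAX-style ensemble exploration on top of independent DQN.
# QNet architecture, K, and c are illustrative assumptions, not paper values.
import torch
import torch.nn as nn


class QNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


class EnsembleAgent:
    def __init__(self, obs_dim: int, n_actions: int, K: int = 5, c: float = 1.0):
        self.members = [QNet(obs_dim, n_actions) for _ in range(K)]
        self.c = c

    def q_values(self, obs: torch.Tensor) -> torch.Tensor:
        # Shape (K, n_actions): one row of action values per ensemble member.
        return torch.stack([m(obs) for m in self.members])

    def act_explore(self, obs: torch.Tensor) -> int:
        # (1) UCB over the ensemble: mean value plus c * disagreement (std).
        q = self.q_values(obs)
        ucb = q.mean(dim=0) + self.c * q.std(dim=0)
        return int(ucb.argmax())

    def act_evaluate(self, obs: torch.Tensor) -> int:
        # (3) Majority vote over each member's greedy action at evaluation time.
        votes = self.q_values(obs).argmax(dim=1)
        return int(torch.mode(votes).values)

    def td_target(self, reward: torch.Tensor, next_obs: torch.Tensor,
                  gamma: float = 0.99) -> torch.Tensor:
        # (2) Ensemble-averaged next-state value as the TD target, which
        # reduces target variance (target networks omitted for brevity).
        with torch.no_grad():
            next_q = self.q_values(next_obs)          # (K, n_actions)
            next_v = next_q.max(dim=1).values.mean()  # average of per-member maxima
        return reward + gamma * next_v
```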
Related papers
- Soft-QMIX: Integrating Maximum Entropy For Monotonic Value Function Factorization [5.54284350152423]
We propose an enhancement to QMIX by incorporating an additional local Q-value learning method within the maximum entropy RL framework.
Our approach constrains the local Q-value estimates to maintain the correct ordering of all actions.
We theoretically prove the monotonic improvement and convergence of our method to an optimal solution.
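For context, the maximum-entropy ingredient referenced in this entry is commonly realised by deriving a softmax (Boltzmann) policy and a soft state value from local Q-values. The snippet below is a generic illustration of that standard construction only; it is not Soft-QMIX itself, the ordering constraint is not reproduced, and the temperature `alpha` is an assumed hyperparameter.

```python
# Generic maximum-entropy utilities over local (per-agent) Q-values.
import torch


def soft_policy(q_local: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Boltzmann policy: pi(a) proportional to exp(Q(a) / alpha)."""
    return torch.softmax(q_local / alpha, dim=-1)


def soft_value(q_local: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Soft state value: V = alpha * logsumexp(Q / alpha)."""
    return alpha * torch.logsumexp(q_local / alpha, dim=-1)
```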
arXiv Detail & Related papers (2024-06-20T01:55:08Z)
- Robust Multi-Agent Control via Maximum Entropy Heterogeneous-Agent Reinforcement Learning [65.60470000696944]
We propose a unified framework for learning stochastic policies to resolve issues in multi-agent reinforcement learning.
Based on the MaxEnt framework, we propose the Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm.
We evaluate HASAC on seven benchmarks: Bi-DexHands, Multi-Agent MuJoCo, Pursuit-Evade, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game.
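As a point of reference, maximum-entropy (soft actor-critic style) methods augment the return with a policy-entropy bonus. A generic common-reward form of that objective is shown below; the temperature $\alpha$ and the per-agent entropy terms are standard MaxEnt RL notation assumed here, not details taken from the HASAC paper.

```latex
J(\pi_1,\dots,\pi_n) \;=\; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\Big(r_t \;+\; \alpha \sum_{i=1}^{n} \mathcal{H}\big(\pi_i(\cdot \mid o^i_t)\big)\Big)\Big]
```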
arXiv Detail & Related papers (2023-06-19T06:22:02Z)
- Efficient Algorithms for Extreme Bandits [20.68824391770909]
We contribute to the Extreme Bandit problem, a variant of Multi-Armed Bandits in which the learner seeks to collect the largest possible reward.
We first study the concentration of the maximum of i.i.d. random variables under mild assumptions on the tails of the reward distributions.
We then propose and analyze a more adaptive, anytime algorithm, QoMax-SDA, which combines QoMax with a subsampling method recently introduced by Baudry et al.
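In the extreme-bandit setting the learner cares about the largest observation rather than the mean, so arms are compared via estimators of their maxima. The snippet below is a simplified QoMax-style statistic sketched from the description above; the batch layout, quantile level, and example distributions are illustrative assumptions, not the paper's tuned values.

```python
# Simplified QoMax-style arm statistic: split an arm's samples into batches,
# take each batch maximum, and score the arm by a quantile of those maxima
# (more robust than using the single overall maximum).
import numpy as np


def qomax_score(samples: np.ndarray, n_batches: int = 8, q: float = 0.5) -> float:
    batches = np.array_split(samples, n_batches)
    batch_maxima = np.array([b.max() for b in batches if len(b) > 0])
    return float(np.quantile(batch_maxima, q))


# Example usage: pick the arm whose batch-maxima quantile is largest.
rng = np.random.default_rng(0)
arms = [rng.pareto(3.0, size=200), rng.pareto(2.0, size=200)]
best_arm = max(range(len(arms)), key=lambda k: qomax_score(arms[k]))
```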
arXiv Detail & Related papers (2022-03-21T11:09:34Z)
- Min-Max Bilevel Multi-objective Optimization with Applications in Machine Learning [30.25074797092709]
This paper is the first to propose a min-max bilevel multi-objective optimization framework.
It highlights applications in representation learning and hyper-objective learning.
arXiv Detail & Related papers (2022-03-03T18:56:13Z)
- MMD-MIX: Value Function Factorisation with Maximum Mean Discrepancy for Cooperative Multi-Agent Reinforcement Learning [15.972363414919279]
MMD-MIX is a method that combines distributional reinforcement learning and value decomposition.
The experiments demonstrate that MMD-MIX outperforms prior baselines in the StarCraft Multi-Agent Challenge (SMAC) environment.
arXiv Detail & Related papers (2021-06-22T10:21:00Z)
- Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning [72.28520951105207]
Overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning.
We propose a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline.
We show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
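One hedged reading of "penalizing large joint action-values deviating from a baseline" is an extra quadratic penalty added to the usual TD loss. The sketch below shows that generic construction only; the softmax-weighted baseline, the weight `lam`, and the temperature `tau` are assumptions for illustration, not the paper's exact operator.

```python
# Generic regularized TD update: standard squared TD error plus a penalty that
# pulls the chosen joint action-value towards a softmax-weighted baseline.
import torch


def regularized_td_loss(q_joint: torch.Tensor, action: int, target: torch.Tensor,
                        lam: float = 0.1, tau: float = 1.0) -> torch.Tensor:
    # q_joint: (n_joint_actions,) estimated joint action-values for one state.
    td_error = q_joint[action] - target
    baseline = (torch.softmax(q_joint / tau, dim=-1) * q_joint).sum().detach()
    penalty = (q_joint[action] - baseline) ** 2
    return td_error ** 2 + lam * penalty
```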
arXiv Detail & Related papers (2021-03-22T14:18:39Z)
- UneVEn: Universal Value Exploration for Multi-Agent Reinforcement Learning [53.73686229912562]
We propose a novel MARL approach called Universal Value Exploration (UneVEn).
UneVEn learns a set of related tasks simultaneously with a linear decomposition of universal successor features.
Empirical results on a set of exploration games, challenging cooperative predator-prey tasks requiring significant coordination among agents, and StarCraft II micromanagement benchmarks show that UneVEn can solve tasks where other state-of-the-art MARL methods fail.
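The "linear decomposition of universal successor features" mentioned above refers to the standard successor-feature identity, under which action values for any task weight vector follow from a shared feature predictor. The equations below restate that textbook identity as context; the notation ($\phi$ for reward features, $\psi$ for successor features, $w$ for task weights) is assumed here and this is not UneVEn's training procedure.

```latex
r_w(s,a) = \phi(s,a)^{\top} w, \qquad
\psi^{\pi}(s,a) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_t,a_t) \;\Big|\; s_0 = s,\, a_0 = a\Big], \qquad
Q^{\pi}_{w}(s,a) = \psi^{\pi}(s,a)^{\top} w
```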
arXiv Detail & Related papers (2020-10-06T19:08:47Z)
- Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning [66.94149388181343]
We present a new version of a popular $Q$-learning algorithm for MARL.
We show that it can recover the optimal policy even with access to $Q^*$.
We also demonstrate improved performance on predator-prey and challenging multi-agent StarCraft benchmark tasks.
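For readers unfamiliar with this line of work, Weighted QMIX replaces the uniform regression loss of monotonic value factorisation with a per-sample weighting. The sketch below shows only that generic shape; the example weighting function is an illustrative assumption (roughly in the spirit of down-weighting over-estimated samples), and the paper's actual weighting schemes differ in detail.

```python
# Generic weighted-projection loss for monotonic value factorisation:
# each sample's squared TD error is scaled by a weight w(s, u).
import torch


def weighted_projection_loss(q_tot: torch.Tensor, targets: torch.Tensor,
                             weights: torch.Tensor) -> torch.Tensor:
    # q_tot, targets, weights: tensors of shape (batch,)
    return (weights * (q_tot - targets) ** 2).mean()


def example_weights(q_tot: torch.Tensor, q_star: torch.Tensor,
                    alpha: float = 0.1) -> torch.Tensor:
    # Illustrative weighting: full weight where the factored value
    # underestimates an unrestricted estimate q_star, reduced weight otherwise.
    return torch.where(q_tot < q_star, torch.ones_like(q_tot),
                       torch.full_like(q_tot, alpha))
```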
arXiv Detail & Related papers (2020-06-18T18:34:50Z)
- Dynamic Multi-Robot Task Allocation under Uncertainty and Temporal Constraints [52.58352707495122]
We present a multi-robot allocation algorithm that decouples the key computational challenges of sequential decision-making under uncertainty and multi-agent coordination.
We validate our results over a wide range of simulations on two distinct domains: multi-arm conveyor belt pick-and-place and multi-drone delivery dispatch in a city.
arXiv Detail & Related papers (2020-05-27T01:10:41Z)
- FACMAC: Factored Multi-Agent Centralised Policy Gradients [103.30380537282517]
We propose FACtored Multi-Agent Centralised policy gradients (FACMAC), a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces.
We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2020-03-14T21:29:09Z)