Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning
- URL: http://arxiv.org/abs/2302.03439v6
- Date: Tue, 16 Apr 2024 16:13:00 GMT
- Title: Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning
- Authors: Lukas Schäfer, Oliver Slumbers, Stephen McAleer, Yali Du, Stefano V. Albrecht, David Mguni
- Abstract summary: EMAX is a framework to seamlessly extend value-based MARL algorithms with ensembles of value functions.
EMAX uses the uncertainty of value estimates across the ensemble in a UCB policy to guide the exploration.
During optimisation, EMAX computes target values as average value estimates across the ensemble.
During evaluation, EMAX selects actions following a majority vote across the ensemble, which reduces the likelihood of selecting sub-optimal actions.
- Score: 18.762198598488066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing value-based algorithms for cooperative multi-agent reinforcement learning (MARL) commonly rely on random exploration, such as $\epsilon$-greedy, to explore the environment. However, such exploration is inefficient at finding effective joint actions in states that require cooperation of multiple agents. In this work, we propose ensemble value functions for multi-agent exploration (EMAX), a general framework to seamlessly extend value-based MARL algorithms with ensembles of value functions. EMAX leverages the ensemble of value functions to guide the exploration of agents, stabilises their optimisation, and makes their policies more robust to miscoordination. These benefits are achieved by using a combination of three techniques. (1) EMAX uses the uncertainty of value estimates across the ensemble in a UCB policy to guide the exploration. This exploration policy focuses on parts of the environment which require cooperation across agents and, thus, enables agents to more efficiently learn how to cooperate. (2) During the optimisation, EMAX computes target values as average value estimates across the ensemble. These targets exhibit lower variance compared to commonly applied target networks, leading to significant benefits in MARL which commonly suffers from high variance caused by the exploration and non-stationary policies of other agents. (3) During evaluation, EMAX selects actions following a majority vote across the ensemble, which reduces the likelihood of selecting sub-optimal actions. We instantiate three value-based MARL algorithms with EMAX, independent DQN, VDN and QMIX, and evaluate them in 21 tasks across four environments. Using ensembles of five value functions, EMAX improves sample efficiency and final evaluation returns of these algorithms by 60%, 47%, and 539%, respectively, averaged across 21 tasks.
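The three techniques named in the abstract can be illustrated for a single agent in a single state. The following is a minimal sketch, not the authors' implementation: it assumes the ensemble's estimates are available as a NumPy array of shape (K, A) for K value functions and A discrete actions, and the function names, the `beta` uncertainty weight, the use of the standard deviation as the uncertainty measure, and the exact form of the one-step target are all illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def ucb_action(q_ensemble: np.ndarray, beta: float = 1.0) -> int:
    """Exploration: pick the action with the highest upper confidence bound.

    q_ensemble has shape (K, A). The bound is the ensemble mean plus
    beta times the ensemble standard deviation (beta is a hypothetical
    hyperparameter weighting the uncertainty bonus).
    """
    mean = q_ensemble.mean(axis=0)
    std = q_ensemble.std(axis=0)
    return int(np.argmax(mean + beta * std))


def ensemble_target(reward: float, gamma: float,
                    next_q_ensemble: np.ndarray, done: bool) -> float:
    """Optimisation: one-step target built from ensemble-averaged values.

    Assumed form: reward + gamma * max_a mean_k Q_k(s', a), with the
    bootstrap term dropped at terminal states.
    """
    mean_next = next_q_ensemble.mean(axis=0)
    bootstrap = 0.0 if done else float(mean_next.max())
    return reward + gamma * bootstrap


def majority_vote_action(q_ensemble: np.ndarray) -> int:
    """Evaluation: each ensemble member votes for its own greedy action.

    Ties are broken in favour of the lowest action index (an assumption).
    """
    greedy = q_ensemble.argmax(axis=1)
    votes = np.bincount(greedy, minlength=q_ensemble.shape[1])
    return int(np.argmax(votes))


# Five value functions, three actions; one member is optimistic about action 2.
q = np.array([
    [1.0, 0.5, 0.0],
    [1.0, 0.5, 3.0],
    [1.0, 0.6, 0.0],
    [1.0, 0.4, 0.0],
    [1.0, 0.5, 0.0],
])
print(ucb_action(q, beta=1.0))        # 2: disagreement earns an exploration bonus
print(majority_vote_action(q))        # 0: four of five members prefer action 0
```

In this toy example the members disagree strongly about action 2, so the UCB rule explores it, while the majority vote used at evaluation time ignores the single outlier and selects action 0.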
Related papers
- Mixed Q-Functionals: Advancing Value-Based Methods in Cooperative MARL
with Continuous Action Domains [0.0]
We propose a novel multi-agent value-based algorithm, Mixed Q-Functionals (MQF), inspired by the idea of Q-Functionals.
Our algorithm fosters collaboration among agents by mixing their action-values.
Our empirical findings reveal that MQF outperforms four variants of Deep Deterministic Policy Gradient.
arXiv Detail & Related papers (2024-02-12T16:21:50Z) - Maximize to Explore: One Objective Function Fusing Estimation, Planning,
and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while automatically balancing exploration and exploitation.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z) - Mastering the exploration-exploitation trade-off in Bayesian
Optimization [0.2538209532048867]
The acquisition function drives the choice of the next solution to evaluate, balancing between exploration and exploitation.
This paper proposes a novel acquisition function, mastering the trade-off between explorative and exploitative choices, adaptively.
arXiv Detail & Related papers (2023-05-15T13:19:03Z) - Conditionally Optimistic Exploration for Cooperative Deep Multi-Agent
Reinforcement Learning [24.05715475457959]
Efficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL).
In this work, we propose an exploration method that effectively encourages cooperative exploration based on the idea of sequential action-computation.
arXiv Detail & Related papers (2023-03-16T02:05:16Z) - Algorithmic Foundations of Empirical X-risk Minimization [51.58884973792057]
This manuscript introduces a new optimization framework for machine learning and AI, named empirical X-risk minimization (EXM).
X-risk is a term introduced to represent a family of compositional measures or objectives.
arXiv Detail & Related papers (2022-06-01T12:22:56Z) - Softmax with Regularization: Better Value Estimation in Multi-Agent
Reinforcement Learning [72.28520951105207]
Overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning.
We propose a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline.
We show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2021-03-22T14:18:39Z) - Modeling the Interaction between Agents in Cooperative Multi-Agent
Reinforcement Learning [2.9360071145551068]
We propose a novel cooperative MARL algorithm named interactive actor-critic (IAC).
IAC models the interaction of agents from perspectives of policy and value function.
We extend the value decomposition methods to continuous control tasks and evaluate IAC on benchmark tasks including classic control and multi-agent particle environments.
arXiv Detail & Related papers (2021-02-10T01:58:28Z) - Attention Actor-Critic algorithm for Multi-Agent Constrained
Co-operative Reinforcement Learning [3.296127938396392]
We consider the problem of computing optimal actions for Reinforcement Learning (RL) agents in a co-operative setting.
We extend this algorithm to the constrained multi-agent RL setting.
arXiv Detail & Related papers (2021-01-07T03:21:15Z) - UneVEn: Universal Value Exploration for Multi-Agent Reinforcement
Learning [53.73686229912562]
We propose a novel MARL approach called Universal Value Exploration (UneVEn).
UneVEn learns a set of related tasks simultaneously with a linear decomposition of universal successor features.
Empirical results on a set of exploration games, challenging cooperative predator-prey tasks requiring significant coordination among agents, and StarCraft II micromanagement benchmarks show that UneVEn can solve tasks where other state-of-the-art MARL methods fail.
arXiv Detail & Related papers (2020-10-06T19:08:47Z) - Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
arXiv Detail & Related papers (2020-06-11T16:49:23Z) - FACMAC: Factored Multi-Agent Centralised Policy Gradients [103.30380537282517]
We propose FACtored Multi-Agent Centralised policy gradients (FACMAC).
It is a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces.
We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2020-03-14T21:29:09Z)