Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning
- URL: http://arxiv.org/abs/2302.03439v6
- Date: Tue, 16 Apr 2024 16:13:00 GMT
- Title: Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning
- Authors: Lukas Schäfer, Oliver Slumbers, Stephen McAleer, Yali Du, Stefano V. Albrecht, David Mguni
- Abstract summary: EMAX is a framework to seamlessly extend value-based MARL algorithms with ensembles of value functions.
EMAX uses the uncertainty of value estimates across the ensemble in a UCB policy to guide the exploration.
During optimisation, EMAX computes target values as average value estimates across the ensemble.
During evaluation, EMAX selects actions following a majority vote across the ensemble, which reduces the likelihood of selecting sub-optimal actions.
- Score: 18.762198598488066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing value-based algorithms for cooperative multi-agent reinforcement learning (MARL) commonly rely on random exploration, such as $\epsilon$-greedy, to explore the environment. However, such exploration is inefficient at finding effective joint actions in states that require cooperation of multiple agents. In this work, we propose ensemble value functions for multi-agent exploration (EMAX), a general framework to seamlessly extend value-based MARL algorithms with ensembles of value functions. EMAX leverages the ensemble of value functions to guide the exploration of agents, stabilises their optimisation, and makes their policies more robust to miscoordination. These benefits are achieved by using a combination of three techniques. (1) EMAX uses the uncertainty of value estimates across the ensemble in a UCB policy to guide the exploration. This exploration policy focuses on parts of the environment which require cooperation across agents and, thus, enables agents to more efficiently learn how to cooperate. (2) During the optimisation, EMAX computes target values as average value estimates across the ensemble. These targets exhibit lower variance compared to commonly applied target networks, leading to significant benefits in MARL which commonly suffers from high variance caused by the exploration and non-stationary policies of other agents. (3) During evaluation, EMAX selects actions following a majority vote across the ensemble, which reduces the likelihood of selecting sub-optimal actions. We instantiate three value-based MARL algorithms with EMAX, independent DQN, VDN and QMIX, and evaluate them in 21 tasks across four environments. Using ensembles of five value functions, EMAX improves sample efficiency and final evaluation returns of these algorithms by 60%, 47%, and 539%, respectively, averaged across 21 tasks.
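To make the three techniques concrete, the sketch below shows how they could look for a single agent holding an ensemble of K value functions. Everything here is an illustrative assumption rather than the authors' implementation: the class and parameter names (`EnsembleAgent`, `q_values`, `ucb_coef`) are hypothetical, the exploration bonus is written as mean plus scaled standard deviation, and the abstract does not specify whether the per-member maximisation happens before or after the averaging in the target.

```python
import numpy as np

class EnsembleAgent:
    """Minimal sketch of the three EMAX mechanisms described in the abstract.

    q_values(obs) is assumed to return an array of shape (K, num_actions),
    one row of action-value estimates per ensemble member.
    """

    def __init__(self, q_values, ucb_coef=1.0):
        self.q_values = q_values  # callable: obs -> (K, num_actions) array
        self.ucb_coef = ucb_coef  # weight on ensemble disagreement (assumed)

    def explore_action(self, obs):
        # (1) UCB exploration: mean estimate plus a bonus proportional to the
        # disagreement (standard deviation) across ensemble members.
        q = self.q_values(obs)                               # (K, A)
        ucb = q.mean(axis=0) + self.ucb_coef * q.std(axis=0)
        return int(np.argmax(ucb))

    def target_value(self, next_obs):
        # (2) Optimisation: bootstrap target taken as the average of the
        # per-member greedy values, a lower-variance alternative to a
        # single target network.
        q_next = self.q_values(next_obs)                     # (K, A)
        return float(q_next.max(axis=1).mean())

    def evaluate_action(self, obs):
        # (3) Evaluation: majority vote over each member's greedy action.
        q = self.q_values(obs)                               # (K, A)
        greedy = q.argmax(axis=1)                            # (K,)
        votes = np.bincount(greedy, minlength=q.shape[1])
        return int(np.argmax(votes))
```

In a full training loop, explore_action would drive data collection, target_value would replace the usual target-network bootstrap in the TD update, and evaluate_action would only be used at test time; the algorithm-specific details (independent DQN, VDN, or QMIX mixing) are omitted from this sketch.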
Related papers
- Policy Optimization and Multi-agent Reinforcement Learning for Mean-variance Team Stochastic Games [1.430310470698995]
We study a long-run mean-variance team stochastic game (MV-TSG).
MV-TSG has two main challenges. First, the variance metric is neither additive nor Markovian in a dynamic setting.
We propose a Mean-Variance Multi-Agent Policy Iteration (MV-MAPI) algorithm with a sequential update scheme.
We derive specific conditions for stationary points to be Nash equilibria, and further, strict local optima.
arXiv Detail & Related papers (2025-03-28T16:21:05Z) - Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align large language models (LLMs).
Controlled Decoding provides a mechanism for aligning a model at inference time without retraining.
We propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z) - Multi-Agent Inverse Q-Learning from Demonstrations [3.4136908117644698]
Multi-Agent Marginal Q-Learning from Demonstrations (MAMQL) is a novel sample-efficient framework for multi-agent inverse reinforcement learning (IRL).
We show MAMQL significantly outperforms previous multi-agent methods in average reward, sample efficiency, and reward recovery, often by more than 2-5x.
arXiv Detail & Related papers (2025-03-06T18:22:29Z) - Soft-QMIX: Integrating Maximum Entropy For Monotonic Value Function Factorization [5.54284350152423]
We propose an enhancement to QMIX by incorporating an additional local Q-value learning method within the maximum entropy RL framework.
Our approach constrains the local Q-value estimates to maintain the correct ordering of all actions.
We theoretically prove the monotonic improvement and convergence of our method to an optimal solution.
arXiv Detail & Related papers (2024-06-20T01:55:08Z) - Mixed Q-Functionals: Advancing Value-Based Methods in Cooperative MARL with Continuous Action Domains [0.0]
We propose a novel multi-agent value-based algorithm, Mixed Q-Functionals (MQF), inspired by the idea of Q-Functionals.
Our algorithm fosters collaboration among agents by mixing their action-values.
Our empirical findings reveal that MQF outperforms four variants of Deep Deterministic Policy Gradient.
arXiv Detail & Related papers (2024-02-12T16:21:50Z) - Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while automatically balancing exploration and exploitation.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z) - Mastering the exploration-exploitation trade-off in Bayesian Optimization [0.2538209532048867]
The acquisition function drives the choice of the next solution to evaluate, balancing between exploration and exploitation.
This paper proposes a novel acquisition function that adaptively masters the trade-off between explorative and exploitative choices.
arXiv Detail & Related papers (2023-05-15T13:19:03Z) - Conditionally Optimistic Exploration for Cooperative Deep Multi-Agent Reinforcement Learning [24.05715475457959]
Efficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL).
In this work, we propose an exploration method that effectively encourages cooperative exploration based on the idea of sequential action-computation.
arXiv Detail & Related papers (2023-03-16T02:05:16Z) - Algorithmic Foundations of Empirical X-risk Minimization [51.58884973792057]
This manuscript introduces a new optimization framework for machine learning and AI, named empirical X-risk minimization (EXM).
X-risk is a term introduced to represent a family of compositional measures or objectives.
arXiv Detail & Related papers (2022-06-01T12:22:56Z) - Efficient Algorithms for Extreme Bandits [20.68824391770909]
We contribute to the Extreme Bandit problem, a variant of Multi-Armed Bandits in which the learner seeks to collect the largest possible reward.
We first study the concentration of the maximum of i.i.d. random variables under mild assumptions on the tail of the reward distributions.
We then propose and analyze a more adaptive, anytime algorithm, QoMax-SDA, which combines QoMax with a subsampling method recently introduced by Baudry et al.
arXiv Detail & Related papers (2022-03-21T11:09:34Z) - Min-Max Bilevel Multi-objective Optimization with Applications in Machine Learning [30.25074797092709]
This paper is the first to propose a min-max bilevel multi-objective optimization framework.
It highlights applications in representation learning and hyper-objective learning.
arXiv Detail & Related papers (2022-03-03T18:56:13Z) - Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning [72.28520951105207]
Overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning.
We propose a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline.
We show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2021-03-22T14:18:39Z) - Modeling the Interaction between Agents in Cooperative Multi-Agent Reinforcement Learning [2.9360071145551068]
We propose a novel cooperative MARL algorithm named interactive actor-critic (IAC).
IAC models the interaction of agents from perspectives of policy and value function.
We extend the value decomposition methods to continuous control tasks and evaluate IAC on benchmark tasks including classic control and multi-agent particle environments.
arXiv Detail & Related papers (2021-02-10T01:58:28Z) - Attention Actor-Critic algorithm for Multi-Agent Constrained Co-operative Reinforcement Learning [3.296127938396392]
We consider the problem of computing optimal actions for Reinforcement Learning (RL) agents in a co-operative setting.
We extend the attention actor-critic algorithm to the constrained multi-agent RL setting.
arXiv Detail & Related papers (2021-01-07T03:21:15Z) - UneVEn: Universal Value Exploration for Multi-Agent Reinforcement Learning [53.73686229912562]
We propose a novel MARL approach called Universal Value Exploration (UneVEn).
UneVEn learns a set of related tasks simultaneously with a linear decomposition of universal successor features.
Empirical results on a set of exploration games, challenging cooperative predator-prey tasks requiring significant coordination among agents, and StarCraft II micromanagement benchmarks show that UneVEn can solve tasks where other state-of-the-art MARL methods fail.
arXiv Detail & Related papers (2020-10-06T19:08:47Z) - Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning [66.94149388181343]
We present a new version of a popular $Q$-learning algorithm for MARL.
We show that it can recover the optimal policy even with access to $Q^*$.
We also demonstrate improved performance on predator-prey and challenging multi-agent StarCraft benchmark tasks.
arXiv Detail & Related papers (2020-06-18T18:34:50Z) - Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
arXiv Detail & Related papers (2020-06-11T16:49:23Z) - Dynamic Multi-Robot Task Allocation under Uncertainty and Temporal Constraints [52.58352707495122]
We present a multi-robot allocation algorithm that decouples the key computational challenges of sequential decision-making under uncertainty and multi-agent coordination.
We validate our results over a wide range of simulations on two distinct domains: multi-arm conveyor belt pick-and-place and multi-drone delivery dispatch in a city.
arXiv Detail & Related papers (2020-05-27T01:10:41Z) - FACMAC: Factored Multi-Agent Centralised Policy Gradients [103.30380537282517]
We propose FACtored Multi-Agent Centralised policy gradients (FACMAC).
It is a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces.
We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2020-03-14T21:29:09Z)