Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning
- URL: http://arxiv.org/abs/2302.03439v7
- Date: Thu, 06 Feb 2025 21:31:32 GMT
- Title: Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning
- Authors: Lukas Schäfer, Oliver Slumbers, Stephen McAleer, Yali Du, Stefano V. Albrecht, David Mguni
- Abstract summary: Multi-agent reinforcement learning (MARL) requires agents to explore within a vast joint action space. EMAX is a framework to seamlessly extend value-based MARL algorithms.
- Score: 18.762198598488066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-agent reinforcement learning (MARL) requires agents to explore within a vast joint action space to find joint actions that lead to coordination. Existing value-based MARL algorithms commonly rely on random exploration, such as $\epsilon$-greedy, which is neither systematic nor efficient at identifying effective actions in multi-agent problems. Additionally, the concurrent training of the policies of multiple agents can render the optimisation non-stationary, leading to unstable value estimates, high-variance gradients, and ultimately hindered coordination between agents. To address these challenges, we propose ensemble value functions for multi-agent exploration (EMAX), a framework that seamlessly extends value-based MARL algorithms. EMAX leverages an ensemble of value functions for each agent to guide their exploration, reduce the variance of their optimisation, and make their policies more robust to miscoordination. EMAX achieves these benefits by (1) systematically guiding the exploration of agents with a UCB policy towards parts of the environment that require multiple agents to coordinate, (2) computing average value estimates across the ensemble as target values to reduce the variance of gradients and stabilise optimisation, and (3) selecting actions during evaluation by a majority vote across the ensemble to reduce the likelihood of miscoordination. We first instantiate independent DQN (IDQN) with EMAX and evaluate it in 11 general-sum tasks with sparse rewards, showing that EMAX improves final evaluation returns by 185% across all tasks. We then evaluate EMAX on top of IDQN, VDN and QMIX in 21 common-reward tasks, and show that EMAX improves sample efficiency and final evaluation returns across all tasks over all three vanilla algorithms by 60%, 47%, and 538%, respectively.
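The three mechanisms described in the abstract map naturally onto a small amount of code. The sketch below is a minimal, illustrative rendering, not the paper's implementation: the class and function names (QEnsemble, ucb_action, ensemble_target, majority_vote_action), the ensemble size, and the exact UCB form (mean plus a scaled standard deviation across members) are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class QEnsemble(nn.Module):
    """Ensemble of K independent Q-networks for a single agent (hypothetical)."""

    def __init__(self, obs_dim: int, n_actions: int, k: int = 5, hidden: int = 64):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))
            for _ in range(k)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Shape (K, n_actions): one value estimate per ensemble member.
        return torch.stack([q(obs) for q in self.members])


def ucb_action(ensemble: QEnsemble, obs: torch.Tensor, beta: float = 1.0) -> int:
    # (1) Exploration: score actions by the ensemble mean plus a scaled
    # standard deviation, so member disagreement (uncertainty) directs
    # exploration towards under-explored, coordination-heavy states.
    values = ensemble(obs)                       # (K, n_actions)
    scores = values.mean(dim=0) + beta * values.std(dim=0)
    return int(scores.argmax())


def ensemble_target(ensemble: QEnsemble, next_obs: torch.Tensor,
                    reward: float, gamma: float = 0.99) -> torch.Tensor:
    # (2) Optimisation: bootstrap from the ensemble-average value estimate,
    # which has lower variance than any single member's estimate.
    with torch.no_grad():
        mean_q = ensemble(next_obs).mean(dim=0)  # (n_actions,)
        return reward + gamma * mean_q.max()


def majority_vote_action(ensemble: QEnsemble, obs: torch.Tensor) -> int:
    # (3) Evaluation: each member votes for its greedy action and the most
    # common vote is executed, reducing the risk of miscoordination.
    votes = ensemble(obs).argmax(dim=-1)         # (K,)
    return int(votes.mode().values)
```

In a full agent, ucb_action would drive behaviour during training, ensemble_target would replace the usual single-network bootstrap target, and majority_vote_action would be used at evaluation time.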
Related papers
- Policy Optimization and Multi-agent Reinforcement Learning for Mean-variance Team Stochastic Games [1.430310470698995]
We study a long-run mean-variance team stochastic game (MV-TSG).
MV-TSG has two main challenges. First, the variance metric is neither additive nor Markovian in a dynamic setting.
We propose a Mean-Variance Multi-Agent Policy Iteration (MV-MAPI) algorithm with a sequential update scheme.
We derive specific conditions for stationary points to be Nash equilibria, and further, strict local optima.
arXiv Detail & Related papers (2025-03-28T16:21:05Z) - Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align large language models (LLMs).
Controlled Decoding provides a mechanism for aligning a model at inference time without retraining.
We propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z) - Multi-Agent Inverse Q-Learning from Demonstrations [3.4136908117644698]
Multi-Agent Marginal Q-Learning from Demonstrations (MAMQL) is a novel sample-efficient framework for multi-agent IRL.
We show MAMQL significantly outperforms previous multi-agent methods in average reward, sample efficiency, and reward recovery, often by more than 2-5x.
arXiv Detail & Related papers (2025-03-06T18:22:29Z) - Soft-QMIX: Integrating Maximum Entropy For Monotonic Value Function Factorization [5.54284350152423]
We propose an enhancement to QMIX by incorporating an additional local Q-value learning method within the maximum entropy RL framework.
Our approach constrains the local Q-value estimates to maintain the correct ordering of all actions.
We theoretically prove the monotonic improvement and convergence of our method to an optimal solution.
arXiv Detail & Related papers (2024-06-20T01:55:08Z) - Mixed Q-Functionals: Advancing Value-Based Methods in Cooperative MARL with Continuous Action Domains [0.0]
We propose a novel multi-agent value-based algorithm, Mixed Q-Functionals (MQF), inspired by the idea of Q-Functionals.
Our algorithm fosters collaboration among agents by mixing their action-values.
Our empirical findings reveal that MQF outperforms four variants of Deep Deterministic Policy Gradient.
arXiv Detail & Related papers (2024-02-12T16:21:50Z) - Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while balancing exploration and exploitation automatically.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z) - Mastering the exploration-exploitation trade-off in Bayesian Optimization [0.2538209532048867]
The acquisition function drives the choice of the next solution to evaluate, balancing between exploration and exploitation.
This paper proposes a novel acquisition function that adaptively masters the trade-off between explorative and exploitative choices.
arXiv Detail & Related papers (2023-05-15T13:19:03Z) - Conditionally Optimistic Exploration for Cooperative Deep Multi-Agent Reinforcement Learning [24.05715475457959]
Efficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL).
In this work, we propose an exploration method that effectively encourages cooperative exploration based on the idea of sequential action-computation.
arXiv Detail & Related papers (2023-03-16T02:05:16Z) - Algorithmic Foundations of Empirical X-risk Minimization [51.58884973792057]
This manuscript introduces a new optimization framework for machine learning and AI, named empirical X-risk minimization (EXM).
X-risk is a term introduced to represent a family of compositional measures or objectives.
arXiv Detail & Related papers (2022-06-01T12:22:56Z) - Efficient Algorithms for Extreme Bandits [20.68824391770909]
We contribute to the Extreme Bandit problem, a variant of Multi-Armed Bandits in which the learner seeks to collect the largest possible reward.
We first study the concentration of the maximum of i.i.d. random variables under mild assumptions on the tails of the reward distributions.
We then propose and analyze a more adaptive, anytime algorithm, QoMax-SDA, which combines QoMax with a subsampling method recently introduced by Baudry et al.
arXiv Detail & Related papers (2022-03-21T11:09:34Z) - Min-Max Bilevel Multi-objective Optimization with Applications in Machine Learning [30.25074797092709]
This paper is the first to propose a min-max bilevel multi-objective optimization framework.
It highlights applications in representation learning and hyper-objective learning.
arXiv Detail & Related papers (2022-03-03T18:56:13Z) - Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning [72.28520951105207]
Overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning.
We propose a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline.
We show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
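As a rough sketch of the idea in the snippet above, and not the paper's exact objective: the penalty form, the softmax-weighted baseline, and the lambda_reg weight below are assumptions. A regularization-based update of this flavour adds a term that discourages joint action-values from drifting far from a softer baseline than the max operator.

```python
import torch


def regularized_td_loss(q_joint: torch.Tensor, td_target: torch.Tensor,
                        q_all_actions: torch.Tensor, tau: float = 1.0,
                        lambda_reg: float = 0.1) -> torch.Tensor:
    # Baseline: softmax-weighted average over joint action-values, a softer
    # alternative to the max operator that drives overestimation.
    weights = torch.softmax(q_all_actions / tau, dim=-1)
    baseline = (weights * q_all_actions).sum(dim=-1)
    # Standard TD regression term.
    td_loss = (q_joint - td_target).pow(2).mean()
    # Penalize large joint action-values deviating from the baseline.
    reg = (q_joint - baseline).pow(2).mean()
    return td_loss + lambda_reg * reg
```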
arXiv Detail & Related papers (2021-03-22T14:18:39Z) - Modeling the Interaction between Agents in Cooperative Multi-Agent Reinforcement Learning [2.9360071145551068]
We propose a novel cooperative MARL algorithm named interactive actor-critic (IAC).
IAC models the interaction of agents from perspectives of policy and value function.
We extend the value decomposition methods to continuous control tasks and evaluate IAC on benchmark tasks including classic control and multi-agent particle environments.
arXiv Detail & Related papers (2021-02-10T01:58:28Z) - Attention Actor-Critic algorithm for Multi-Agent Constrained Co-operative Reinforcement Learning [3.296127938396392]
We consider the problem of computing optimal actions for Reinforcement Learning (RL) agents in a co-operative setting.
We extend the attention actor-critic algorithm to the constrained multi-agent RL setting.
arXiv Detail & Related papers (2021-01-07T03:21:15Z) - UneVEn: Universal Value Exploration for Multi-Agent Reinforcement Learning [53.73686229912562]
We propose a novel MARL approach called Universal Value Exploration (UneVEn).
UneVEn learns a set of related tasks simultaneously with a linear decomposition of universal successor features.
Empirical results on a set of exploration games, challenging cooperative predator-prey tasks requiring significant coordination among agents, and StarCraft II micromanagement benchmarks show that UneVEn can solve tasks where other state-of-the-art MARL methods fail.
arXiv Detail & Related papers (2020-10-06T19:08:47Z) - Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning [66.94149388181343]
We present a new version of a popular $Q$-learning algorithm for MARL.
We show that it can recover the optimal policy even with access to $Q^*$.
We also demonstrate improved performance on predator-prey and challenging multi-agent StarCraft benchmark tasks.
arXiv Detail & Related papers (2020-06-18T18:34:50Z) - Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
arXiv Detail & Related papers (2020-06-11T16:49:23Z) - Dynamic Multi-Robot Task Allocation under Uncertainty and Temporal Constraints [52.58352707495122]
We present a multi-robot allocation algorithm that decouples the key computational challenges of sequential decision-making under uncertainty and multi-agent coordination.
We validate our results over a wide range of simulations on two distinct domains: multi-arm conveyor belt pick-and-place and multi-drone delivery dispatch in a city.
arXiv Detail & Related papers (2020-05-27T01:10:41Z) - FACMAC: Factored Multi-Agent Centralised Policy Gradients [103.30380537282517]
We propose FACtored Multi-Agent Centralised policy gradients (FACMAC).
It is a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces.
We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2020-03-14T21:29:09Z)