A Decentralized Policy with Logarithmic Regret for a Class of
Multi-Agent Multi-Armed Bandit Problems with Option Unavailability
Constraints and Stochastic Communication Protocols
- URL: http://arxiv.org/abs/2003.12968v2
- Date: Tue, 31 Mar 2020 04:29:22 GMT
- Title: A Decentralized Policy with Logarithmic Regret for a Class of
Multi-Agent Multi-Armed Bandit Problems with Option Unavailability
Constraints and Stochastic Communication Protocols
- Authors: Pathmanathan Pankayaraj, D. H. S. Maithripala, and J. M. Berg
- Abstract summary: This paper considers a multi-armed bandit (MAB) problem in which multiple mobile agents receive rewards by sampling from a collection of spatially distributed processes, called bandits.
The goal is to formulate a decentralized policy for each agent, in order to maximize the total cumulative reward over all agents, subject to option availability and inter-agent communication constraints.
To our knowledge, this is the first such decentralized policy for non-fully connected spatial graphs with communication constraints.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper considers a multi-armed bandit (MAB) problem in which multiple
mobile agents receive rewards by sampling from a collection of spatially
dispersed stochastic processes, called bandits. The goal is to formulate a
decentralized policy for each agent, in order to maximize the total cumulative
reward over all agents, subject to option availability and inter-agent
communication constraints. The problem formulation is motivated by applications
in which a team of autonomous mobile robots cooperates to accomplish an
exploration and exploitation task in an uncertain environment. Bandit locations
are represented by vertices of the spatial graph. At any time, an agent's
options consist of sampling the bandit at its current location or traveling
along an edge of the spatial graph to a new bandit location. Communication
constraints are described by a directed, non-stationary, stochastic
communication graph. At any time, agents may receive data only from their
communication graph in-neighbors. For the case of a single agent on a fully
connected spatial graph, it is known that the expected regret for any optimal
policy is necessarily bounded below by a function that grows as the logarithm
of time. A class of policies called upper confidence bound (UCB) algorithms
asymptotically achieve logarithmic regret for the classical MAB problem. In
this paper, we propose a UCB-based decentralized motion and option selection
policy and a non-stationary stochastic communication protocol that guarantee
logarithmic regret. To our knowledge, this is the first such decentralized
policy for non-fully connected spatial graphs with communication constraints.
When the spatial graph is fully connected and the communication graph is
stationary, our decentralized algorithm matches or exceeds the best reported
prior results from the literature.
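To make the flavor of such a policy concrete, here is a minimal, hypothetical sketch of a UCB-style option selection rule in which an agent's choices are restricted by the spatial graph (sample in place or travel to an adjacent vertex) and sample statistics may be merged from communication in-neighbors. The `Agent` class, the `merge_neighbor_data` rule, and the exploration constant are illustrative assumptions, not the paper's algorithm; in particular, the naive merge ignores double counting, which a real communication protocol must handle.

```python
import math
import random


class Agent:
    def __init__(self, num_bandits, start_vertex, spatial_edges):
        self.vertex = start_vertex       # current location on the spatial graph
        self.edges = spatial_edges       # adjacency: vertex -> set of reachable vertices
        self.counts = [0] * num_bandits  # samples recorded per bandit (own + shared)
        self.sums = [0.0] * num_bandits  # reward sums per bandit (own + shared)
        self.t = 0                       # local time step

    def merge_neighbor_data(self, neighbor_counts, neighbor_sums):
        # Crude merge: keep the larger sample record per bandit. A real
        # protocol must avoid double-counting; this is only a sketch.
        for k in range(len(self.counts)):
            if neighbor_counts[k] > self.counts[k]:
                self.counts[k] = neighbor_counts[k]
                self.sums[k] = neighbor_sums[k]

    def ucb_index(self, k):
        if self.counts[k] == 0:
            return float("inf")          # force initial exploration
        mean = self.sums[k] / self.counts[k]
        return mean + math.sqrt(2.0 * math.log(self.t + 1) / self.counts[k])

    def select_option(self):
        # Options are restricted by the spatial graph: sample in place,
        # or travel to an adjacent vertex.
        self.t += 1
        available = {self.vertex} | self.edges[self.vertex]
        return max(available, key=self.ucb_index)

    def act(self, sample_reward):
        target = self.select_option()
        if target != self.vertex:
            self.vertex = target         # travel step (no reward in this sketch)
            return None
        r = sample_reward(self.vertex)   # sample the co-located bandit
        self.counts[self.vertex] += 1
        self.sums[self.vertex] += r
        return r


# Example: three bandits on a path graph 0 - 1 - 2 with Bernoulli rewards.
edges = {0: {1}, 1: {0, 2}, 2: {1}}
agent = Agent(3, start_vertex=0, spatial_edges=edges)
means = [0.2, 0.5, 0.8]
for _ in range(1000):
    agent.act(lambda v: 1.0 if random.random() < means[v] else 0.0)
```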
Related papers
- Networked Communication for Mean-Field Games with Function Approximation and Empirical Mean-Field Estimation [59.01527054553122]
Decentralised agents can learn equilibria in Mean-Field Games from a single, non-episodic run of the empirical system.
We introduce function approximation to the existing setting, drawing on the Munchausen Online Mirror Descent method.
We additionally provide new algorithms that allow agents to estimate the global empirical distribution based on a local neighbourhood.
arXiv Detail & Related papers (2024-08-21T13:32:46Z)
- Distributed Optimization via Kernelized Multi-armed Bandits [6.04275169308491]
We model a distributed optimization problem as a multi-agent kernelized multi-armed bandit problem with a heterogeneous reward setting.
We present a fully decentralized algorithm, Multi-agent IGP-UCB (MA-IGP-UCB), which achieves a sub-linear regret bound for popular classes of kernels.
We also propose an extension, the Multi-agent Delayed IGP-UCB (MAD-IGP-UCB) algorithm, which reduces the dependence of the regret bound on the number of agents in the network.
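The summary names these algorithms without spelling the method out; as a rough, hypothetical illustration of the single-agent GP-UCB index that kernelized bandit methods build on (not MA-IGP-UCB itself), consider the sketch below. The kernel, `beta`, and noise level are illustrative choices.

```python
import numpy as np

def rbf(a, b, length_scale=0.5):
    # Squared-exponential kernel on 1-D inputs; k(x, x) = 1.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_ucb_pick(X, y, candidates, beta=2.0, noise=0.1):
    # GP posterior at the candidate points, then an optimistic
    # (mean + beta * std) choice among them.
    K = rbf(X, X) + noise**2 * np.eye(len(X))
    Ks = rbf(candidates, X)
    mu = Ks @ np.linalg.solve(K, y)
    cov_reduction = np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    var = np.clip(1.0 - cov_reduction, 0.0, None)
    return candidates[np.argmax(mu + beta * np.sqrt(var))]

# Toy usage: pick the next query point from a grid given three observations.
X = np.array([0.1, 0.4, 0.9])
y = np.array([0.2, 0.8, 0.3])
grid = np.linspace(0.0, 1.0, 101)
x_next = gp_ucb_pick(X, y, grid)
```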
arXiv Detail & Related papers (2023-12-07T21:57:48Z)
- RGMComm: Return Gap Minimization via Discrete Communications in Multi-Agent Reinforcement Learning [33.86277578441437]
Communication is crucial for solving cooperative Multi-Agent Reinforcement Learning tasks in partially observable Markov Decision Processes.
We propose the Return-Gap-Minimization Communication (RGMComm) algorithm, a surprisingly simple design for discrete message generation functions.
Evaluations show that RGMComm significantly outperforms state-of-the-art multi-agent communication baselines.
arXiv Detail & Related papers (2023-08-07T07:26:55Z)
- Scalable Multi-agent Covering Option Discovery based on Kronecker Graphs [49.71319907864573]
In this paper, we propose a multi-agent skill discovery method that eases the decomposition of the joint state space.
Our key idea is to approximate the joint state space as a Kronecker graph, based on which we can directly estimate its Fiedler vector.
Considering that directly computing the Laplacian spectrum is intractable for tasks with infinite-scale state spaces, we further propose a deep learning extension of our method.
arXiv Detail & Related papers (2023-07-21T14:53:12Z)
- QuTE: decentralized multiple testing on sensor networks with false discovery rate control [130.7122910646076]
This paper designs methods for decentralized multiple hypothesis testing on graphs equipped with provable guarantees on the false discovery rate (FDR).
We consider the setting where distinct agents reside on the nodes of an undirected graph, and each agent possesses p-values corresponding to one or more hypotheses local to its node.
Each agent must individually decide whether to reject one or more of its local hypotheses by communicating only with its neighbors, with the joint aim of controlling the global FDR over the entire graph at a predefined level.
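As a hedged illustration of the kind of local decision rule such methods use (not the QuTE algorithm as published), the sketch below pools an agent's own and neighboring p-values and runs a Benjamini-Hochberg step at a level scaled by the total number of hypotheses on the graph, a conservative choice.

```python
def local_bh(own_pvals, neighbor_pvals, n_total, alpha=0.05):
    # Pool the p-values visible to this agent (its own plus its
    # neighbors') and sort them for the step-up procedure.
    pooled = sorted(own_pvals + neighbor_pvals)
    # Largest k with p_(k) <= alpha * k / n_total (k = 0 if none).
    k = max((i + 1 for i, p in enumerate(pooled)
             if p <= alpha * (i + 1) / n_total), default=0)
    threshold = alpha * k / n_total
    # Decide only the agent's own local hypotheses.
    return [p <= threshold for p in own_pvals]

# Example: an agent with two local hypotheses and three neighbor p-values,
# on a graph with 10 hypotheses in total. Prints [True, False].
print(local_bh([0.001, 0.2], [0.004, 0.03, 0.6], n_total=10))
```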
arXiv Detail & Related papers (2022-10-09T19:48:39Z)
- Optimal Clustering with Bandit Feedback [57.672609011609886]
This paper considers the problem of online clustering with bandit feedback.
It includes a novel stopping rule for sequential testing that circumvents the need to solve any NP-hard weighted clustering problem as a subroutine.
We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically and significantly outperforms a non-adaptive baseline algorithm.
arXiv Detail & Related papers (2022-02-09T06:05:05Z)
- Decentralized Optimization Over the Stiefel Manifold by an Approximate Augmented Lagrangian Function [7.768673476492471]
This paper focuses on the decentralized optimization problem over the Stiefel manifold.
Agents can only communicate with their neighbors in a collaborative effort to solve this problem.
In existing methods, multiple rounds of communication are required to guarantee convergence.
This paper proposes a decentralized algorithm, called DESTINY, which only invokes a single round of communications per iteration.
arXiv Detail & Related papers (2021-12-30T07:19:59Z)
- Distributed Adaptive Learning Under Communication Constraints [54.22472738551687]
This work examines adaptive distributed learning strategies designed to operate under communication constraints.
We consider a network of agents that must solve an online optimization problem from continual observation of streaming data.
arXiv Detail & Related papers (2021-12-03T19:23:48Z)
- Decentralized Multi-Armed Bandit Can Outperform Classic Upper Confidence Bound: A Homogeneous Case over Strongly Connected Graphs [9.84486119211443]
This paper studies a homogeneous decentralized multi-armed bandit problem, in which a network of multiple agents faces the same set of arms.
A fully decentralized upper confidence bound (UCB) algorithm is proposed for a multi-agent network whose neighbor relations are described by a directed graph.
arXiv Detail & Related papers (2021-11-22T01:05:52Z) - Acceleration in Distributed Optimization Under Similarity [72.54787082152278]
We study distributed (strongly convex) optimization problems over a network of agents, with no centralized nodes.
An $\varepsilon$-solution is achieved in $\tilde{\mathcal{O}}\big(\sqrt{\frac{\beta/\mu}{1-\rho}}\log 1/\varepsilon\big)$ communication steps.
This rate matches, for the first time and up to poly-log factors, lower communication-complexity bounds for distributed gossip algorithms applied to the class of problems of interest.
arXiv Detail & Related papers (2021-10-24T04:03:00Z) - Dimension-Free Rates for Natural Policy Gradient in Multi-Agent
Reinforcement Learning [22.310861786709538]
We propose a scalable algorithm for cooperative multi-agent reinforcement learning.
We show that our algorithm converges to the globally optimal policy with a dimension-free statistical and computational complexity.
arXiv Detail & Related papers (2021-09-23T23:38:15Z)