A Decentralized Policy with Logarithmic Regret for a Class of
Multi-Agent Multi-Armed Bandit Problems with Option Unavailability
Constraints and Stochastic Communication Protocols
- URL: http://arxiv.org/abs/2003.12968v2
- Date: Tue, 31 Mar 2020 04:29:22 GMT
- Title: A Decentralized Policy with Logarithmic Regret for a Class of
Multi-Agent Multi-Armed Bandit Problems with Option Unavailability
Constraints and Stochastic Communication Protocols
- Authors: Pathmanathan Pankayaraj, D. H. S. Maithripala, and J. M. Berg
- Abstract summary: This paper considers a multi-armed bandit (MAB) problem in which multiple mobile agents receive rewards by sampling from a collection of spatially distributed processes, called bandits.
The goal is to formulate a decentralized policy for each agent, in order to maximize the total cumulative reward over all agents, subject to option availability and inter-agent communication constraints.
To our knowledge, this is the first such decentralized policy for non-fully connected spatial graphs with communication constraints.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper considers a multi-armed bandit (MAB) problem in which multiple
mobile agents receive rewards by sampling from a collection of spatially
dispersed stochastic processes, called bandits. The goal is to formulate a
decentralized policy for each agent, in order to maximize the total cumulative
reward over all agents, subject to option availability and inter-agent
communication constraints. The problem formulation is motivated by applications
in which a team of autonomous mobile robots cooperates to accomplish an
exploration and exploitation task in an uncertain environment. Bandit locations
are represented by vertices of the spatial graph. At any time, an agent's
options consist of sampling the bandit at its current location or traveling
along an edge of the spatial graph to a new bandit location. Communication
constraints are described by a directed, non-stationary, stochastic
communication graph. At any time, agents may receive data only from their
communication graph in-neighbors. For the case of a single agent on a fully
connected spatial graph, it is known that the expected regret for any optimal
policy is necessarily bounded below by a function that grows as the logarithm
of time. A class of policies called upper confidence bound (UCB) algorithms
asymptotically achieve logarithmic regret for the classical MAB problem. In
this paper, we propose a UCB-based decentralized motion and option selection
policy and a non-stationary stochastic communication protocol that guarantee
logarithmic regret. To our knowledge, this is the first such decentralized
policy for non-fully connected spatial graphs with communication constraints.
When the spatial graph is fully connected and the communication graph is
stationary, our decentralized algorithm matches or exceeds the best reported
prior results from the literature.
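To make the flavor of such a policy concrete, here is a minimal, hypothetical sketch of a UCB-style option selection rule in which an agent's choices are restricted by the spatial graph (sample in place or travel to an adjacent vertex) and sample statistics may be merged from communication in-neighbors. The `Agent` class, the `merge_neighbor_data` rule, and the exploration constant are illustrative assumptions, not the paper's algorithm; in particular, the naive merge ignores double counting, which a real communication protocol must handle.

```python
import math
import random


class Agent:
    def __init__(self, num_bandits, start_vertex, spatial_edges):
        self.vertex = start_vertex       # current location on the spatial graph
        self.edges = spatial_edges       # adjacency: vertex -> set of reachable vertices
        self.counts = [0] * num_bandits  # samples recorded per bandit (own + shared)
        self.sums = [0.0] * num_bandits  # reward sums per bandit (own + shared)
        self.t = 0                       # local time step

    def merge_neighbor_data(self, neighbor_counts, neighbor_sums):
        # Crude merge: keep the larger sample record per bandit. A real
        # protocol must avoid double-counting; this is only a sketch.
        for k in range(len(self.counts)):
            if neighbor_counts[k] > self.counts[k]:
                self.counts[k] = neighbor_counts[k]
                self.sums[k] = neighbor_sums[k]

    def ucb_index(self, k):
        if self.counts[k] == 0:
            return float("inf")          # force initial exploration
        mean = self.sums[k] / self.counts[k]
        return mean + math.sqrt(2.0 * math.log(self.t + 1) / self.counts[k])

    def select_option(self):
        # Options are restricted by the spatial graph: sample in place,
        # or travel to an adjacent vertex.
        self.t += 1
        available = {self.vertex} | self.edges[self.vertex]
        return max(available, key=self.ucb_index)

    def act(self, sample_reward):
        target = self.select_option()
        if target != self.vertex:
            self.vertex = target         # travel step (no reward in this sketch)
            return None
        r = sample_reward(self.vertex)   # sample the co-located bandit
        self.counts[self.vertex] += 1
        self.sums[self.vertex] += r
        return r


# Example: three bandits on a path graph 0 - 1 - 2 with Bernoulli rewards.
edges = {0: {1}, 1: {0, 2}, 2: {1}}
agent = Agent(3, start_vertex=0, spatial_edges=edges)
means = [0.2, 0.5, 0.8]
for _ in range(1000):
    agent.act(lambda v: 1.0 if random.random() < means[v] else 0.0)
```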
Related papers
- Networked Communication for Mean-Field Games with Function Approximation and Empirical Mean-Field Estimation [59.01527054553122]
Decentralised agents can learn equilibria in Mean-Field Games from a single, non-episodic run of the empirical system.
We introduce function approximation to the existing setting, drawing on the Munchausen Online Mirror Descent method.
We additionally provide new algorithms that allow agents to estimate the global empirical distribution based on a local neighbourhood.
arXiv Detail & Related papers (2024-08-21T13:32:46Z)
- Distributed Optimization via Kernelized Multi-armed Bandits [6.04275169308491]
We model a distributed optimization problem as a multi-agent kernelized multi-armed bandit problem with a heterogeneous reward setting.
We present a fully decentralized algorithm, Multi-agent IGP-UCB (MA-IGP-UCB), which achieves a sub-linear regret bound for popular classes of kernels.
We also propose an extension, the Multi-agent Delayed IGP-UCB (MAD-IGP-UCB) algorithm, which reduces the dependence of the regret bound on the number of agents in the network.
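The summary names these algorithms without spelling the method out; as a rough, hypothetical illustration of the single-agent GP-UCB index that kernelized bandit methods build on (not MA-IGP-UCB itself), consider the sketch below. The kernel, `beta`, and noise level are illustrative choices.

```python
import numpy as np

def rbf(a, b, length_scale=0.5):
    # Squared-exponential kernel on 1-D inputs; k(x, x) = 1.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_ucb_pick(X, y, candidates, beta=2.0, noise=0.1):
    # GP posterior at the candidate points, then an optimistic
    # (mean + beta * std) choice among them.
    K = rbf(X, X) + noise**2 * np.eye(len(X))
    Ks = rbf(candidates, X)
    mu = Ks @ np.linalg.solve(K, y)
    cov_reduction = np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    var = np.clip(1.0 - cov_reduction, 0.0, None)
    return candidates[np.argmax(mu + beta * np.sqrt(var))]

# Toy usage: pick the next query point from a grid given three observations.
X = np.array([0.1, 0.4, 0.9])
y = np.array([0.2, 0.8, 0.3])
grid = np.linspace(0.0, 1.0, 101)
x_next = gp_ucb_pick(X, y, grid)
```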
arXiv Detail & Related papers (2023-12-07T21:57:48Z)
- RGMComm: Return Gap Minimization via Discrete Communications in Multi-Agent Reinforcement Learning [33.86277578441437]
Communication is crucial for solving cooperative Multi-Agent Reinforcement Learning tasks in partially observable Markov Decision Processes.
We propose the Return-Gap-Minimization Communication (RGMComm) algorithm, a surprisingly simple design for discrete message generation functions.
Evaluations show that RGMComm significantly outperforms state-of-the-art multi-agent communication baselines.
arXiv Detail & Related papers (2023-08-07T07:26:55Z)
- Scalable Multi-agent Covering Option Discovery based on Kronecker Graphs [49.71319907864573]
In this paper, we propose a multi-agent skill discovery method that eases the decomposition of the joint state space.
Our key idea is to approximate the joint state space as a Kronecker graph, based on which we can directly estimate its Fiedler vector.
Considering that directly computing the Laplacian spectrum is intractable for tasks with infinite-scale state spaces, we further propose a deep learning extension of our method.
arXiv Detail & Related papers (2023-07-21T14:53:12Z)
- QuTE: decentralized multiple testing on sensor networks with false discovery rate control [130.7122910646076]
This paper designs methods for decentralized multiple hypothesis testing on graphs equipped with provable guarantees on the false discovery rate (FDR).
We consider the setting where distinct agents reside on the nodes of an undirected graph, and each agent possesses p-values corresponding to one or more hypotheses local to its node.
Each agent must individually decide whether to reject one or more of its local hypotheses by communicating only with its neighbors, with the joint aim of controlling the global FDR over the entire graph at a predefined level.
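As a hedged illustration of the kind of local decision rule such methods use (not the QuTE algorithm as published), the sketch below pools an agent's own and neighboring p-values and runs a Benjamini-Hochberg step at a level scaled by the total number of hypotheses on the graph, a conservative choice.

```python
def local_bh(own_pvals, neighbor_pvals, n_total, alpha=0.05):
    # Pool the p-values visible to this agent (its own plus its
    # neighbors') and sort them for the step-up procedure.
    pooled = sorted(own_pvals + neighbor_pvals)
    # Largest k with p_(k) <= alpha * k / n_total (k = 0 if none).
    k = max((i + 1 for i, p in enumerate(pooled)
             if p <= alpha * (i + 1) / n_total), default=0)
    threshold = alpha * k / n_total
    # Decide only the agent's own local hypotheses.
    return [p <= threshold for p in own_pvals]

# Example: an agent with two local hypotheses and three neighbor p-values,
# on a graph with 10 hypotheses in total. Prints [True, False].
print(local_bh([0.001, 0.2], [0.004, 0.03, 0.6], n_total=10))
```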
arXiv Detail & Related papers (2022-10-09T19:48:39Z)
- Optimal Clustering with Bandit Feedback [57.672609011609886]
This paper considers the problem of online clustering with bandit feedback.
It includes a novel stopping rule for sequential testing that circumvents the need to solve any NP-hard weighted clustering problem as a subroutine.
We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically and significantly outperforms a non-adaptive baseline algorithm.
arXiv Detail & Related papers (2022-02-09T06:05:05Z)
- Decentralized Optimization Over the Stiefel Manifold by an Approximate Augmented Lagrangian Function [7.768673476492471]
This paper focuses on the decentralized optimization problem over the Stiefel manifold.
Agents can only communicate with their neighbors in a collaborative effort to solve this problem.
In existing methods, multiple rounds of communication are required to guarantee convergence.
This paper proposes a decentralized algorithm, called DESTINY, which only invokes a single round of communications per iteration.
arXiv Detail & Related papers (2021-12-30T07:19:59Z)
- Distributed Adaptive Learning Under Communication Constraints [54.22472738551687]
This work examines adaptive distributed learning strategies designed to operate under communication constraints.
We consider a network of agents that must solve an online optimization problem from continual observation of streaming data.
arXiv Detail & Related papers (2021-12-03T19:23:48Z)
- Decentralized Multi-Armed Bandit Can Outperform Classic Upper Confidence Bound: A Homogeneous Case over Strongly Connected Graphs [9.84486119211443]
This paper studies a homogeneous decentralized multi-armed bandit problem, in which a network of multiple agents faces the same set of arms.
A fully decentralized upper confidence bound (UCB) algorithm is proposed for a multi-agent network whose neighbor relations are described by a directed graph.
arXiv Detail & Related papers (2021-11-22T01:05:52Z) - Acceleration in Distributed Optimization Under Similarity [72.54787082152278]
We study distributed (strongly convex) optimization problems over a network of agents, with no centralized nodes.
An $\varepsilon$-solution is achieved in $\tilde{\mathcal{O}}\big(\sqrt{\frac{\beta/\mu}{1-\rho}}\log 1/\varepsilon\big)$ communication steps.
This rate matches, for the first time and up to poly-log factors, lower communication-complexity bounds for distributed gossip algorithms applied to the class of problems of interest.
arXiv Detail & Related papers (2021-10-24T04:03:00Z) - Dimension-Free Rates for Natural Policy Gradient in Multi-Agent
Reinforcement Learning [22.310861786709538]
We propose a scalable algorithm for cooperative multi-agent reinforcement learning.
We show that our algorithm converges to the globally optimal policy with a dimension-free statistical and computational complexity.
arXiv Detail & Related papers (2021-09-23T23:38:15Z)