Multi-Agent Stage-wise Conservative Linear Bandits
- URL: http://arxiv.org/abs/2510.00602v2
- Date: Tue, 07 Oct 2025 21:07:32 GMT
- Title: Multi-Agent Stage-wise Conservative Linear Bandits
- Authors: Amirhossein Afsharrad, Ahmadreza Moradipari, Sanjay Lall
- Abstract summary: We study the linear bandit problem in a multi-agent networked setting. Agents must satisfy stage-wise conservative constraints. We propose MA-SCLUCB, an episodic algorithm alternating between action selection and consensus-building phases.
- Score: 2.2557806157585834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many real-world applications such as recommendation systems, multiple learning agents must balance exploration and exploitation while maintaining safety guarantees to avoid catastrophic failures. We study the stochastic linear bandit problem in a multi-agent networked setting where agents must satisfy stage-wise conservative constraints. A network of $N$ agents collaboratively maximizes cumulative reward while ensuring that the expected reward at every round is no less than $(1-\alpha)$ times that of a baseline policy. Each agent observes local rewards with unknown parameters, but the network optimizes for the global parameter (average of local parameters). Agents communicate only with immediate neighbors, and each communication round incurs additional regret. We propose MA-SCLUCB (Multi-Agent Stage-wise Conservative Linear UCB), an episodic algorithm alternating between action selection and consensus-building phases. We prove that MA-SCLUCB achieves regret $\tilde{O}\left(\frac{d}{\sqrt{N}}\sqrt{T}\cdot\frac{\log(NT)}{\sqrt{\log(1/|\lambda_2|)}}\right)$ with high probability, where $d$ is the dimension, $T$ is the horizon, and $|\lambda_2|$ is the network's second largest eigenvalue magnitude. Our analysis shows: (i) collaboration yields $\frac{1}{\sqrt{N}}$ improvement despite local communication, (ii) communication overhead grows only logarithmically for well-connected networks, and (iii) stage-wise safety adds only lower-order regret. Thus, distributed learning with safety guarantees achieves near-optimal performance in reasonably connected networks.
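To make the stage-wise constraint concrete, the sketch below shows a minimal single-agent version of the conservative action-selection step: play the optimistic (UCB-maximizing) action only when its pessimistic reward estimate still clears $(1-\alpha)$ times a pessimistic estimate of the baseline's reward, and otherwise fall back to the baseline. All names and constants are hypothetical, the consensus-building phase is omitted, and this is not the authors' reference implementation.

```python
import numpy as np

# Minimal single-agent sketch of stage-wise conservative LinUCB
# (hypothetical names/constants; the consensus phase of MA-SCLUCB is
# omitted and this is not the authors' reference implementation).
class StagewiseConservativeLinUCB:
    def __init__(self, d, alpha=0.1, lam=1.0, beta=1.0):
        self.alpha = alpha        # allowed fractional loss vs. the baseline
        self.beta = beta          # confidence radius; set by theory in the paper
        self.V = lam * np.eye(d)  # regularized Gram matrix
        self.b = np.zeros(d)      # sum of reward-weighted features

    def _width(self, x):
        # confidence-interval width along action x: beta * ||x||_{V^{-1}}
        return self.beta * np.sqrt(x @ np.linalg.solve(self.V, x))

    def select(self, actions, baseline):
        theta_hat = np.linalg.solve(self.V, self.b)
        # optimistic step: pick the UCB-maximizing action
        x_opt = max(actions, key=lambda x: x @ theta_hat + self._width(x))
        # stage-wise safety check: the pessimistic reward of the optimistic
        # action must clear (1 - alpha) times a pessimistic estimate of the
        # baseline reward (if the baseline reward is known, use it directly)
        lcb_opt = x_opt @ theta_hat - self._width(x_opt)
        lcb_base = baseline @ theta_hat - self._width(baseline)
        if lcb_opt >= (1.0 - self.alpha) * lcb_base:
            return x_opt
        return baseline           # otherwise fall back to the safe baseline

    def update(self, x, reward):
        # rank-one update of the ridge-regression statistics
        self.V += np.outer(x, x)
        self.b += reward * x
```

In the full multi-agent algorithm, each agent would additionally average its local statistics (here V and b) with its neighbors over several consensus rounds per episode, which is roughly where the $\frac{\log(NT)}{\sqrt{\log(1/|\lambda_2|)}}$ communication factor in the regret bound enters.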
Related papers
- Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency [52.60557300927007]
We present an $\textbf{MA-OSMA}$ algorithm to transfer the discrete submodular problem into a continuous optimization. We also introduce a projection-free $\textbf{MA-OSEA}$ algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution. Our algorithms significantly improve the $\left(\frac{1}{1+c}\right)$-approximation provided by the state-of-the-art OSG algorithm.
arXiv Detail & Related papers (2025-02-07T15:57:56Z)
- Cooperative Multi-Agent Constrained Stochastic Linear Bandits [2.099922236065961]
A network of $N$ agents communicate locally to minimize their collective regret while keeping their expected cost under a specified threshold $\tau$.
We propose a safe distributed upper confidence bound algorithm, called \textit{MA-OPLB}, and establish a high-probability bound on its $T$-round regret.
We show that our regret bound is of order $\mathcal{O}\left(\frac{d}{\tau-c_0}\frac{\log(NT)^2}{\sqrt{N}}\sqrt{T\log(1/|\lambda|)}\right)$.
arXiv Detail & Related papers (2024-10-22T19:34:53Z)
- Federated Combinatorial Multi-Agent Multi-Armed Bandits [79.1700188160944]
This paper introduces a federated learning framework tailored for online combinatorial optimization with bandit feedback.
In this setting, agents select subsets of arms, observe noisy rewards for these subsets without accessing individual arm information, and can cooperate and share information at specific intervals.
arXiv Detail & Related papers (2024-05-09T17:40:09Z)
- Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with General Utilities [12.104551746465932]
We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints.
Our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$.
In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP.
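The primal-dual mechanism behind such safe policy optimization can be summarized in a few lines: ascend on a Lagrangian in the policy parameters while a nonnegative multiplier prices constraint violations. The sketch below uses assumed names and is a generic template, not the paper's exact update.

```python
import numpy as np

# Generic primal-dual step for a safety-constrained objective (assumed
# names, not the paper's exact update): maximize reward J_r while
# keeping cost J_c below a budget. grad_r/grad_c are estimated policy
# gradients of the reward and cost objectives at theta.
def primal_dual_step(theta, lam, grad_r, grad_c, cost_value,
                     budget, lr_theta=0.01, lr_lam=0.1):
    # primal ascent on the Lagrangian L = J_r - lam * (J_c - budget)
    theta = theta + lr_theta * (grad_r - lam * grad_c)
    # dual ascent on the multiplier, projected to stay nonnegative
    lam = max(0.0, lam + lr_lam * (cost_value - budget))
    return theta, lam
```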
arXiv Detail & Related papers (2023-05-27T20:08:35Z)
- Cooperative Multi-Agent Reinforcement Learning: Asynchronous Communication and Linear Function Approximation [77.09836892653176]
We study multi-agent reinforcement learning in the setting of episodic Markov decision processes.
We propose a provably efficient algorithm based on value iteration that enables asynchronous communication.
We show that a minimal $\Omega(dM)$ communication complexity is required to improve the performance through collaboration.
arXiv Detail & Related papers (2023-05-10T20:29:29Z)
- Multi-Agent congestion cost minimization with linear function approximation [0.0]
This work considers multiple agents traversing a network from a source node to the goal node.
Agents' objective is to find a path to the goal node with minimum overall cost in a decentralized way.
We propose a novel multi-agent congestion cost minimization algorithm.
arXiv Detail & Related papers (2023-01-26T08:45:44Z)
- A Simple and Provably Efficient Algorithm for Asynchronous Federated Contextual Linear Bandits [77.09836892653176]
We study federated contextual linear bandits, where $M$ agents cooperate with each other to solve a global contextual linear bandit problem with the help of a central server.
We consider the asynchronous setting, where all agents work independently and the communication between one agent and the server will not trigger other agents' communication.
We prove that the regret of \texttt{FedLinUCB} is bounded by $\tilde{O}\left(d\sqrt{\sum_{m=1}^{M} T_m}\right)$ and the communication complexity is $\tilde{O}(dM^2)$.
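A common mechanism behind such asynchronous schemes is an event-triggered sync rule: an agent contacts the server only when its locally accumulated information has grown enough. The determinant-ratio test below is a sketch of that idea under assumed names, not necessarily the exact trigger used by \texttt{FedLinUCB}.

```python
import numpy as np

# Hypothetical event-triggered communication rule: sync with the server
# only when the local design matrix has gained enough information,
# measured by a log-determinant ratio. A sketch of the general idea,
# not necessarily the exact FedLinUCB trigger.
def should_sync(V_synced, V_delta, threshold=2.0):
    _, logdet_old = np.linalg.slogdet(V_synced)
    _, logdet_new = np.linalg.slogdet(V_synced + V_delta)
    return (logdet_new - logdet_old) > np.log(threshold)
```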
arXiv Detail & Related papers (2022-07-07T06:16:19Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
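The redistribution step can be illustrated with ordinary ridge regression: regress each trajectory's scalar return on its summed per-step features, then score every step with the fitted weights. This is a sketch under assumed names; PARTED additionally applies pessimism on top of such an estimate.

```python
import numpy as np

# Sketch of least-squares reward redistribution (assumed variable names).
# Trajectory i provides per-step features Phi_i of shape (H, d) and one
# scalar return R_i; we fit w so that sum_h phi_ih^T w ~ R_i, then use
# phi(s, a)^T w as the per-step proxy reward.
def redistribute_rewards(feature_trajs, returns, lam=1.0):
    d = feature_trajs[0].shape[1]
    A = lam * np.eye(d)           # ridge-regularized normal equations
    b = np.zeros(d)
    for Phi, R in zip(feature_trajs, returns):
        z = Phi.sum(axis=0)       # features summed over the trajectory
        A += np.outer(z, z)
        b += R * z
    w = np.linalg.solve(A, b)
    return [Phi @ w for Phi in feature_trajs]   # per-step proxy rewards
```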
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Collaborative Multi-agent Stochastic Linear Bandits [28.268809091816287]
We study a collaborative multi-agent linear bandit setting, where $N$ agents that form a network communicate locally to minimize their overall regret.
All the agents observe the corresponding rewards of the played actions and use an accelerated consensus procedure to compute an estimate of the average of the rewards obtained by all the agents.
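The unaccelerated form of such a consensus procedure is plain gossip averaging, sketched below under assumed names: with a doubly stochastic mixing matrix $W$ supported on the communication graph, repeated local averaging drives every agent's estimate toward the network mean at a rate governed by $|\lambda_2(W)|$.

```python
import numpy as np

# Minimal gossip-averaging sketch (the paper uses an accelerated
# variant; names are assumed). W is a doubly stochastic mixing matrix
# supported on the communication graph; row i of X is agent i's local
# estimate. Each round, every agent replaces its estimate with a
# weighted average of its neighbors'; all rows converge to the network
# mean at a rate governed by |lambda_2(W)|.
def gossip_average(X, W, rounds):
    for _ in range(rounds):
        X = W @ X
    return X
```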
arXiv Detail & Related papers (2022-05-12T19:46:35Z)
- Decentralized Multi-Agent Linear Bandits with Safety Constraints [31.67685495996986]
We study decentralized linear bandits, where a network of $N$ agents acts cooperatively to solve a linear bandit-optimization problem.
We propose DLUCB: a fully decentralized algorithm that minimizes the cumulative regret over the entire network.
We show that our ideas extend naturally to the emerging, albeit more challenging, setting of safe bandits.
arXiv Detail & Related papers (2020-12-01T07:33:00Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)