Communication Efficient Parallel Reinforcement Learning
- URL: http://arxiv.org/abs/2102.10740v1
- Date: Mon, 22 Feb 2021 02:46:36 GMT
- Title: Communication Efficient Parallel Reinforcement Learning
- Authors: Mridul Agarwal, Bhargav Ganguly, Vaneet Aggarwal
- Abstract summary: We consider the problem where $M$ agents interact with $M$ identical and independent environments with $S$ states and $A$ actions.
We aim to find an algorithm that allows the agents to minimize the regret with infrequent communication rounds.
- Score: 34.77250498401055
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We consider the problem where $M$ agents interact with $M$ identical and
independent environments with $S$ states and $A$ actions using reinforcement
learning for $T$ rounds. The agents share their data with a central server to
minimize their regret. We aim to find an algorithm that allows the agents to
minimize the regret with infrequent communication rounds. We provide an
algorithm that runs at each agent and prove that the total cumulative regret of
$M$ agents is upper bounded as $\tilde{O}(DS\sqrt{MAT})$ for a Markov Decision
Process with diameter $D$, number of states $S$, and number of actions $A$. The
agents synchronize after the number of visits to any state-action pair exceeds
a certain threshold. Using this, we obtain a bound of $O\left(MSA\log(MT)\right)$
on the total number of communication rounds. Finally, we evaluate the algorithm
against multiple environments and demonstrate that the proposed algorithm
performs on par with a version of the UCRL2 algorithm that communicates at every
step, while requiring significantly less communication.
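As a rough illustration of the synchronization rule described above, the sketch below shows one way a central server and $M$ agents could coordinate: each agent tracks its visits per state-action pair and requests a synchronization round once its new visits to some pair reach the count recorded at the last synchronization. The class names, the doubling-style trigger, and the aggregation step are assumptions made for illustration, not details taken from the paper's implementation.

```python
import numpy as np

class Agent:
    """Tracks per-(state, action) visit counts for one of the M agents."""

    def __init__(self, n_states, n_actions):
        self.synced_counts = np.zeros((n_states, n_actions))  # global counts at last sync
        self.new_counts = np.zeros((n_states, n_actions))     # local visits since last sync

    def record_visit(self, s, a):
        self.new_counts[s, a] += 1

    def needs_sync(self):
        # Doubling-style trigger (an assumption): request a synchronization round
        # once the visits since the last sync to any (s, a) reach the count that
        # pair had at the last synchronization.
        return bool(np.any(self.new_counts >= np.maximum(1, self.synced_counts)))


class Server:
    """Central server that pools counts whenever some agent crosses the threshold."""

    def __init__(self, n_states, n_actions, n_agents):
        self.global_counts = np.zeros((n_states, n_actions))
        self.agents = [Agent(n_states, n_actions) for _ in range(n_agents)]

    def maybe_synchronize(self):
        if not any(agent.needs_sync() for agent in self.agents):
            return False
        # Communication round: pool every agent's new visits, broadcast the
        # aggregate, and reset the local deltas.
        for agent in self.agents:
            self.global_counts += agent.new_counts
        for agent in self.agents:
            agent.synced_counts = self.global_counts.copy()
            agent.new_counts[:] = 0
        # Each agent would now recompute its optimistic (UCRL2-style) policy
        # from the pooled counts before continuing to act.
        return True
```

A doubling-style trigger of this kind lets each state-action pair cause only logarithmically many synchronizations, which is consistent with the $O\left(MSA\log(MT)\right)$ bound on communication rounds stated in the abstract.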
Related papers
- Federated Combinatorial Multi-Agent Multi-Armed Bandits [79.1700188160944]
This paper introduces a federated learning framework tailored for online combinatorial optimization with bandit feedback.
In this setting, agents select subsets of arms, observe noisy rewards for these subsets without accessing individual arm information, and can cooperate and share information at specific intervals.
arXiv Detail & Related papers (2024-05-09T17:40:09Z) - Cooperative Multi-Agent Reinforcement Learning: Asynchronous
Communication and Linear Function Approximation [77.09836892653176]
We study multi-agent reinforcement learning in the setting of episodic Markov decision processes.
We propose a provably efficient algorithm based on value iteration that enables asynchronous communication.
We show that a minimal $\Omega(dM)$ communication complexity is required to improve the performance through collaboration.
arXiv Detail & Related papers (2023-05-10T20:29:29Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - A Simple and Provably Efficient Algorithm for Asynchronous Federated
Contextual Linear Bandits [77.09836892653176]
We study federated contextual linear bandits, where $M$ agents cooperate with each other to solve a global contextual linear bandit problem with the help of a central server.
We consider the asynchronous setting, where all agents work independently and the communication between one agent and the server will not trigger other agents' communication.
We prove that the regret of \texttt{FedLinUCB} is bounded by $\tilde{O}(d\sqrt{\sum_{m=1}^{M} T_m})$ and the communication complexity is $\tilde{O}(dM
arXiv Detail & Related papers (2022-07-07T06:16:19Z) - Distributed Bandits with Heterogeneous Agents [38.90376765616447]
This paper tackles a multi-agent bandit setting where $M$ agents cooperate to solve the same instance of a $K$-armed bandit problem.
We propose two learning algorithms, ucbo and AAE.
We prove that both algorithms achieve order-optimal regret, which is $O\left(\sum_{i:\tilde{\Delta}_i>0} \log T/\tilde{\Delta}_i\right)$, where $\tilde{\Delta}_i$ is the minimum suboptimality gap between the reward mean of
arXiv Detail & Related papers (2022-01-23T20:04:15Z) - Decentralized Cooperative Reinforcement Learning with Hierarchical
Information Structure [14.919120396838208]
We consider two-agent multi-armed bandits (MABs) and Markov decision processes (MDPs) with a hierarchical information structure arising in applications.
In each step the "leader" chooses her action first, and then the "follower" decides his action after observing the leader's action.
For the MDP setting, we obtain $\widetilde{\mathcal{O}}(\sqrt{H^7S^2ABT})$ regret, where $H$ is the number of steps per episode, $S$ is the number of states
arXiv Detail & Related papers (2021-11-01T09:18:07Z) - Multi-Agent Multi-Armed Bandits with Limited Communication [41.63062883750805]
We consider the problem where $N$ agents interact with an instance of a $K$-armed bandit problem for $K \gg N$.
The agents aim to simultaneously minimize the cumulative regret over all the agents for a total of $T$ time steps, the number of communication rounds, and the number of bits in each communication round.
We present Limited Communication Collaboration - Upper Bound (LCC-UCB), a doubling-epoch based algorithm where each agent communicates only after the end of the epoch and shares the index of the best arm it knows; a rough illustrative sketch of this communication pattern appears after this list.
arXiv Detail & Related papers (2021-02-10T06:28:37Z)
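The LCC-UCB entry above suggests a simple communication pattern: run a UCB rule locally within doubling-length epochs and send only the index of the empirically best arm when an epoch ends. The sketch below is a hedged illustration of that pattern only; the arm assignment, the epoch schedule, and what the server does with the reported indices are assumptions, not details taken from the paper.

```python
import math

class EpochAgent:
    """One agent's local view: UCB over an assigned subset of arms,
    with communication only at epoch boundaries (illustrative only)."""

    def __init__(self, arm_indices):
        self.arms = list(arm_indices)            # subset of arm indices this agent explores
        self.pulls = {a: 0 for a in self.arms}
        self.mean = {a: 0.0 for a in self.arms}

    def play_epoch(self, epoch, pull_arm):
        """Play 2**epoch steps with a local UCB rule, then report one arm index."""
        for _ in range(2 ** epoch):              # epoch lengths double, so epochs are O(log T)
            t = sum(self.pulls.values()) + 1
            ucb = {
                a: self.mean[a] + math.sqrt(2.0 * math.log(t) / max(1, self.pulls[a]))
                for a in self.arms
            }
            a = max(ucb, key=ucb.get)
            r = pull_arm(a)                      # environment callback returning a stochastic reward
            self.pulls[a] += 1
            self.mean[a] += (r - self.mean[a]) / self.pulls[a]
        # The only communication: a single arm index per epoch.
        return max(self.mean, key=self.mean.get)
```

Reporting one arm index per doubling epoch keeps both the number of communication rounds (logarithmic in $T$) and the bits per round small, which is the trade-off this entry highlights.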