Decentralized Cooperative Reinforcement Learning with Hierarchical
Information Structure
- URL: http://arxiv.org/abs/2111.00781v1
- Date: Mon, 1 Nov 2021 09:18:07 GMT
- Title: Decentralized Cooperative Reinforcement Learning with Hierarchical
Information Structure
- Authors: Hsu Kao, Chen-Yu Wei, Vijay Subramanian
- Abstract summary: We consider two-agent multi-armed bandits (MABs) and Markov decision processes (MDPs) with a hierarchical information structure arising in applications.
In each step the leader" chooses her action first, and then the follower" decides his action after observing the leader's action.
For the MDP setting, we obtain $\widetilde{\mathcal{O}}(\sqrt{H^7S^2ABT})$ regret, where $H$ is the number of steps per episode, $S$ is the number of states, and $T$ is the number of episodes.
- Score: 14.919120396838208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-agent reinforcement learning (MARL) problems are challenging due to
information asymmetry. To overcome this challenge, existing methods often
require a high level of coordination or communication between the agents. We
consider two-agent multi-armed bandits (MABs) and Markov decision processes
(MDPs) with a hierarchical information structure arising in applications, which
we exploit to propose simpler and more efficient algorithms that require no
coordination or communication. In the structure, in each step the "leader"
chooses her action first, and then the "follower" decides his action after
observing the leader's action. The two agents observe the same reward (and the
same state transition in the MDP setting) that depends on their joint action.
For the bandit setting, we propose a hierarchical bandit algorithm that
achieves a near-optimal gap-independent regret of
$\widetilde{\mathcal{O}}(\sqrt{ABT})$ and a near-optimal gap-dependent regret
of $\mathcal{O}(\log(T))$, where $A$ and $B$ are the numbers of actions of the
leader and the follower, respectively, and $T$ is the number of steps. We
further extend to the case of multiple followers and the case with a deep
hierarchy, where we both obtain near-optimal regret bounds. For the MDP
setting, we obtain $\widetilde{\mathcal{O}}(\sqrt{H^7S^2ABT})$ regret, where
$H$ is the number of steps per episode, $S$ is the number of states, $T$ is the
number of episodes. This matches the existing lower bound in terms of $A, B$,
and $T$.
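To make the hierarchical information structure concrete, below is a minimal, hypothetical Python sketch of one natural decentralized scheme consistent with the protocol described above; it is not the paper's algorithm. The follower keeps a separate UCB index for each observed leader action, while the leader runs UCB over her own $A$ actions and treats the follower's adaptive response as part of the environment. The reward table `mu`, the horizon, and the specific confidence bonus are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

A, B, T = 3, 4, 5000                      # leader arms, follower arms, horizon (illustrative)
mu = rng.uniform(0.0, 1.0, size=(A, B))  # hypothetical mean reward of each joint action (a, b)

# Follower: one UCB index per observed leader action -- no coordination or messages needed.
f_counts = np.zeros((A, B))
f_sums = np.zeros((A, B))

# Leader: a single UCB index over her own A actions, treating the follower's
# adaptive response as part of the environment.
l_counts = np.zeros(A)
l_sums = np.zeros(A)

def ucb(sums, counts, t):
    """Optimistic index: empirical mean plus a confidence bonus; unvisited arms come first."""
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    bonus = np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(counts, 1.0))
    return np.where(counts == 0, np.inf, means + bonus)

total_reward = 0.0
for t in range(1, T + 1):
    a = int(np.argmax(ucb(l_sums, l_counts, t)))        # leader moves first
    b = int(np.argmax(ucb(f_sums[a], f_counts[a], t)))  # follower observes a, then moves
    r = float(rng.random() < mu[a, b])                  # shared Bernoulli reward

    # Both agents update from the same observed reward; nothing else is exchanged.
    l_counts[a] += 1.0
    l_sums[a] += r
    f_counts[a, b] += 1.0
    f_sums[a, b] += r
    total_reward += r

print(f"average reward: {total_reward / T:.3f}   best joint mean: {mu.max():.3f}")
```

Because the follower conditions only on the leader's observed action and both agents learn from the same reward signal, no explicit coordination or message passing is needed, mirroring the decentralization described in the abstract.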
Related papers
- Cooperative Multi-Agent Reinforcement Learning: Asynchronous
Communication and Linear Function Approximation [77.09836892653176]
We study multi-agent reinforcement learning in the setting of episodic Markov decision processes.
We propose a provably efficient algorithm based on value iteration that enables asynchronous communication.
We show that a minimal $\Omega(dM)$ communication complexity is required to improve the performance through collaboration.
arXiv Detail & Related papers (2023-05-10T20:29:29Z) - Adversarial Online Multi-Task Reinforcement Learning [12.421997449847153]
We consider the adversarial online multi-task reinforcement learning setting.
In each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models.
The learner's objective is to minimize its regret with respect to the optimal policy for each task.
arXiv Detail & Related papers (2023-01-11T02:18:26Z) - Provably Efficient Model-free RL in Leader-Follower MDP with Linear
Function Approximation [1.370633147306388]
We consider a multi-agent episodic MDP setup where an agent (leader) takes action at each step of the episode, followed by another agent (follower).
We propose a model-free RL algorithm and show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret bounds can be achieved for both the leader and the follower.
This is the first sub-linear regret bound guarantee for the Markov games with non-myopic followers with function approximation.
arXiv Detail & Related papers (2022-11-28T21:59:58Z) - On the Sample Complexity of Representation Learning in Multi-task
Bandits with Global and Local structure [77.60508571062958]
We investigate the sample complexity of learning the optimal arm for multi-task bandit problems.
Arms consist of two components: one that is shared across tasks (which we call the representation) and one that is task-specific (which we call the predictor).
We devise an algorithm OSRL-SC whose sample complexity approaches the lower bound, and scales at most as $H(G\log(\delta_G) + X\log(\delta_H))$, with $X, G, H$ being, respectively, the number of tasks, representations, and predictors.
arXiv Detail & Related papers (2022-11-28T08:40:12Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Horizon-Free Reinforcement Learning in Polynomial Time: the Power of
Stationary Policies [88.75843804630772]
We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to existing bounds.
Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies.
arXiv Detail & Related papers (2022-03-24T08:14:12Z) - Distributed Bandits with Heterogeneous Agents [38.90376765616447]
This paper tackles a multi-agent bandit setting where $M$ agents cooperate together to solve the same instance of a $K$-armed bandit problem.
We propose two learning algorithms, ucbo and AAE.
We prove that both algorithms achieve order-optimal regret, which is $O\left(\sum_{i:\tilde{\Delta}_i>0} \log T/\tilde{\Delta}_i\right)$, where $\tilde{\Delta}_i$ is the minimum suboptimality gap between the reward mean of
arXiv Detail & Related papers (2022-01-23T20:04:15Z) - Communication Efficient Parallel Reinforcement Learning [34.77250498401055]
We consider the problem where $M$ agents interact with $M$ identical and independent environments with $S$ states and $A$ actions.
We aim to find an algorithm that allows the agents to minimize the regret with infrequent communication rounds.
arXiv Detail & Related papers (2021-02-22T02:46:36Z) - Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive
Multi-Step Bootstrap [84.66885506098724]
This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDPs), Adaptive Multi-step Bootstrap (AMB).
We show AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps.
We also show AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{min}}$ regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$ satisfying that $a$ is a non-unique optimal action for
arXiv Detail & Related papers (2021-02-09T07:46:34Z)