On the Near-Optimality of Local Policies in Large Cooperative
Multi-Agent Reinforcement Learning
- URL: http://arxiv.org/abs/2209.03491v1
- Date: Wed, 7 Sep 2022 23:15:08 GMT
- Title: On the Near-Optimality of Local Policies in Large Cooperative
Multi-Agent Reinforcement Learning
- Authors: Washim Uddin Mondal, Vaneet Aggarwal, Satish V. Ukkusuri
- Abstract summary: We show that in a cooperative $N$-agent network, one can design locally executable policies for the agents.
We also devise an algorithm to explicitly construct a local policy.
- Score: 37.63373979256335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show that in a cooperative $N$-agent network, one can design locally
executable policies for the agents such that the resulting discounted sum of
average rewards (value) well approximates the optimal value computed over all
(including non-local) policies. Specifically, we prove that, if $|\mathcal{X}|,
|\mathcal{U}|$ denote the sizes of the state and action spaces of individual
agents, then, for a sufficiently small discount factor, the approximation error is
given by $\mathcal{O}(e)$ where $e\triangleq
\frac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]$.
Moreover, in a special case where the reward and state transition functions are
independent of the action distribution of the population, the error improves to
$\mathcal{O}(e)$ where $e\triangleq \frac{1}{\sqrt{N}}\sqrt{|\mathcal{X}|}$.
Finally, we also devise an algorithm to explicitly construct a local policy.
With the help of our approximation results, we further establish that the
constructed local policy is within $\mathcal{O}(\max\{e,\epsilon\})$ distance
of the optimal policy, and the sample complexity to achieve such a local policy
is $\mathcal{O}(\epsilon^{-3})$, for any $\epsilon>0$.
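The error bound above is easy to evaluate numerically. The following is a minimal sketch (not the authors' code; the function names are ours and the constants hidden by the $\mathcal{O}(\cdot)$ notation are ignored) that computes the error term $e$ for a few population sizes $N$, in both the general case and the special case where rewards and transitions do not depend on the population's action distribution.

```python
# Minimal illustrative sketch: evaluate the approximation-error term e from
# the abstract for hypothetical values of N, |X|, and |U|. Constants hidden
# by the O(.) notation are ignored.
import math

def error_general(n_agents: int, n_states: int, n_actions: int) -> float:
    """General case: e = (sqrt(|X|) + sqrt(|U|)) / sqrt(N)."""
    return (math.sqrt(n_states) + math.sqrt(n_actions)) / math.sqrt(n_agents)

def error_action_independent(n_agents: int, n_states: int) -> float:
    """Special case (rewards/transitions independent of the population's
    action distribution): e = sqrt(|X|) / sqrt(N)."""
    return math.sqrt(n_states) / math.sqrt(n_agents)

if __name__ == "__main__":
    X, U = 10, 5  # hypothetical per-agent state and action space sizes
    for N in (100, 1_000, 10_000):
        print(f"N={N:>6}: general e={error_general(N, X, U):.4f}, "
              f"action-independent e={error_action_independent(N, X):.4f}")
```

Both quantities decay as $1/\sqrt{N}$, which is why locally executable policies become near-optimal as the population grows.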
Related papers
- Robust Distribution Learning with Local and Global Adversarial Corruptions [17.22168727622332]
We develop an efficient finite-sample algorithm with error bounded by $\sqrt{\varepsilon k} + \rho + \tilde{O}(d\sqrt{k}\,n^{-1/(k \lor 2)})$ when $P$ has bounded covariance.
Our efficient procedure relies on a novel trace norm approximation of an ideal yet intractable 2-Wasserstein projection estimator.
arXiv Detail & Related papers (2024-06-10T17:48:36Z) - Multiple-policy Evaluation via Density Estimation [30.914344538340412]
We propose an algorithm named $\mathrm{CAESAR}$ for this problem.
Up to low-order and logarithmic terms, $\mathrm{CAESAR}$ achieves a sample complexity of $\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^{H}\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{\pi^k}(s,a))^2}{\mu^*_h(s,a)}\right)$, where $d_h^{\pi^k}$ denotes the state-action visitation distribution of policy $\pi^k$ at step $h$.
arXiv Detail & Related papers (2024-03-29T23:55:25Z) - Nearly Optimal Regret for Decentralized Online Convex Optimization [53.433398074919]
Decentralized online convex optimization (D-OCO) aims to minimize a sequence of global loss functions using only local computations and communications.
We develop novel D-OCO algorithms that can respectively reduce the regret bounds for convex and strongly convex functions.
Our algorithms are nearly optimal in terms of $T$, $n$, and $\rho$.
arXiv Detail & Related papers (2024-02-14T13:44:16Z) - Mean-Field Control based Approximation of Multi-Agent Reinforcement
Learning in Presence of a Non-decomposable Shared Global State [37.63373979256335]
Mean Field Control (MFC) is a powerful approximation tool to solve large-scale Multi-Agent Reinforcement Learning (MARL) problems.
Here we demonstrate that even in a MARL setting where agents share a common global state, the MFC can still be applied as a good approximation tool.
arXiv Detail & Related papers (2023-01-13T18:55:58Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - Nearly Optimal Policy Optimization with Stable at Any Time Guarantee [53.155554415415445]
The best regret bound of the policy-based method in Shani et al. (2020) is only $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes, and there is a $\sqrt{SH}$ gap compared with the information-theoretic lower bound $\tilde{\Omega}(\sqrt{SAH^3K})$.
arXiv Detail & Related papers (2021-12-21T01:54:17Z) - On the Approximation of Cooperative Heterogeneous Multi-Agent
Reinforcement Learning (MARL) using Mean Field Control (MFC) [33.833747074900856]
Mean field control (MFC) is an effective way to mitigate the curse of dimensionality of cooperative multi-agent reinforcement learning problems.
This work considers a collection of $N_{\mathrm{pop}}$ heterogeneous agents that can be segregated into $K$ classes.
arXiv Detail & Related papers (2021-09-09T03:52:49Z) - Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}\left(1/\epsilon \cdot (H^3SA/\rho + H^4S^2A)\right)$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho = \Theta(1/(HS))$ and $H>1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z) - Provably Breaking the Quadratic Error Compounding Barrier in Imitation
Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound of $O(|\mathcal{S}|H^{3/2}/N)$ on the suboptimality using the Mimic-MD algorithm of Rajaraman et al. (2020).
We show that the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}| \geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z) - Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, Staged Sampling + Truncated Planning (SSTP), which interacts with the environment for at most $O\left(\frac{S^2A}{\epsilon^2}\,\mathrm{polylog}\left(\frac{SAH}{\epsilon^2}\right)\right)$ episodes of exploration.
arXiv Detail & Related papers (2020-10-12T17:51:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.