Thompson sampling for linear quadratic mean-field teams
- URL: http://arxiv.org/abs/2011.04686v1
- Date: Mon, 9 Nov 2020 19:07:32 GMT
- Title: Thompson sampling for linear quadratic mean-field teams
- Authors: Mukul Gagrani, Sagar Sudhakara, Aditya Mahajan, Ashutosh Nayyar and Yi
Ouyang
- Abstract summary: We consider optimal control of an unknown multi-agent linear quadratic (LQ) system where the dynamics and the cost are coupled across the agents.
We propose a new Thompson sampling based learning algorithm which exploits the structure of the system model and show that the expected Bayesian regret of our proposed algorithm for a system with agents of $|M|$ different types at time horizon $T$ is $\tilde{\mathcal{O}}\big(|M|^{1.5}\sqrt{T}\big)$, irrespective of the total number of agents.
- Score: 3.957353452014781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider optimal control of an unknown multi-agent linear quadratic (LQ)
system where the dynamics and the cost are coupled across the agents through
the mean-field (i.e., empirical mean) of the states and controls. Directly
using single-agent LQ learning algorithms in such models results in regret
which increases polynomially with the number of agents. We propose a new
Thompson sampling based learning algorithm which exploits the structure of the
system model and show that the expected Bayesian regret of our proposed
algorithm for a system with agents of $|M|$ different types at time horizon $T$
is $\tilde{\mathcal{O}} \big( |M|^{1.5} \sqrt{T} \big)$ irrespective of the
total number of agents, where the $\tilde{\mathcal{O}}$ notation hides
logarithmic factors in $T$. We present detailed numerical experiments to
illustrate the salient features of the proposed algorithm.
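To make the learning loop concrete, here is a minimal Thompson-sampling sketch for a scalar LQ system. It is an illustration under simplifying assumptions, not the paper's algorithm: a single scalar plant stands in for the mean-field team, the noise variance is known, episodes have a fixed length, and all names and numeric bounds are made up. The paper's method maintains an analogous posterior for each of the $|M|$ agent types over mean-field coordinates.

```python
import numpy as np

# Minimal Thompson-sampling sketch for a *scalar* LQ system with unknown
# dynamics x' = a x + b u + w, w ~ N(0, sigma^2), and per-step cost
# q x^2 + r u^2. Illustration only: the paper's algorithm instead keeps
# one such posterior per agent type, applied to mean-field coordinates,
# and uses a dynamic episode schedule rather than the fixed one here.

rng = np.random.default_rng(0)
a_true, b_true, sigma = 0.8, 1.0, 1.0   # unknown to the learner
q, r = 1.0, 1.0

mu, Lambda = np.zeros(2), np.eye(2)     # Gaussian posterior over (a, b)

def lqr_gain(a, b, iters=200):
    """Certainty-equivalent LQR gain via scalar Riccati iteration."""
    p = q
    for _ in range(iters):
        p = q + a * a * p * r / (r + b * b * p)
    return a * b * p / (r + b * b * p)  # optimal control is u = -K x

def sample_model():
    """Posterior sample, rejected onto a compact set assumed to contain
    the true parameters (TS-for-LQ analyses use a similar restriction)."""
    cov = np.linalg.inv(Lambda)
    while True:
        a, b = rng.multivariate_normal(mu, cov)
        if abs(a) <= 2.0 and 0.5 <= abs(b) <= 2.0:
            return a, b

x, T, episode_len = 0.0, 1000, 50
for t in range(T):
    if t % episode_len == 0:            # resample at episode boundaries
        K = lqr_gain(*sample_model())
    u = -K * x
    x_next = a_true * x + b_true * u + sigma * rng.standard_normal()
    z = np.array([x, u])                # Bayesian linear-regression update
    Lambda_new = Lambda + np.outer(z, z) / sigma**2
    mu = np.linalg.solve(Lambda_new, Lambda @ mu + z * x_next / sigma**2)
    Lambda = Lambda_new
    x = x_next
```

The design choice to resample only at episode boundaries (rather than every step) is what controls the policy-switching cost in the regret analysis; within an episode the sampled model's certainty-equivalent controller is held fixed while the posterior keeps absorbing data.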
Related papers
- Cooperative Multi-Agent Constrained Stochastic Linear Bandits [2.099922236065961]
A network of $N$ agents communicate locally to minimize their collective regret while keeping their expected cost under a specified threshold $\tau$.
We propose a safe distributed upper confidence bound algorithm, called MA-OPLB, and establish a high-probability bound on its $T$-round regret.
We show that our regret bound is of order $\mathcal{O}\left(\frac{d}{\tau-c_0}\,\frac{\log(NT)^2}{\sqrt{N}}\,\sqrt{T\log(1/|\lambda|)}\right)$.
arXiv Detail & Related papers (2024-10-22T19:34:53Z) - Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) frameworks, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z) - Quantum option pricing via the Karhunen-Loève expansion [11.698830761241107]
We consider the problem of pricing discretely monitored Asian options over $T$ monitoring points where the underlying asset is modeled by a geometric Brownian motion.
We provide two quantum algorithms with complexity poly-logarithmic in $T$ and in $1/\epsilon$, where $\epsilon$ is the additive approximation error.
arXiv Detail & Related papers (2024-02-15T17:37:23Z) - Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with
General Utilities [12.104551746465932]
We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints.
Our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$.
In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP.
arXiv Detail & Related papers (2023-05-27T20:08:35Z) - Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z) - Representation Learning for General-sum Low-rank Markov Games [63.119870889883224]
We study multi-agent general-sum Markov games with nonlinear function approximation.
We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation.
arXiv Detail & Related papers (2022-10-30T22:58:22Z) - Computationally Efficient Horizon-Free Reinforcement Learning for Linear
Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least-squares estimator for the unknown transition dynamics (a generic sketch of such an estimator appears after this list).
This also improves upon the best-known algorithms in this setting when the $\sigma_k^2$'s are known.
arXiv Detail & Related papers (2022-05-23T17:59:18Z) - Minimax Optimal Quantization of Linear Models: Information-Theoretic
Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z) - Scalable regret for learning to control network-coupled subsystems with
unknown dynamics [5.670584589057048]
Viewing the interconnected subsystems globally results in a regret that increases super-linearly with the number of subsystems.
We propose a new Thompson sampling based learning algorithm which exploits the structure of the underlying network.
We show that the expected regret of the proposed algorithm is bounded by $\tilde{\mathcal{O}}\big(n\sqrt{T}\big)$, where $n$ is the number of subsystems, $T$ is the time horizon, and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms in $n$ and $T$.
arXiv Detail & Related papers (2021-08-18T04:45:34Z) - Towards General Function Approximation in Zero-Sum Markov Games [126.58493169301012]
This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves.
Provably efficient algorithms for both decoupled and coordinated settings are developed.
arXiv Detail & Related papers (2021-07-30T15:25:13Z) - Regret Bounds for Decentralized Learning in Cooperative Multi-Agent
Dynamical Systems [3.9599054392856488]
Linear quadratic analysis is challenging in Multi-Agent Reinforcement Learning (MARL).
We propose a MARL algorithm based on the construction of an auxiliary single-agent LQ problem.
We show that our algorithm provides a $\tilde{O}(\sqrt{T})$ regret bound.
arXiv Detail & Related papers (2020-01-27T23:37:41Z)
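As a side note on the weighted least-squares estimator mentioned in the horizon-free linear mixture MDP entry above, a generic inverse-variance-weighted ridge regression looks as follows. This is a sketch under the assumption that per-sample variance estimates are available; the function name and regularization constant are illustrative, and the papers above use carefully constructed variance upper bounds rather than raw estimates.

```python
import numpy as np

# Generic inverse-variance-weighted ridge regression (sketch only).
# Model: y_k = <phi_k, theta> + noise_k with Var(noise_k) = sigma2[k].
# Down-weighting high-variance samples is the core of the weighted
# least-squares estimators in the horizon-free MDP line of work.

def weighted_ridge(Phi, y, sigma2, lam=1.0):
    """Phi: (n, d) feature matrix; y: (n,) targets; sigma2: (n,) variances."""
    w = 1.0 / np.maximum(sigma2, 1e-8)          # inverse-variance weights
    A = Phi.T @ (w[:, None] * Phi) + lam * np.eye(Phi.shape[1])
    b = Phi.T @ (w * y)
    return np.linalg.solve(A, b)                # regularized WLS estimate
```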
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.