Optimal Rates for Bandit Nonstochastic Control
- URL: http://arxiv.org/abs/2305.15352v3
- Date: Wed, 25 Oct 2023 01:26:40 GMT
- Title: Optimal Rates for Bandit Nonstochastic Control
- Authors: Y. Jennifer Sun, Stephen Newman, Elad Hazan
- Abstract summary: We give an algorithm for bandit LQR and LQG which attains optimal regret (up to logarithmic factors) for both known and unknown systems.
A central component of our method is a new scheme for bandit convex optimization with memory, which is of independent interest.
- Score: 18.47192040293437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linear Quadratic Regulator (LQR) and Linear Quadratic Gaussian (LQG) control
are foundational and extensively researched problems in optimal control. We
investigate LQR and LQG problems with semi-adversarial perturbations and
time-varying adversarial bandit loss functions. The best-known sublinear regret
algorithm of \cite{gradu2020non} has a $T^{\frac{3}{4}}$ time horizon
dependence, and its authors posed an open question about whether a tight rate
of $\sqrt{T}$ could be achieved. We answer in the affirmative, giving an
algorithm for bandit LQR and LQG which attains optimal regret (up to
logarithmic factors) for both known and unknown systems. A central component of
our method is a new scheme for bandit convex optimization with memory, which is
of independent interest.
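For orientation, the problem can be written in the standard nonstochastic control form; the notation below is illustrative, and the paper's exact comparator class and semi-adversarial noise model may differ:
$$
x_{t+1} = A x_t + B u_t + w_t, \qquad \mathrm{Regret}_T = \sum_{t=1}^{T} c_t(x_t, u_t) - \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t\big(x_t^{\pi}, u_t^{\pi}\big),
$$
where the learner observes only the scalar value $c_t(x_t, u_t)$ at each step (bandit feedback), the perturbations $w_t$ are semi-adversarial, the losses $c_t$ are chosen adversarially, and $\Pi$ is a class of stabilizing (e.g., disturbance-action) policies. The result stated above bounds this regret by $\tilde{O}(\sqrt{T})$ for both known and unknown $(A, B)$.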
Related papers
- Tight Rates for Bandit Control Beyond Quadratics [2.961909021941052]
We develop an algorithm that achieves $\tilde{O}(\sqrt{T})$ optimal regret for bandit non-stochastic control with smooth perturbation functions.
Our main contribution is an algorithm that achieves $\tilde{O}(\sqrt{T})$ regret for Bandit Convex Optimization (BCO) without memory.
arXiv Detail & Related papers (2024-10-01T18:35:08Z)
- Dynamic Regret Minimization for Control of Non-stationary Linear Dynamical Systems [18.783925692307054]
We present an algorithm that achieves the optimal dynamic regret of $\tilde{\mathcal{O}}(\sqrt{ST})$, where $S$ is the number of switches.
The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual Multi-armed Bandit problems.
arXiv Detail & Related papers (2021-11-06T01:30:51Z)
- Finite-time System Identification and Adaptive Control in Autoregressive Exogenous Systems [79.67879934935661]
We study the problem of system identification and adaptive control of unknown ARX systems.
We provide finite-time learning guarantees for the ARX systems under both open-loop and closed-loop data collection.
arXiv Detail & Related papers (2021-08-26T18:00:00Z)
- Efficient Optimistic Exploration in Linear-Quadratic Regulators via Lagrangian Relaxation [107.06364966905821]
We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting.
Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of OFU-LQ.
We show that an $\epsilon$-optimistic controller can be computed efficiently by solving at most $O\big(\log(1/\epsilon)\big)$ Riccati equations; a minimal numerical sketch of such a Riccati solve appears after this list.
arXiv Detail & Related papers (2020-07-13T16:30:47Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- Adaptive Control and Regret Minimization in Linear Quadratic Gaussian (LQG) Setting [91.43582419264763]
We propose LqgOpt, a novel reinforcement learning algorithm based on the principle of optimism in the face of uncertainty.
LqgOpt efficiently explores the system dynamics, estimates the model parameters up to their confidence interval, and deploys the controller of the most optimistic model.
arXiv Detail & Related papers (2020-03-12T19:56:38Z)
- Model Selection in Contextual Stochastic Bandit Problems [51.94632035240787]
We develop a meta-algorithm that selects between base algorithms.
We show through a lower bound that even when one of the base algorithms has $O(\sqrt{T})$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2020-03-03T18:46:34Z)
- Naive Exploration is Optimal for Online LQR [49.681825576239355]
We show that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
Our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm.
arXiv Detail & Related papers (2020-01-27T03:44:54Z)
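The Lagrangian-relaxation entry above measures computational cost in terms of Riccati equation solves. As a point of reference only, here is a minimal sketch of solving a discrete-time algebraic Riccati equation for LQR by fixed-point iteration to tolerance $\epsilon$; the function name and the toy system matrices are assumptions for illustration, not the cited paper's OFU-LQ procedure.

```python
import numpy as np

def solve_dare_by_iteration(A, B, Q, R, eps=1e-8, max_iters=10_000):
    """Fixed-point iteration for the discrete-time algebraic Riccati equation.

    Iterates P <- Q + A^T P A - A^T P B (R + B^T P B)^{-1} B^T P A until
    successive iterates differ by less than eps. Illustrative only:
    optimism-based LQR methods solve such equations for candidate models,
    and the cited paper bounds how many solves an epsilon-optimistic
    controller requires.
    """
    P = Q.copy()
    for _ in range(max_iters):
        BtP = B.T @ P
        K = np.linalg.solve(R + BtP @ B, BtP @ A)   # LQR gain for current P
        P_next = Q + A.T @ P @ (A - B @ K)
        if np.max(np.abs(P_next - P)) < eps:
            return P_next, K
        P = P_next
    return P, K

# Toy double-integrator system (assumed values, for illustration only).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.eye(1)
P, K = solve_dare_by_iteration(A, B, Q, R)
print("Riccati solution P:\n", P)
print("LQR gain K:\n", K)
```

In practice, `scipy.linalg.solve_discrete_are` computes the same solution in a single call; the iterative form is shown only to make the cost of one Riccati solve concrete.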