Efficient Optimistic Exploration in Linear-Quadratic Regulators via
Lagrangian Relaxation
- URL: http://arxiv.org/abs/2007.06482v1
- Date: Mon, 13 Jul 2020 16:30:47 GMT
- Title: Efficient Optimistic Exploration in Linear-Quadratic Regulators via
Lagrangian Relaxation
- Authors: Marc Abeille and Alessandro Lazaric
- Abstract summary: We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting.
Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of \ofulq.
We show that an $\epsilon$-optimistic controller can be computed efficiently by solving at most $O\big(\log(1/\epsilon)\big)$ Riccati equations.
- Score: 107.06364966905821
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the exploration-exploitation dilemma in the linear quadratic
regulator (LQR) setting. Inspired by the extended value iteration algorithm
used in optimistic algorithms for finite MDPs, we propose to relax the
optimistic optimization of \ofulq and cast it into a constrained
\textit{extended} LQR problem, where an additional control variable implicitly
selects the system dynamics within a confidence interval. We then move to the
corresponding Lagrangian formulation for which we prove strong duality. As a
result, we show that an $\epsilon$-optimistic controller can be computed
efficiently by solving at most $O\big(\log(1/\epsilon)\big)$ Riccati equations.
Finally, we prove that relaxing the original \ofu problem does not impact the
learning performance, thus recovering the $\tilde{O}(\sqrt{T})$ regret of
\ofulq. To the best of our knowledge, this is the first computationally
efficient confidence-based algorithm for LQR with worst-case optimal regret
guarantees.
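The $O\big(\log(1/\epsilon)\big)$ Riccati-solve count comes from reducing the optimistic optimization to a one-dimensional search over the Lagrangian multiplier, where each candidate multiplier is evaluated by solving a single Riccati equation. The Python sketch below illustrates only that bisection pattern under simplified assumptions: the estimated dynamics `A_hat`, `B_hat`, the cost budget, and the way the multiplier inflates the state cost are illustrative stand-ins, not the paper's actual extended-LQR construction.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def riccati_solve(A, B, Q, R):
    """Solve the discrete algebraic Riccati equation and return the
    value matrix P and the optimal feedback gain K (u = -K x)."""
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K

def dual_cost(lam, A_hat, B_hat, Q, R):
    """Stand-in dual evaluation: the multiplier lam inflates the state
    cost of an LQR built on the estimated dynamics (A_hat, B_hat).
    One evaluation costs exactly one Riccati solve."""
    P, _ = riccati_solve(A_hat, B_hat, Q + lam * np.eye(Q.shape[0]), R)
    return np.trace(P)

def bisect_multiplier(A_hat, B_hat, Q, R, budget, lam_hi=100.0, eps=1e-3):
    """Bisection on the scalar multiplier: find, to accuracy eps, the
    largest lam whose Riccati cost stays within budget.  Each halving of
    the interval costs one Riccati solve, so the total number of solves
    is O(log(1/eps))."""
    lam_lo, n_solves = 0.0, 0
    while lam_hi - lam_lo > eps:
        lam = 0.5 * (lam_lo + lam_hi)
        n_solves += 1
        if dual_cost(lam, A_hat, B_hat, Q, R) <= budget:
            lam_lo = lam  # cost is monotone in lam, so we can push it up
        else:
            lam_hi = lam
    return lam_lo, n_solves

if __name__ == "__main__":
    A_hat = np.array([[1.0, 0.1], [0.0, 1.0]])  # toy estimated dynamics
    B_hat = np.array([[0.0], [0.1]])
    Q, R = np.eye(2), np.eye(1)
    lam, n = bisect_multiplier(A_hat, B_hat, Q, R, budget=500.0)
    print(f"multiplier ~ {lam:.4f} found with {n} Riccati solves")
```

Because the Riccati cost above is monotone in the multiplier, halving the interval at every step reaches accuracy $\epsilon$ after $O\big(\log(1/\epsilon)\big)$ solves, the same order as the guarantee stated in the abstract.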
Related papers
- Second Order Methods for Bandit Optimization and Control [34.51425758864638]
We show that our algorithm achieves optimal regret bounds for a large class of convex functions that we call $\kappa$-convex.
We also investigate the adaptation of our second-order bandit algorithm to online convex optimization with memory.
arXiv Detail & Related papers (2024-02-14T04:03:38Z) - Accelerated Optimization Landscape of Linear-Quadratic Regulator [0.0]
The linear-quadratic regulator (LQR) is a landmark problem in the field of optimal control.
A Lipschitz Hessian property of LQR is presented.
An Euler scheme is utilized to discretize the hybrid dynamic system.
arXiv Detail & Related papers (2023-07-07T13:34:27Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(\sqrt{K})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Optimal Dynamic Regret in LQR Control [23.91519151164528]
We consider the problem of nonstochastic control with a sequence of quadratic losses, i.e., LQR control.
We provide an online algorithm that achieves an optimal dynamic (policy) regret of $\tilde{O}\big(\max\{n^{1/3}\,\mathcal{TV}(M_{1:n})^{2/3},\, 1\}\big)$.
arXiv Detail & Related papers (2022-06-18T18:00:21Z) - Efficient and Optimal Algorithms for Contextual Dueling Bandits under
Realizability [59.81339109121384]
We study the $K$-armed contextual dueling bandit problem, a sequential decision-making setting in which the learner uses contextual information to make two decisions, but only observes preference-based feedback suggesting that one decision was better than the other.
We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.
arXiv Detail & Related papers (2021-11-24T07:14:57Z) - Dynamic Regret Minimization for Control of Non-stationary Linear
Dynamical Systems [18.783925692307054]
We present an algorithm that achieves the optimal dynamic regret of $\tilde{\mathcal{O}}(\sqrt{ST})$ where $S$ is the number of switches.
The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual Multi-armed Bandit problems.
arXiv Detail & Related papers (2021-11-06T01:30:51Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z) - Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde{O}(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards.
To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z) - Naive Exploration is Optimal for Online LQR [49.681825576239355]
We show that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
Our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been conjectured due to the apparent strong convexity of the problem.
arXiv Detail & Related papers (2020-01-27T03:44:54Z)