Minimal Expected Regret in Linear Quadratic Control
- URL: http://arxiv.org/abs/2109.14429v1
- Date: Wed, 29 Sep 2021 14:07:21 GMT
- Title: Minimal Expected Regret in Linear Quadratic Control
- Authors: Yassir Jedra, Alexandre Proutiere
- Abstract summary: We devise an online learning algorithm and provide guarantees on its expected regret.
This regret at time $T$ is upper bounded (i) by $\widetilde{O}((d_u+d_x)\sqrt{d_x T})$ when $A$ and $B$ are unknown.
- Score: 79.81807680370677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of online learning in Linear Quadratic Control
systems whose state transition and state-action transition matrices $A$ and $B$
may be initially unknown. We devise an online learning algorithm and provide
guarantees on its expected regret. This regret at time $T$ is upper bounded (i)
by $\widetilde{O}((d_u+d_x)\sqrt{d_xT})$ when $A$ and $B$ are unknown, (ii) by
$\widetilde{O}(d_x^2\log(T))$ if only $A$ is unknown, and (iii) by
$\widetilde{O}(d_x(d_u+d_x)\log(T))$ if only $B$ is unknown and under some mild
non-degeneracy condition ($d_x$ and $d_u$ denote the dimensions of the state
and of the control input, respectively). These regret scalings are minimal in
$T$, $d_x$ and $d_u$ as they match existing lower bounds in scenario (i) when
$d_x\le d_u$ [SF20], and in scenario (ii) [lai1986]. We conjecture that our
upper bounds are also optimal in scenario (iii) (there is no known lower bound
in this setting).
Existing online algorithms proceed in epochs of (typically exponentially)
growing durations. The control policy is fixed within each epoch, which
considerably simplifies the analysis of the estimation error on $A$ and $B$ and
hence of the regret. Our algorithm departs from this design choice: it is a
simple variant of certainty-equivalence regulators, where the estimates of $A$
and $B$ and the resulting control policy can be updated as frequently as we
wish, possibly at every step. Quantifying the impact of such a
constantly-varying control policy on the performance of these estimates and on
the regret constitutes one of the technical challenges tackled in this paper.
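To make the algorithmic idea in the abstract concrete, here is a minimal, hypothetical Python sketch (not the paper's exact procedure): the pair $(A, B)$ is estimated by regularized least squares after every transition, and the certainty-equivalence feedback gain is recomputed at every step by solving a Riccati equation for the current estimates, instead of being held fixed over epochs. The dimensions, noise scales, regularization, and exploration noise below are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of a certainty-equivalence LQR loop with per-step updates.
# All constants are illustrative; this is not the authors' exact algorithm.
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(0)
d_x, d_u, T = 3, 2, 1000

# Unknown true system x_{t+1} = A x_t + B u_t + w_t (stable toy example).
A_true = 0.5 * np.eye(d_x)
B_true = rng.standard_normal((d_x, d_u)) / np.sqrt(d_x)
Q, R = np.eye(d_x), np.eye(d_u)

def ce_gain(A_hat, B_hat):
    """Certainty-equivalence gain: solve the discrete Riccati equation for the estimates."""
    P = solve_discrete_are(A_hat, B_hat, Q, R)
    return -np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)

# Regularized least-squares statistics for Theta = [A B].
V = np.eye(d_x + d_u)            # Gram matrix (with regularization)
S = np.zeros((d_x, d_x + d_u))   # cross term: sum_t x_{t+1} z_t^T

x = np.zeros(d_x)
K = np.zeros((d_u, d_x))         # initial gain before any data
cumulative_cost = 0.0            # regret would compare this to the optimal controller's cost
for t in range(T):
    u = K @ x + 0.1 * rng.standard_normal(d_u)   # small exploration noise
    w = 0.1 * rng.standard_normal(d_x)
    x_next = A_true @ x + B_true @ u + w

    # Update the least-squares estimate of (A, B) from the transition (x, u, x_next).
    z = np.concatenate([x, u])
    V += np.outer(z, z)
    S += np.outer(x_next, z)
    Theta_hat = S @ np.linalg.inv(V)
    A_hat, B_hat = Theta_hat[:, :d_x], Theta_hat[:, d_x:]

    # Recompute the certainty-equivalence policy at every step (no epochs).
    try:
        K = ce_gain(A_hat, B_hat)
    except (np.linalg.LinAlgError, ValueError):
        pass  # keep the previous gain if the Riccati solve fails early on

    cumulative_cost += x @ Q @ x + u @ R @ u
    x = x_next
```

The point of the sketch is the design choice highlighted above: because the gain is refreshed at every step, the data used to estimate $(A, B)$ is generated by a constantly-varying policy, which is precisely what complicates the estimation and regret analysis.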
Related papers
- Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost [31.04961854943877]
We propose a new algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^4 S^2 A T})$ while requiring a switching cost of $O(HSA \log\log T)$.
As a byproduct, our new algorithmic techniques allow us to derive a reward-free exploration algorithm with an optimal switching cost of $O(HSA)$.
arXiv Detail & Related papers (2022-02-13T19:01:06Z) - Logarithmic Regret from Sublinear Hints [76.87432703516942]
We show that an algorithm can obtain $O(\log T)$ regret with just $O(\sqrt{T})$ hints under a natural query model.
We also show that $o(\sqrt{T})$ hints cannot guarantee better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2021-11-09T16:50:18Z) - Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$
Geometry [69.24618367447101]
Up to logarithmic factors, the optimal excess population loss of any $(\varepsilon,\delta)$-differentially private algorithm is $\sqrt{\log(d)/n} + \sqrt{d}/(\varepsilon n)$.
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $\sqrt{\log(d)/n} + (\log(d)/(\varepsilon n))^{2/3}$.
arXiv Detail & Related papers (2021-03-02T06:53:44Z) - $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning algorithm enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}(H)}{\Delta_{\min}}\log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z) - Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost.
We show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S| |A| K})$ regret in the worst case.
arXiv Detail & Related papers (2020-02-23T09:10:14Z) - Logistic Regression Regret: What's the Catch? [3.7311680121118345]
We derive lower bounds with logarithmic regret under $L_\infty$ constraints on the parameters.
For $L$ constraints, it is shown that for large enough $d$, the regret remains linear in $d$ but no longer logarithmic in $T$.
arXiv Detail & Related papers (2020-02-07T18:36:39Z) - Naive Exploration is Optimal for Online LQR [49.681825576239355]
We show that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
Our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been conjectured in prior work.
arXiv Detail & Related papers (2020-01-27T03:44:54Z)