Approximate Thompson Sampling for Learning Linear Quadratic Regulators with $O(\sqrt{T})$ Regret
- URL: http://arxiv.org/abs/2405.19380v1
- Date: Wed, 29 May 2024 03:24:56 GMT
- Title: Approximate Thompson Sampling for Learning Linear Quadratic Regulators with $O(\sqrt{T})$ Regret
- Authors: Yeoneung Kim, Gihun Kim, Insoon Yang
- Abstract summary: We propose an approximate Thompson sampling algorithm that learns linear quadratic regulators (LQR) with an improved Bayesian regret bound of $O(\sqrt{T})$.
We show that the excitation signal induces the minimum eigenvalue of the preconditioner to grow over time, thereby accelerating the approximate posterior sampling process.
- Score: 10.541541376305243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an approximate Thompson sampling algorithm that learns linear quadratic regulators (LQR) with an improved Bayesian regret bound of $O(\sqrt{T})$. Our method leverages Langevin dynamics with a meticulously designed preconditioner as well as a simple excitation mechanism. We show that the excitation signal induces the minimum eigenvalue of the preconditioner to grow over time, thereby accelerating the approximate posterior sampling process. Moreover, we identify nontrivial concentration properties of the approximate posteriors generated by our algorithm. These properties enable us to bound the moments of the system state and attain an $O(\sqrt{T})$ regret bound without the unrealistic restrictive assumptions on parameter sets that are often used in the literature.
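As a rough illustration of the ingredients described in the abstract (a data-dependent preconditioner for Langevin dynamics, certainty-equivalent LQR gains computed from the sampled model, and a simple additive excitation signal), here is a minimal sketch. The function names, step sizes, number of Langevin iterations, and the isotropic excitation noise are assumptions for illustration, not the authors' exact design.

```python
import numpy as np
from scipy.linalg import solve_discrete_are, sqrtm

def neg_log_posterior_grad(Theta, Z, X_next, lam=1.0):
    """Gradient of the Gaussian negative log posterior over Theta,
    where x_{t+1} = Theta^T z_t + w_t and z_t = [x_t; u_t]."""
    return Z.T @ (Z @ Theta - X_next) + lam * Theta

def preconditioned_langevin_sample(Theta, Z, X_next, lam=1.0, eta=0.05, n_steps=500):
    """Approximate posterior sample via preconditioned (unadjusted) Langevin dynamics.

    The preconditioner is the inverse regularized Gram matrix V^{-1}; as excitation
    makes the minimum eigenvalue of V grow, the dynamics mix faster."""
    d = Theta.shape[0]
    V = lam * np.eye(d) + Z.T @ Z                      # regularized Gram matrix
    P = np.linalg.inv(V)                               # preconditioner
    P_half = np.real(sqrtm(P))
    for _ in range(n_steps):
        g = neg_log_posterior_grad(Theta, Z, X_next, lam)
        Theta = Theta - eta * P @ g + np.sqrt(2.0 * eta) * P_half @ np.random.randn(*Theta.shape)
    return Theta

def riccati_gain(A, B, Q, R):
    """Certainty-equivalent LQR gain for the sampled model (u = -K x)."""
    P = solve_discrete_are(A, B, Q, R)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Toy usage: transitions from a known stable system, then one Thompson-sampling control step.
rng = np.random.default_rng(0)
A_true, B_true = np.array([[0.9, 0.1], [0.0, 0.8]]), np.array([[0.0], [1.0]])
nx, nu = 2, 1
X, U = rng.normal(size=(200, nx)), rng.normal(size=(200, nu))
Z = np.hstack([X, U])
X_next = X @ A_true.T + U @ B_true.T + 0.05 * rng.normal(size=(200, nx))

Theta_s = preconditioned_langevin_sample(np.zeros((nx + nu, nx)), Z, X_next)
A_s, B_s = Theta_s[:nx].T, Theta_s[nx:].T
K = riccati_gain(A_s, B_s, np.eye(nx), np.eye(nu))
x = np.zeros(nx)
u = -K @ x + 0.1 * rng.normal(size=nu)                 # feedback plus excitation noise
```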
Related papers
- A Sample Efficient Alternating Minimization-based Algorithm For Robust Phase Retrieval [56.67706781191521]
In this work, we study the robust phase retrieval problem, where the task is to recover an unknown signal from corrupted magnitude measurements.
Our proposed oracle avoids the need for computationally expensive spectral methods, using a simple gradient step while remaining robust to outliers.
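A minimal, purely illustrative sketch of an alternating scheme in this spirit is given below: it alternates sign estimation with a subgradient step on a least-absolute-deviations loss, which downweights outliers. The warm start, step sizes, and loss are assumptions; this is not the paper's oracle and carries none of its guarantees.

```python
import numpy as np

def robust_phase_retrieval(A, y, x0, n_iters=500, step=None):
    """Illustrative alternating scheme for real-valued robust phase retrieval.

    Measurements: y_i ~ |a_i^T x|, a fraction of them corrupted by outliers.
    Alternates (i) sign estimation and (ii) a subgradient step on the robust
    l1 loss  sum_i |a_i^T x - s_i y_i|  with a diminishing step size."""
    m, _ = A.shape
    step = step if step is not None else 1.0 / m
    x = x0.copy()
    for k in range(n_iters):
        s = np.sign(A @ x)                        # (i) current sign estimates
        s[s == 0] = 1.0
        r = A @ x - s * y                         # residuals against signed measurements
        x = x - (step / np.sqrt(k + 1)) * A.T @ np.sign(r)   # (ii) robust subgradient step
    return x

# Toy usage: magnitude measurements with a few gross outliers, warm-started for
# illustration (the paper's method handles initialization without a spectral step).
rng = np.random.default_rng(1)
n, m = 20, 400
x_true = rng.normal(size=n)
A = rng.normal(size=(m, n))
y = np.abs(A @ x_true)
y[rng.choice(m, size=20, replace=False)] = rng.exponential(5.0, size=20)   # outliers
x_hat = robust_phase_retrieval(A, y, x0=x_true + 0.5 * rng.normal(size=n))
```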
arXiv Detail & Related papers (2024-09-07T06:37:23Z) - Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We directly sample the Q function from its posterior distribution by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
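A minimal sketch of this idea with a linear Q-parameterization is shown below; the loss, frozen TD targets, temperature, and step size are assumptions for illustration rather than the paper's exact algorithm.

```python
import numpy as np

def lmc_q_sample(w, batch, gamma=0.99, step=5e-3, temp=1e-2, n_steps=50, prior=1.0):
    """One round of Langevin Monte Carlo on linear Q-function parameters.

    batch: tuples (phi_sa, reward, phi_next) with features of the next state's
    chosen action; TD targets are frozen for the round, so each noisy gradient
    step targets an (unnormalized) posterior over w."""
    Phi = np.stack([b[0] for b in batch])
    r = np.array([b[1] for b in batch])
    Phi_next = np.stack([b[2] for b in batch])
    target = r + gamma * Phi_next @ w                 # frozen TD targets
    for _ in range(n_steps):
        grad = Phi.T @ (Phi @ w - target) + prior * w
        w = w - step * grad + np.sqrt(2.0 * step * temp) * np.random.randn(w.size)
    return w

# The sampled w plays the role of a posterior sample in Thompson sampling:
# the agent then acts greedily with respect to Q(s, a) = phi(s, a) @ w.
d = 8
rng = np.random.default_rng(0)
batch = [(rng.normal(size=d), rng.normal(), rng.normal(size=d)) for _ in range(64)]
w_sample = lmc_q_sample(np.zeros(d), batch)
```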
arXiv Detail & Related papers (2023-05-29T17:11:28Z) - Constrained Online Two-stage Stochastic Optimization: Near Optimal Algorithms via Adversarial Learning [1.994307489466967]
We consider an online two-stage stochastic optimization problem with long-term constraints over a finite horizon of $T$ periods.
We develop online algorithms for this two-stage problem by building on adversarial learning algorithms.
arXiv Detail & Related papers (2023-02-02T10:33:09Z) - Thompson Sampling for High-Dimensional Sparse Linear Contextual Bandits [17.11922027966447]
This work provides theoretical guarantees of Thompson sampling in high dimensional and sparse contextual bandits.
For faster computation, we use a spike-and-slab prior to model the unknown parameter and variational inference instead of MCMC.
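Below is a minimal sketch of the sampling side of this approach: given a fitted mean-field spike-and-slab variational approximation (hypothetical inclusion probabilities and slab means/variances), a Thompson-sampling action is obtained by drawing a sparse parameter and maximizing the predicted reward. Fitting the variational approximation itself is not shown.

```python
import numpy as np

def sample_sparse_theta(incl_prob, slab_mean, slab_std, rng):
    """Draw theta from a mean-field spike-and-slab posterior approximation:
    coordinate j is 0 (spike) w.p. 1 - incl_prob[j], else Gaussian (slab)."""
    z = rng.random(incl_prob.size) < incl_prob
    return z * rng.normal(slab_mean, slab_std)

def thompson_action(contexts, incl_prob, slab_mean, slab_std, rng):
    """Pick the arm whose context has the highest reward under the sampled parameter."""
    theta = sample_sparse_theta(incl_prob, slab_mean, slab_std, rng)
    return int(np.argmax(contexts @ theta))

# Toy usage with a hypothetical fitted posterior over a 100-dimensional, 5-sparse parameter.
rng = np.random.default_rng(0)
d, n_arms = 100, 10
incl_prob = np.full(d, 0.02); incl_prob[:5] = 0.95
slab_mean = np.zeros(d); slab_mean[:5] = 1.0
slab_std = np.full(d, 0.1)
contexts = rng.normal(size=(n_arms, d))
arm = thompson_action(contexts, incl_prob, slab_mean, slab_std, rng)
```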
arXiv Detail & Related papers (2022-11-11T02:23:39Z) - Exploration in Linear Bandits with Rich Action Sets and its Implications for Inference [23.364534479492715]
We show that the minimum eigenvalue of the expected design matrix grows as $\Omega(\sqrt{n})$ whenever the expected cumulative regret of the algorithm is $O(\sqrt{n})$.
We apply our result to two practical scenarios: model selection and clustering in linear bandits.
arXiv Detail & Related papers (2022-07-23T20:25:07Z) - Model-free Learning of Regions of Attraction via Recurrent Sets [5.032993162348713]
We propose to learn sets that satisfy a more relaxed notion of containment known as recurrence.
We show that under mild assumptions a $\tau$-recurrent set containing a stable equilibrium must be a subset of its region of attraction (ROA).
We then leverage this property to develop algorithms that compute inner approximations of the ROA using counter-examples of recurrence.
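A minimal sketch of the counter-example idea, under simplifying assumptions (candidate sets are norm balls, the dynamics are accessed only through a simulator, and any sampled trajectory that stays outside the ball for more than $\tau$ steps counts as a counter-example that shrinks the ball), is shown below; the paper's algorithm and guarantees are more refined.

```python
import numpy as np

def step(x, dt=0.05):
    """Black-box simulator: a damped pendulum with a stable equilibrium at the origin."""
    theta, omega = x
    return np.array([theta + dt * omega,
                     omega + dt * (-np.sin(theta) - 0.5 * omega)])

def tau_recurrent_from(x0, radius, tau, horizon=300):
    """A trajectory violates tau-recurrence if it leaves the ball of the given
    radius and fails to return within tau steps."""
    x, outside = np.array(x0, dtype=float), 0
    for _ in range(horizon):
        x = step(x)
        outside = outside + 1 if np.linalg.norm(x) > radius else 0
        if outside > tau:
            return False
    return True

def inner_roa_radius(radius=3.0, tau=20, n_samples=100, shrink=0.9, rng=None):
    """Shrink the candidate ball whenever a sampled trajectory is a counter-example;
    the surviving ball is the estimated inner approximation of the ROA."""
    rng = rng or np.random.default_rng(0)
    while radius > 1e-3:
        x0s = rng.uniform(-radius, radius, size=(n_samples, 2))
        x0s = x0s[np.linalg.norm(x0s, axis=1) <= radius]
        if all(tau_recurrent_from(x0, radius, tau) for x0 in x0s):
            return radius                       # no counter-example found
        radius *= shrink                        # counter-example: shrink the set
    return 0.0

print(inner_roa_radius())
```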
arXiv Detail & Related papers (2022-04-21T19:03:17Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - A relaxed technical assumption for posterior sampling-based reinforcement learning for control of unknown linear systems [5.715788333977634]
We revisit the Thompson sampling algorithm recently proposed by Ouyang et al. for controlling an unknown linear quadratic (LQ) system.
We show that by making a minor modification in the algorithm, this technical assumption on the induced norm can be replaced by a milder assumption in terms of the spectral radius of the closed loop system.
arXiv Detail & Related papers (2021-08-19T05:25:28Z) - Analysis and Design of Thompson Sampling for Stochastic Partial Monitoring [91.22679787578438]
We present a novel Thompson-sampling-based algorithm for partial monitoring.
We prove that the new algorithm achieves a logarithmic problem-dependent expected pseudo-regret of $\mathrm{O}(\log T)$ for a linearized variant of the problem with local observability.
arXiv Detail & Related papers (2020-06-17T05:48:33Z) - On Thompson Sampling with Langevin Algorithms [106.78254564840844]
Thompson sampling for multi-armed bandit problems enjoys favorable performance in both theory and practice.
However, it suffers from a significant computational limitation, arising from the need to draw samples from posterior distributions at every iteration.
We propose two Markov Chain Monte Carlo (MCMC) methods tailored to Thompson sampling to address this issue.
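Below is a minimal sketch of this idea for a Gaussian-reward multi-armed bandit: instead of drawing the exact conjugate posterior sample for each arm, the sample is produced by a few unadjusted Langevin steps on the arm's negative log posterior. The prior, step-size rule, and reward model are assumptions for illustration, not the paper's specific methods.

```python
import numpy as np

def langevin_posterior_sample(rewards, mu0=0.0, tau0=1.0, noise_var=1.0,
                              n_steps=100, rng=None):
    """Approximate sample of an arm's mean reward via unadjusted Langevin dynamics,
    with a N(mu0, tau0) prior and Gaussian likelihood of variance noise_var."""
    rng = rng or np.random.default_rng()
    n, s = len(rewards), float(np.sum(rewards))
    step = 1.0 / (1.0 / tau0 + n / noise_var)   # step size tied to posterior precision
    theta = mu0
    for _ in range(n_steps):
        grad = -(theta - mu0) / tau0 + (s - n * theta) / noise_var   # grad of log posterior
        theta = theta + 0.5 * step * grad + np.sqrt(step) * rng.normal()
    return theta

# Thompson sampling loop on a toy three-armed Gaussian bandit.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])
history = [[] for _ in true_means]
for t in range(200):
    samples = [langevin_posterior_sample(h, rng=rng) for h in history]
    arm = int(np.argmax(samples))
    history[arm].append(true_means[arm] + rng.normal())
```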
arXiv Detail & Related papers (2020-02-23T22:35:29Z) - Frequentist Regret Bounds for Randomized Least-Squares Value Iteration [94.47472987987805]
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL).
In this paper, we introduce an optimistically-perturbed variant of the popular randomized least-squares value iteration (RLSVI) algorithm.
Under the assumption that the Markov decision process has low-rank transition dynamics, we prove that the frequentist regret of RLSVI is upper-bounded by $\widetilde{O}(d^2 H^2 \sqrt{T})$, where $d$ is the feature dimension, $H$ is the horizon, and $T$ is the total number of steps.
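Below is a minimal sketch of plain randomized least-squares value iteration with linear features: each backward sweep draws the stage-$h$ parameter from a Gaussian centered at the ridge-regression solution. The optimistically-perturbed variant analyzed in the paper modifies this perturbation; the feature map, dataset layout, and constants here are assumptions.

```python
import numpy as np

def rlsvi_backward_pass(data, phi, n_actions, H, d, sigma=1.0, lam=1.0, rng=None):
    """One randomized value-iteration sweep over a fixed dataset.

    data[h] holds (s, a, r, s_next) tuples for stage h; phi(s, a) returns a
    d-dimensional feature vector. Returns one sampled parameter per stage."""
    rng = rng or np.random.default_rng()
    theta = [np.zeros(d) for _ in range(H + 1)]          # theta[H] = 0 (terminal value)
    for h in reversed(range(H)):
        Phi = np.stack([phi(s, a) for s, a, _, _ in data[h]])
        targets = np.array([
            r + max(phi(s_next, b) @ theta[h + 1] for b in range(n_actions))
            for _, _, r, s_next in data[h]
        ])
        Sigma = lam * np.eye(d) + Phi.T @ Phi
        mean = np.linalg.solve(Sigma, Phi.T @ targets)   # ridge-regression estimate
        cov = sigma ** 2 * np.linalg.inv(Sigma)
        theta[h] = rng.multivariate_normal(mean, cov)    # randomized (sampled) value function
    return theta[:H]

# Acting greedily with respect to the sampled value functions drives exploration,
# playing the role that posterior sampling or optimism plays in other algorithms.
rng = np.random.default_rng(0)
H, d, n_actions = 3, 4, 2
phi = lambda s, a: np.concatenate([s, [float(a), 1.0]])
data = [[(rng.normal(size=2), int(rng.integers(n_actions)), rng.normal(), rng.normal(size=2))
         for _ in range(50)] for _ in range(H)]
thetas = rlsvi_backward_pass(data, phi, n_actions, H, d, rng=rng)
```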
arXiv Detail & Related papers (2019-11-01T19:48:57Z)