A relaxed technical assumption for posterior sampling-based
reinforcement learning for control of unknown linear systems
- URL: http://arxiv.org/abs/2108.08502v1
- Date: Thu, 19 Aug 2021 05:25:28 GMT
- Title: A relaxed technical assumption for posterior sampling-based
reinforcement learning for control of unknown linear systems
- Authors: Mukul Gagrani and Sagar Sudhakara and Aditya Mahajan and Ashutosh
Nayyar and Yi Ouyang
- Abstract summary: We revisit the Thompson sampling algorithm to control an unknown linear quadratic (LQ) system recently proposed by Ouyang et al.
We show that by making a minor modification in the algorithm, this technical assumption on the induced norm can be replaced by a milder assumption in terms of the spectral radius of the closed loop system.
- Score: 5.715788333977634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We revisit the Thompson sampling algorithm to control an unknown linear
quadratic (LQ) system recently proposed by Ouyang et al. (arXiv:1709.04047). The
regret bound of the algorithm was derived under a technical assumption on the
induced norm of the closed loop system. In this technical note, we show that by
making a minor modification in the algorithm (in particular, ensuring that an
episode does not end too soon), this technical assumption on the induced norm
can be replaced by a milder assumption in terms of the spectral radius of the
closed loop system. The modified algorithm has the same Bayesian regret of
$\tilde{\mathcal{O}}(\sqrt{T})$, where $T$ is the time-horizon and the
$\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms in~$T$.
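To make the relaxation concrete, here is a short, hedged Python sketch. It illustrates two things: first, that a non-normal closed-loop matrix can satisfy the milder spectral-radius condition while violating an induced-norm bound (using only the standard fact that the spectral radius never exceeds an induced norm); second, the flavour of a TSDE-style episode-stopping rule with a minimum-length guard, which is one way to read "ensuring that an episode does not end too soon". The example matrix, the guard, and all variable names are illustrative assumptions, not taken from the paper.
```python
import numpy as np

# A non-normal closed-loop matrix: its spectral radius is below 1 (so a
# spectral-radius assumption holds) even though its induced 2-norm exceeds 1
# (so an induced-norm assumption would fail on this example).
M = np.array([[0.5, 10.0],
              [0.0, 0.5]])
spectral_radius = max(abs(np.linalg.eigvals(M)))  # 0.5
induced_2_norm = np.linalg.norm(M, 2)             # roughly 10
print(spectral_radius, induced_2_norm)

# Sketch of a TSDE-style episode-stopping rule with a minimum-length guard.
# The exact stopping criteria and guard used in the paper may differ; this is
# only meant to show where such a guard would sit.
def episode_should_end(t, t_k, prev_len, det_now, det_start, min_len):
    grew_past_previous = (t - t_k) >= prev_len + 1   # length exceeded the previous episode's
    information_halved = det_now < 0.5 * det_start   # posterior covariance determinant halved
    return grew_past_previous or (information_halved and (t - t_k) >= min_len)
```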
Related papers
- Approximate Thompson Sampling for Learning Linear Quadratic Regulators with $O(\sqrt{T})$ Regret [10.541541376305243]
We propose an approximate Thompson sampling algorithm that learns linear quadratic regulators (LQR) with an improved Bayesian regret bound of $O(\sqrt{T})$.
We show that the excitation signal induces the minimum eigenvalue of the preconditioner to grow over time, thereby accelerating the approximate posterior sampling process.
arXiv Detail & Related papers (2024-05-29T03:24:56Z) - Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed in Lu et al. (2021) under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical restrictions of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve the $\tilde{O}(\sqrt{(d_1+d_2)^{3/2}Mr^{3/2}T})$ bound of regret, while G-ESTS can achieve the $\tilde{O}$
arXiv Detail & Related papers (2024-01-14T14:14:19Z) - Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement
Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^{K}\sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Squeeze All: Novel Estimator and Self-Normalized Bound for Linear
Contextual Bandits [18.971564419292893]
We propose a linear contextual bandit algorithm with $O(\sqrt{dT\log T})$ regret bound, where $d$ is the dimension of contexts and $T$ is the time horizon.
Our proposed algorithm is equipped with a novel estimator in which exploration is embedded through explicit randomization.
arXiv Detail & Related papers (2022-06-11T02:43:17Z) - Misspecified Gaussian Process Bandit Optimization [59.30399661155574]
Kernelized bandit algorithms have shown strong empirical and theoretical performance for this problem.
We introduce a misspecified kernelized bandit setting where the unknown function can be $\epsilon$-uniformly approximated by a function with a bounded norm in some Reproducing Kernel Hilbert Space (RKHS).
We show that our algorithm achieves optimal dependence on $\epsilon$ with no prior knowledge of misspecification.
arXiv Detail & Related papers (2021-11-09T09:00:02Z) - Scalable regret for learning to control network-coupled subsystems with
unknown dynamics [5.670584589057048]
Viewing the interconnected subsystems globally results in a regret that increases super-linearly with the number of subsystems.
We propose a new Thompson sampling based learning algorithm which exploits the structure of the underlying network.
We show that the expected regret of the proposed algorithm is bounded by $\tilde{\mathcal{O}}\big(n\sqrt{T}\big)$ where $n$ is the number of subsystems, $T$ is the time horizon, and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms in $n$
arXiv Detail & Related papers (2021-08-18T04:45:34Z) - Non-stationary Linear Bandits Revisited [26.082923174615495]
Non-stationary linear bandits are a variant of linear bandits with a time-varying underlying regression parameter.
We prove an $\widetilde{O}(T^{3/4}(1+P_T)^{1/4})$ dynamic regret bound for these algorithms, slightly worse than the anticipated rate.
arXiv Detail & Related papers (2021-03-09T10:07:17Z) - Regret and Belief Complexity Trade-off in Gaussian Process Bandits via
Information Thresholding [42.669970064867556]
We show how to characterize the trade-off between regret bounds of GP bandit algorithms and complexity of the posterior distributions.
We observe state of the art accuracy and complexity trade-offs for GP bandit algorithms applied to global optimization.
arXiv Detail & Related papers (2020-03-23T21:05:15Z) - Naive Exploration is Optimal for Online LQR [49.681825576239355]
We show that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
Our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been
arXiv Detail & Related papers (2020-01-27T03:44:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.