On Frequentist Regret of Linear Thompson Sampling
- URL: http://arxiv.org/abs/2006.06790v3
- Date: Thu, 20 Apr 2023 21:35:31 GMT
- Title: On Frequentist Regret of Linear Thompson Sampling
- Authors: Nima Hamidi, Mohsen Bayati
- Abstract summary: This paper studies the linear bandit problem, where a decision-maker chooses actions from possibly time-dependent sets of vectors in $\mathbb{R}^d$ and receives noisy rewards.
The objective is to minimize regret, the difference between the cumulative expected reward of the decision-maker and that of an oracle with access to the expected reward of each action, over a sequence of $T$ decisions.
- Score: 8.071506311915398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the stochastic linear bandit problem, where a
decision-maker chooses actions from possibly time-dependent sets of vectors in
$\mathbb{R}^d$ and receives noisy rewards. The objective is to minimize regret,
the difference between the cumulative expected reward of the decision-maker and
that of an oracle with access to the expected reward of each action, over a
sequence of $T$ decisions. Linear Thompson Sampling (LinTS) is a popular
Bayesian heuristic, supported by theoretical analysis that shows its Bayesian
regret is bounded by $\widetilde{\mathcal{O}}(d\sqrt{T})$, matching minimax
lower bounds. However, previous studies demonstrate that the frequentist regret
bound for LinTS is $\widetilde{\mathcal{O}}(d\sqrt{dT})$, which requires
posterior variance inflation and is by a factor of $\sqrt{d}$ worse than the
best optimism-based algorithms. We prove that this inflation is fundamental and
that the frequentist bound of $\widetilde{\mathcal{O}}(d\sqrt{dT})$ is the best
possible, by demonstrating a randomization bias phenomenon in LinTS that can
cause linear regret without inflation. We propose a data-driven version of LinTS
that adjusts posterior inflation using observed data, which can achieve minimax
optimal frequentist regret, under additional conditions. Our analysis provides
new insights into LinTS and settles an open problem in the field.
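Since the abstract centers on how LinTS acts by sampling from a (possibly inflated) posterior, a minimal sketch may help fix ideas. The code below is an illustrative assumption, not the authors' implementation: the function name lin_ts, the Gaussian ridge-regression posterior, and the inflation parameter are chosen for exposition only.

```python
# Minimal sketch of Linear Thompson Sampling (LinTS) with a posterior
# variance inflation factor. Illustrative only; names and the reward
# model are assumptions, not the paper's implementation.
import numpy as np


def lin_ts(action_set_fn, reward_fn, T, d,
           noise_sd=1.0, reg=1.0, inflation=1.0, seed=0):
    """Run LinTS for T rounds.

    action_set_fn : t -> (K, d) array of candidate action vectors
    reward_fn     : (t, x) -> observed noisy reward for action x
    inflation     : multiplies the posterior covariance; 1.0 is the
                    plain posterior, larger values inflate it.
    """
    rng = np.random.default_rng(seed)
    V = reg * np.eye(d)              # regularized Gram matrix
    b = np.zeros(d)                  # sum of reward-weighted actions
    rewards = np.zeros(T)
    for t in range(T):
        X = np.asarray(action_set_fn(t))
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b        # ridge-regression point estimate
        cov = inflation * (noise_sd ** 2) * V_inv
        theta_tilde = rng.multivariate_normal(theta_hat, cov)  # posterior sample
        x = X[int(np.argmax(X @ theta_tilde))]  # act greedily on the sample
        r = reward_fn(t, x)
        V += np.outer(x, x)          # rank-one update of the Gram matrix
        b += r * x
        rewards[t] = r
    return rewards
```

In this sketch, setting inflation on the order of $d$ corresponds to the variance-inflated variant whose frequentist regret is $\widetilde{\mathcal{O}}(d\sqrt{dT})$, while inflation = 1 is the un-inflated posterior that, per the abstract, can suffer linear regret in the frequentist setting.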
Related papers
- Improved Bound for Robust Causal Bandits with Linear Models [16.60875994745622]
This paper investigates the robustness of causal bandits in the face of temporal model fluctuations.
The proposed algorithm achieves nearly optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ regret when $C$ is $o(\sqrt{T})$.
arXiv Detail & Related papers (2024-05-13T14:41:28Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Adversarial Contextual Bandits Go Kernelized [21.007410990554522]
We study a generalization of the problem of online learning in adversarial linear contextual bandits by incorporating loss functions that belong to a reproducing kernel Hilbert space.
We propose a new optimistically biased estimator for the loss functions and establish near-optimal regret guarantees.
arXiv Detail & Related papers (2023-10-02T19:59:39Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Estimating Optimal Policy Value in General Linear Contextual Bandits [50.008542459050155]
In many bandit problems, the maximal reward achievable by a policy is often unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$.
arXiv Detail & Related papers (2023-02-19T01:09:24Z) - The Sample Complexity of Approximate Rejection Sampling with Applications to Smoothed Online Learning [29.44582058149344]
We show that the optimal total variation distance as a function of $n$ is given by $\widetilde{\Theta}(\frac{D}{f'(n)})$ over the class of all pairs $\nu,\mu$ with a bounded $f$-divergence.
We then consider an application in the seemingly very different field of smoothed online learning.
arXiv Detail & Related papers (2023-02-09T14:20:14Z) - Optimal Online Generalized Linear Regression with Stochastic Noise and Its Application to Heteroscedastic Bandits [88.6139446295537]
We study the problem of online generalized linear regression in the setting of a generalized linear model with possibly unbounded additive noise.
We provide a sharp analysis of the classical follow-the-regularized-leader (FTRL) algorithm to cope with the label noise.
We propose an algorithm based on FTRL to achieve the first variance-aware regret bound.
arXiv Detail & Related papers (2022-02-28T08:25:26Z) - Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs [12.450760567361531]
In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees.
We present novel analyses that improve their regret bounds significantly.
Our analysis critically relies on a novel 'elliptical potential count' lemma.
arXiv Detail & Related papers (2021-11-05T06:47:27Z) - Fast Rates for the Regret of Offline Reinforcement Learning [69.23654172273085]
We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP).
We show that given any estimate for the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate.
arXiv Detail & Related papers (2021-01-31T16:17:56Z) - Thompson Sampling for Linearly Constrained Bandits [18.477962150017095]
We describe LinConTS, an algorithm for bandits that place a linear constraint on the probability of earning a reward in every round.
We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by $O(\log T)$ for the suboptimal arms.
arXiv Detail & Related papers (2020-04-20T13:06:35Z) - Frequentist Regret Bounds for Randomized Least-Squares Value Iteration [94.47472987987805]
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL).
In this paper, we introduce an optimistically-perturbed variant of the popular randomized least-squares value iteration (RLSVI) algorithm.
Under the assumption that the Markov decision process has low-rank transition dynamics, we prove that the frequentist regret of RLSVI is upper-bounded by $\widetilde{O}(d^2 H^2 \sqrt{T})$, where $d$ is the feature dimension, $H$ is the horizon, and $T$ is the total number of steps.
arXiv Detail & Related papers (2019-11-01T19:48:57Z)