A General Theory of the Stochastic Linear Bandit and Its Applications
- URL: http://arxiv.org/abs/2002.05152v4
- Date: Thu, 31 Mar 2022 22:56:43 GMT
- Title: A General Theory of the Stochastic Linear Bandit and Its Applications
- Authors: Nima Hamidi, Mohsen Bayati
- Abstract summary: We introduce a general analysis framework and a family of algorithms for the linear bandit problem.
Our new notion of optimism in expectation gives rise to a new algorithm, called sieved greedy (SG), that reduces the overexploration problem in OFUL.
In addition to proving that SG is theoretically rate optimal, our empirical simulations show that SG outperforms existing benchmarks such as greedy, OFUL, and TS.
- Score: 8.071506311915398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent growing adoption of experimentation in practice has led to a surge of
attention to multiarmed bandits as a technique to reduce the opportunity cost
of online experiments. In this setting, a decision-maker sequentially chooses
among a set of given actions, observes their noisy rewards, and aims to
maximize her cumulative expected reward (or minimize regret) over a horizon of
length $T$. In this paper, we introduce a general analysis framework and a
family of algorithms for the stochastic linear bandit problem that includes
well-known algorithms such as the
optimism-in-the-face-of-uncertainty-linear-bandit (OFUL) and Thompson sampling
(TS) as special cases. Our analysis technique bridges several streams of prior
literature and yields a number of new results. First, our new notion of
optimism in expectation gives rise to a new algorithm, called sieved greedy
(SG), that reduces the overexploration problem in OFUL. SG uses the data to
discard actions with relatively low uncertainty and then chooses one of the
remaining actions greedily. In addition to proving that SG is theoretically
rate optimal, our empirical simulations show that SG outperforms existing
benchmarks such as greedy, OFUL, and TS. The second application of our general
framework yields (to the best of our knowledge) the first polylogarithmic (in $T$)
regret bounds for OFUL and TS, under conditions similar to those of
Goldenshluger and Zeevi (2013). Finally, we obtain sharper regret bounds for
the $k$-armed contextual MABs by a factor of $\sqrt{k}$.
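To make the sieved greedy idea concrete, below is a minimal sketch of one decision round, based only on the description in the abstract: form a ridge-regression estimate of the reward parameter, discard actions whose uncertainty is relatively low, and pick greedily among the survivors. The relative threshold `kappa` and the particular uncertainty measure are illustrative assumptions, not the paper's exact sieving rule.

```python
import numpy as np

def sieved_greedy_step(actions, V, b, kappa=0.5):
    """One round of a sieved-greedy-style rule for a stochastic linear bandit.

    actions : (k, d) array of candidate action vectors for this round
    V       : (d, d) ridge design matrix, lambda * I + sum_s x_s x_s^T
    b       : (d,) response vector, sum_s r_s x_s
    kappa   : sieving level in (0, 1]; the exact sieve used in the paper is not
              given in the abstract, so this relative rule is an assumption.
    """
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b  # ridge estimate of the unknown reward parameter
    # Uncertainty of each action: ||x||_{V^{-1}}, the usual confidence-width quantity.
    widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    # Sieve: discard actions whose uncertainty is relatively low, keeping only
    # those within a factor kappa of the most uncertain action.
    survivors = np.where(widths >= kappa * widths.max())[0]
    # Greedy choice among the survivors (kappa = 0 keeps everything and recovers pure greedy).
    return survivors[np.argmax(actions[survivors] @ theta_hat)]

def update_statistics(V, b, x, r):
    """Rank-one update after playing action x and observing noisy reward r."""
    return V + np.outer(x, x), b + r * x
```

Per the abstract, the intent of the sieve is to keep exploration focused on actions that are still genuinely uncertain and act greedily otherwise, in contrast to OFUL, whose optimism bonus on every action can lead to overexploration.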
Related papers
- Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed in Lu et al. (2021) under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical limitations of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve a regret bound of $\tilde{O}(\sqrt{(d_1+d_2)^{3/2} M r^{3/2} T})$, while G-ESTS can achieve the $\tilde{O}$ ...
arXiv Detail & Related papers (2024-01-14T14:14:19Z) - Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective
and Improved Bounds [12.699376765058137]
Stochastic gradient descent (SGD) is perhaps the most prevalent optimization method in modern machine learning.
It is only very recently that SGD with sampling without replacement -- shuffled SGD -- has been analyzed.
We prove fine-grained complexity bounds that depend on the data matrix and are never worse than what is predicted by the existing bounds.
arXiv Detail & Related papers (2023-06-21T18:14:44Z) - Langevin Thompson Sampling with Logarithmic Communication: Bandits and
Reinforcement Learning [34.4255062106615]
Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance.
We propose batched Langevin Thompson Sampling algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches.
Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $\mathcal{O}(\log T)$ for MABs, and $\mathcal{O}(\sqrt{T})$ for RL.
arXiv Detail & Related papers (2023-06-15T01:16:29Z) - Finite-Time Regret of Thompson Sampling Algorithms for Exponential
Family Multi-Armed Bandits [88.21288104408556]
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family.
We propose a Thompson sampling algorithm, termed Expulli, which uses a novel sampling distribution to avoid under-estimation of the optimal arm.
arXiv Detail & Related papers (2022-06-07T18:08:21Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We also analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - Misspecified Gaussian Process Bandit Optimization [59.30399661155574]
Kernelized bandit algorithms have shown strong empirical and theoretical performance for this problem.
We introduce a misspecified kernelized bandit setting where the unknown function can be $\epsilon$-uniformly approximated by a function with a bounded norm in some Reproducing Kernel Hilbert Space (RKHS).
We show that our algorithm achieves optimal dependence on $\epsilon$ with no prior knowledge of misspecification.
arXiv Detail & Related papers (2021-11-09T09:00:02Z) - Bayesian decision-making under misspecified priors with applications to
meta-learning [64.38020203019013]
Thompson sampling and other sequential decision-making algorithms are popular approaches to tackle explore/exploit trade-offs in contextual bandits.
We show that performance degrades gracefully with misspecified priors.
arXiv Detail & Related papers (2021-07-03T23:17:26Z) - Lenient Regret and Good-Action Identification in Gaussian Process
Bandits [43.03669155559218]
We study the problem of Gaussian process (GP) bandits under relaxed optimization criteria stating that any function value above a certain threshold is "good enough".
On the practical side, we consider the problem of finding a single "good action" according to a known pre-specified threshold, and introduce several good-action identification algorithms that exploit knowledge of the threshold.
arXiv Detail & Related papers (2021-02-11T01:16:58Z) - Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed
Rewards [24.983866845065926]
We consider multi-armed bandits with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $\nu_p$ for $1 < p \leq 2$.
We propose a novel robust estimator which does not require $nu_p$ as prior information.
We show that the error probability of the proposed estimator decays exponentially fast.
arXiv Detail & Related papers (2020-10-24T10:44:02Z) - Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of the training set size and number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z)