Only Pay for What Is Uncertain: Variance-Adaptive Thompson Sampling
- URL: http://arxiv.org/abs/2303.09033v2
- Date: Thu, 12 Oct 2023 10:34:03 GMT
- Title: Only Pay for What Is Uncertain: Variance-Adaptive Thompson Sampling
- Authors: Aadirupa Saha and Branislav Kveton
- Abstract summary: Most bandit algorithms assume that the reward variances or their upper bounds are known, and that they are the same for all arms.
This motivated prior works on variance-adaptive frequentist algorithms, which have strong instance-dependent regret bounds.
We lay foundations for the Bayesian setting, which incorporates prior knowledge.
- Score: 44.921905700729766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most bandit algorithms assume that the reward variances or their upper bounds
are known, and that they are the same for all arms. This naturally leads to
suboptimal performance and higher regret due to variance overestimation. On the
other hand, underestimated reward variances may lead to linear regret due to
committing early to a suboptimal arm. This motivated prior works on
variance-adaptive frequentist algorithms, which have strong instance-dependent
regret bounds but cannot incorporate prior knowledge on reward variances. We
lay foundations for the Bayesian setting, which incorporates prior knowledge.
This results in lower regret in practice, due to using the prior in the
algorithm design, and also improved regret guarantees. Specifically, we study
Gaussian bandits with unknown heterogeneous reward variances, and develop a
Thompson sampling algorithm with prior-dependent Bayes regret bounds. We
achieve lower regret with lower reward variances and more informative priors on
them, which is precisely why we pay only for what is uncertain. This is the
first result of its kind. Finally, we corroborate our theory with extensive
experiments, which show the superiority of our variance-adaptive Bayesian
algorithm over prior frequentist approaches. We also show that our approach is
robust to model misspecification and can be applied with estimated priors.
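The abstract does not spell out the posterior updates, so here is a minimal sketch of the idea under stated assumptions: a variance-adaptive Thompson sampler for Gaussian rewards that places a standard Normal-Inverse-Gamma conjugate prior on each arm, so that both the mean and the variance are drawn from the posterior at every round. The class name `VarAdaptiveTS` and the hyperparameter values are illustrative assumptions, not the paper's exact algorithm or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)


class VarAdaptiveTS:
    """Thompson sampling for Gaussian bandits with unknown, heterogeneous
    reward variances, using a per-arm Normal-Inverse-Gamma conjugate prior.
    A generic conjugate-prior sketch, not the paper's exact algorithm."""

    def __init__(self, n_arms, mu0=0.0, lam0=1.0, alpha0=2.0, beta0=1.0):
        # Prior: sigma_a^2 ~ InvGamma(alpha0, beta0),
        #        mu_a | sigma_a^2 ~ N(mu0, sigma_a^2 / lam0).
        # More informative variance priors shrink the sampled variances and
        # hence the exploration spent on that arm.
        self.n_arms = n_arms
        self.mu0, self.lam0 = mu0, lam0
        self.alpha0, self.beta0 = alpha0, beta0
        self.n = np.zeros(n_arms)    # number of pulls per arm
        self.s = np.zeros(n_arms)    # sum of rewards per arm
        self.ss = np.zeros(n_arms)   # sum of squared rewards per arm

    def _posterior(self, a):
        n = self.n[a]
        xbar = self.s[a] / n if n > 0 else 0.0
        within = self.ss[a] - n * xbar ** 2          # sum of squared deviations
        lam_n = self.lam0 + n
        mu_n = (self.lam0 * self.mu0 + self.s[a]) / lam_n
        alpha_n = self.alpha0 + n / 2.0
        beta_n = (self.beta0 + 0.5 * within
                  + self.lam0 * n * (xbar - self.mu0) ** 2 / (2.0 * lam_n))
        return mu_n, lam_n, alpha_n, beta_n

    def select_arm(self):
        draws = np.empty(self.n_arms)
        for a in range(self.n_arms):
            mu_n, lam_n, alpha_n, beta_n = self._posterior(a)
            var = 1.0 / rng.gamma(alpha_n, 1.0 / beta_n)       # sigma^2 ~ InvGamma(alpha_n, beta_n)
            draws[a] = rng.normal(mu_n, np.sqrt(var / lam_n))  # mu | sigma^2
        return int(np.argmax(draws))

    def update(self, arm, reward):
        self.n[arm] += 1
        self.s[arm] += reward
        self.ss[arm] += reward ** 2


# Toy run: the arms differ in both mean and noise level; low-variance arms
# are resolved after a few pulls, so exploration is spent on the noisy ones.
true_mu, true_sigma = np.array([0.4, 0.5, 0.6]), np.array([0.1, 1.0, 0.5])
agent = VarAdaptiveTS(n_arms=3)
for t in range(2000):
    a = agent.select_arm()
    agent.update(a, rng.normal(true_mu[a], true_sigma[a]))
print("pulls per arm:", agent.n)
```

With this conjugate model, an arm whose variance prior is tight or whose observed rewards have low spread produces narrow posterior samples and stops being explored early, which is the informal sense in which the algorithm pays only for what is uncertain.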
Related papers
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Noise-Adaptive Confidence Sets for Linear Bandits and Application to Bayesian Optimization [15.275864909088577]
Adapting to a priori unknown noise level is a very important but challenging problem in sequential decision-making.
We propose a novel confidence set that is 'semi-adaptive' to the unknown sub-Gaussian parameter $\sigma_*^2$.
For bounded rewards, we propose a novel variance-adaptive confidence set that has much improved numerical performance upon prior art.
arXiv Detail & Related papers (2024-02-12T00:19:09Z) - Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound of $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret (a worked special case is sketched after this list).
arXiv Detail & Related papers (2023-10-02T08:15:52Z) - Fixed-Budget Best-Arm Identification with Heterogeneous Reward Variances [12.00630538470713]
We study the problem of best-arm identification (BAI) in the fixed-budget setting with heterogeneous reward variances.
We propose two variance-adaptive BAI algorithms for this setting: SHVar for known reward variances and SHAdaVar for unknown reward variances.
arXiv Detail & Related papers (2023-06-13T05:41:38Z) - Suboptimal Performance of the Bayes Optimal Algorithm in Frequentist Best Arm Identification [5.176134438571082]
We consider the fixed-budget best arm identification problem with rewards following normal distributions.
In this problem, the forecaster is given $K$ arms (or treatments) and $T$ time steps.
The algorithm's performance is evaluated by simple regret, reflecting the quality of the estimated best arm.
arXiv Detail & Related papers (2022-02-10T17:50:26Z) - On Slowly-varying Non-stationary Bandits [25.305949034527202]
We consider dynamic regret in non-stationary bandits with a slowly varying property.
We establish the first instance-dependent regret upper bound for slowly varying non-stationary bandits.
We show that our algorithm is essentially minimax optimal.
arXiv Detail & Related papers (2021-10-25T12:56:19Z) - Bayesian decision-making under misspecified priors with applications to
meta-learning [64.38020203019013]
Thompson sampling and other sequential decision-making algorithms are popular approaches to tackle explore/exploit trade-offs in contextual bandits.
We show that performance degrades gracefully with misspecified priors.
arXiv Detail & Related papers (2021-07-03T23:17:26Z) - Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
arXiv Detail & Related papers (2020-08-10T08:30:52Z) - Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z)
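On the variance-aware dueling-bandit bound quoted above (entry "Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits"), the deterministic special case follows by direct substitution, assuming $\sigma_t^2$ denotes the per-round variance of the binary comparison outcome:

```latex
% Deterministic comparisons have zero outcome variance, \sigma_t^2 = 0 for all t, so
\tilde{O}\Big(d\sqrt{\textstyle\sum_{t=1}^{T}\sigma_t^2} + d\Big)
  = \tilde{O}\big(d\sqrt{0} + d\big)
  = \tilde{O}(d).
```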
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.