Optimism Stabilizes Thompson Sampling for Adaptive Inference
- URL: http://arxiv.org/abs/2602.06014v1
- Date: Thu, 05 Feb 2026 18:52:54 GMT
- Title: Optimism Stabilizes Thompson Sampling for Adaptive Inference
- Authors: Shunxing Yan, Han Zhong
- Abstract summary: Thompson sampling (TS) is widely used for multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. We study this phenomenon in the $K$-armed Gaussian bandit and identify \emph{optimism} as a key mechanism for restoring \emph{stability}. We prove that variance-inflated TS \citep{halder2025stable} is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal.
- Score: 9.558593674952654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the $K$-armed Gaussian bandit and identify \emph{optimism} as a key mechanism for restoring \emph{stability}, a sufficient condition for valid asymptotic inference requiring each arm's pull count to concentrate around a deterministic scale. First, we prove that variance-inflated TS \citep{halder2025stable} is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal. This resolves the open question raised by \citet{halder2025stable} through extending their results from the two-armed setting to the general $K$-armed setting. Second, we analyze an alternative optimistic modification that keeps the posterior variance unchanged but adds an explicit mean bonus to posterior mean, and establish the same stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid inference in multi-armed bandits, while incurring only a mild additional regret cost.
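The two optimistic modifications described in the abstract can be illustrated with a short simulation. This is a minimal sketch under assumed simplifications: unit reward noise, a flat prior, a constant inflation factor, and a $\sqrt{\cdot}$-scaled mean bonus; the paper's actual inflation and bonus schedules are not specified here, so these choices are illustrative, not the authors' exact algorithms.

```python
import numpy as np

def run_bandit(K=3, T=2000, variant="inflate", inflation=2.0,
               bonus_scale=1.0, seed=0):
    """Simulate a K-armed Gaussian bandit under two optimistic
    Thompson-type rules.

    variant="inflate": sample each arm index from an N(mean, inflation * var)
                       posterior (variance-inflated TS).
    variant="bonus":   sample from the unmodified N(mean, var) posterior but
                       add an explicit optimism bonus to the posterior mean.

    Assumes unit reward noise and an improper flat prior; the inflation
    factor and bonus schedule are illustrative choices only.
    """
    rng = np.random.default_rng(seed)
    true_means = rng.normal(0.0, 1.0, size=K)
    sums = np.zeros(K)
    counts = np.zeros(K)

    # Pull every arm once so each posterior is well defined.
    for a in range(K):
        sums[a] += true_means[a] + rng.normal()
        counts[a] += 1

    for t in range(K, T):
        means = sums / counts
        var = 1.0 / counts  # posterior variance with unit noise, flat prior
        if variant == "inflate":
            samples = rng.normal(means, np.sqrt(inflation * var))
        else:  # "bonus": unchanged posterior variance, explicit mean bonus
            samples = rng.normal(means, np.sqrt(var)) + bonus_scale * np.sqrt(var)
        a = int(np.argmax(samples))
        sums[a] += true_means[a] + rng.normal()
        counts[a] += 1
    return counts, true_means
```

Comparing `counts` across the two variants (and against plain TS, `inflation=1.0`, `bonus_scale=0.0`) gives a rough sense of how optimism changes the concentration of arm-pull counts, which is the stability property the paper analyzes.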
Related papers
- Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning [61.07540493350384]
Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions. We show that for any prediction risk, the optimally mixed student improves upon the ridge teacher at every regularization level. We propose a consistent one-shot tuning method that estimates the optimal mixing weight without grid search, sample splitting, or refitting.
arXiv Detail & Related papers (2026-02-19T17:21:15Z) - Thompson sampling: Precise arm-pull dynamics and adaptive inference [0.7614628596146601]
We study the precise arm-pull dynamics in another canonical class of Thompson-type algorithms. We show that the arm-pull count is deterministic if and only if the arm is suboptimal or is the unique optimal arm. We show that normalized arm means obey the same dichotomy, with limits for stable arms and a semi-universal, non-Gaussian limit for unstable arms.
arXiv Detail & Related papers (2026-01-29T00:12:04Z) - Continuous K-Max Bandits [54.21533414838677]
We study the $K$-Max multi-armed bandits problem with continuous outcome distributions and weak value-index feedback. This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc. Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds.
arXiv Detail & Related papers (2025-02-19T06:37:37Z) - Achieving Exponential Asymptotic Optimality in Average-Reward Restless Bandits without Global Attractor Assumption [11.41663079285674]
We propose a novel \emph{two-set} policy that maintains two dynamic subsets of arms.
We show that our two-set policy is asymptotically optimal, with an $O(\exp(-C N))$ optimality gap for an $N$-armed problem.
arXiv Detail & Related papers (2024-05-28T07:08:29Z) - Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems when no information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Thompson Exploration with Best Challenger Rule in Best Arm Identification [59.02170783023547]
We study the fixed-confidence best arm identification problem in the bandit framework. We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - Adaptive Data Analysis in a Balanced Adversarial Model [26.58630744414181]
In adaptive data analysis, a mechanism gets $n$ i.i.d. samples from an unknown distribution $D$ and is required to provide accurate estimates.
We consider more restricted adversaries, called \emph{balanced}, where each such adversary consists of two separate algorithms.
We show that these stronger hardness assumptions are unavoidable, in the sense that any computationally bounded \emph{balanced} adversary implies the existence of public-key cryptography.
arXiv Detail & Related papers (2023-05-24T15:08:05Z) - Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits [88.21288104408556]
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family.
We propose a Thompson sampling algorithm, termed Expulli, which uses a novel sampling distribution to avoid under-estimating the optimal arm.
arXiv Detail & Related papers (2022-06-07T18:08:21Z) - Algorithms for Adaptive Experiments that Trade-off Statistical Analysis with Reward: Combining Uniform Random Assignment and Reward Maximization [50.725191156128645]
Multi-armed bandit algorithms like Thompson Sampling can be used to conduct adaptive experiments.
We present simulations for 2-arm experiments exploring two algorithms that combine the statistical benefits of uniform randomization with reward maximization.
arXiv Detail & Related papers (2021-12-15T22:11:58Z) - Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance [14.06947898164194]
Heavy tails emerge in stochastic gradient descent (SGD) in various scenarios.
We provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance.
Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum.
arXiv Detail & Related papers (2021-02-20T13:45:11Z) - Weak Signal Asymptotics for Sequentially Randomized Experiments [2.28438857884398]
We study a class of sequentially randomized experiments, including those that arise in solving multi-armed bandit problems.
We show that the sample paths of a class of sequentially randomized experiments converge weakly to a diffusion limit.
We show that all sequential experiments whose randomization probabilities have a Lipschitz-continuous dependence on the observed data suffer from sub-optimal regret when reward gaps are relatively large.
arXiv Detail & Related papers (2021-01-25T02:20:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.