Sliding-Window Thompson Sampling for Non-Stationary Settings
- URL: http://arxiv.org/abs/2409.05181v3
- Date: Sat, 14 Jun 2025 11:39:19 GMT
- Title: Sliding-Window Thompson Sampling for Non-Stationary Settings
- Authors: Marco Fiandri, Alberto Maria Metelli, Francesco Trovò
- Abstract summary: Non-stationary multi-armed bandits (NS-MABs) model sequential decision-making problems. We provide a novel analysis of Thompson sampling-inspired (TS) algorithms for NS-MABs that both corrects and generalizes existing work.
- Score: 20.143361197609934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Non-stationary multi-armed bandits (NS-MABs) model sequential decision-making problems in which the expected rewards of a set of actions, a.k.a. arms, evolve over time. In this paper, we fill a gap in the literature by providing a novel analysis of Thompson sampling-inspired (TS) algorithms for NS-MABs that both corrects and generalizes existing work. Specifically, we study the cumulative frequentist regret of two algorithms based on sliding-window TS approaches with different priors, namely $\textit{Beta-SWTS}$ and $\textit{$\gamma$-SWGTS}$. We derive a unifying regret upper bound for these algorithms that applies to any arbitrary NS-MAB (with either Bernoulli or subgaussian rewards). Our result introduces new indices that capture the inherent sources of complexity in the learning problem. Then, we specialize our general result to two of the most common NS-MAB settings: the $\textit{abruptly changing}$ and the $\textit{smoothly changing}$ environments, showing that it matches state-of-the-art results. Finally, we evaluate the performance of the analyzed algorithms in simulated environments and compare them with state-of-the-art approaches for NS-MABs.
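To make the sliding-window mechanism concrete, below is a minimal Python sketch of a Beta-SWTS-style learner for Bernoulli rewards: each arm's Beta posterior is rebuilt from only the last `window` observations, so rewards older than the window are forgotten. The class name, the `window` parameter, and the uniform Beta(1, 1) prior are illustrative assumptions, not the paper's reference implementation.

```python
from collections import deque
import random

class BetaSWTS:
    """Sliding-window Thompson sampling with Beta priors (Bernoulli rewards)."""

    def __init__(self, n_arms: int, window: int):
        self.n_arms = n_arms
        # Only the last `window` (arm, reward) pairs contribute to the posterior.
        self.history = deque(maxlen=window)

    def select_arm(self) -> int:
        # Rebuild per-arm success/failure counts from the window alone,
        # starting from a Beta(1, 1) prior (an assumption of this sketch).
        alpha = [1] * self.n_arms
        beta = [1] * self.n_arms
        for arm, reward in self.history:
            alpha[arm] += reward
            beta[arm] += 1 - reward
        # Thompson step: sample one mean estimate per arm, pull the argmax.
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(self.n_arms)]
        return max(range(self.n_arms), key=samples.__getitem__)

    def update(self, arm: int, reward: int) -> None:
        # deque(maxlen=...) silently evicts the oldest observation.
        self.history.append((arm, reward))
```

The $\gamma$-SWGTS variant analyzed in the paper follows the same windowed pattern but samples from Gaussian posteriors, which is what makes it applicable to subgaussian rewards.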
Related papers
- Thompson Sampling-like Algorithms for Stochastic Rising Bandits [20.143361197609934]
The stochastic rising rested bandit (SRRB) is a setting in which the arms' expected rewards increase as they are pulled. This work provides novel regret analyses for Thompson sampling-like algorithms in SRRBs, highlighting the challenges and providing new technical tools of independent interest.
arXiv Detail & Related papers (2025-05-17T17:19:07Z) - Perturb-and-Project: Differentially Private Similarities and Marginals [73.98880839337873]
We revisit the input perturbations framework for differential privacy, where noise is added to the input $A \in \mathcal{S}$.
We first design novel efficient algorithms to privately release pair-wise cosine similarities.
We derive a novel algorithm to compute $k$-way marginal queries over $n$ features.
arXiv Detail & Related papers (2024-06-07T12:07:16Z) - Generalized Linear Bandits with Limited Adaptivity [12.112051468737596]
We study the generalized linear contextual bandit problem within the constraints of limited adaptivity.
We present two algorithms, $\texttt{B-GLinCB}$ and $\texttt{RS-GLinCB}$, that address, respectively, two prevalent limited adaptivity settings.
arXiv Detail & Related papers (2024-04-10T08:47:57Z) - Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as the total variation budget $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^{+}$ and Restarted $\text{SAVE}^{+}$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed by Lu et al. (2021) under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical limitations of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve a regret bound of $\tilde{O}(\sqrt{(d_1+d_2)^{3/2} M r^{3/2} T})$, while G-ESTS can achieve the $\tilde{O}$…
arXiv Detail & Related papers (2024-01-14T14:14:19Z) - Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems in which no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Discounted Thompson Sampling for Non-Stationary Bandit Problems [13.656518163592349]
Non-stationary multi-armed bandit (NS-MAB) problems have recently received significant attention.
We propose Discounted Thompson Sampling (DS-TS) with Gaussian priors to address both abruptly and smoothly changing non-stationary settings.
Our algorithm passively adapts to changes by incorporating a discount factor into Thompson Sampling (a minimal sketch of this mechanism appears after this list).
arXiv Detail & Related papers (2023-05-18T05:29:52Z) - Contextual Combinatorial Bandits with Probabilistically Triggered Arms [55.9237004478033]
We study contextual combinatorial bandits with probabilistically triggered arms (C$^2$MAB-T) under a variety of smoothness conditions.
Under the triggering probability modulated (TPM) condition, we devise the C$^2$-UCB-T algorithm and derive a regret bound of $\tilde{O}(d\sqrt{T})$.
arXiv Detail & Related papers (2023-03-30T02:51:00Z) - Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z) - ANACONDA: An Improved Dynamic Regret Algorithm for Adaptive Non-Stationary Dueling Bandits [20.128001589147512]
We study the problem of non-stationary dueling bandits and provide the first adaptive dynamic regret algorithm for this problem.
We show a near-optimal $\tilde{O}(\sqrt{S^{\texttt{CW}} T})$ dynamic regret bound, where $S^{\texttt{CW}}$ is the number of times the Condorcet winner changes in $T$ rounds.
arXiv Detail & Related papers (2022-10-25T20:26:02Z) - Bayesian decision-making under misspecified priors with applications to meta-learning [64.38020203019013]
Thompson sampling and other sequential decision-making algorithms are popular approaches to tackle explore/exploit trade-offs in contextual bandits.
We show that performance degrades gracefully with misspecified priors.
arXiv Detail & Related papers (2021-07-03T23:17:26Z) - Thompson Sampling for Unimodal Bandits [21.514495320038712]
We propose a Thompson Sampling algorithm for $\textit{unimodal}$ bandits, where the expected reward is unimodal over the partially ordered arms.
For Gaussian rewards, the regret of our algorithm is $\mathcal{O}(\log T)$, far better than that of standard Thompson Sampling algorithms.
arXiv Detail & Related papers (2021-06-15T14:40:34Z) - Stochastic Linear Bandits Robust to Adversarial Attacks [117.665995707568]
We provide two variants of a Robust Phased Elimination algorithm, one that knows the corruption budget $C$ and one that does not.
We show that both variants attain near-optimal regret in the non-corrupted case $C = 0$, while each incurs an additional additive corruption term.
In a contextual setting, we show that a simple greedy algorithm is provably robust with a near-optimal additive regret term, despite performing no explicit exploration and not knowing $C$.
arXiv Detail & Related papers (2020-07-07T09:00:57Z) - An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic Gradient Descent and Thompson Sampling [83.48992319018147]
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward.
A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant.
In this work, we show that online SGD can be applied to the generalized linear bandit problem.
The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information, achieves $\tilde{O}(\sqrt{T})$ regret with a total time complexity that…
arXiv Detail & Related papers (2020-06-07T01:12:39Z) - A General Theory of the Stochastic Linear Bandit and Its Applications [8.071506311915398]
We introduce a general analysis framework and a family of algorithms for the linear bandit problem.
Our new notion of optimism in expectation gives rise to a new algorithm, called sieved greedy (SG) that reduces the overexploration problem in OFUL.
In addition to proving that SG is theoretically rate optimal, our empirical simulations show that SG outperforms existing benchmarks such as greedy, OFUL, and TS.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)
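As a contrast to the sliding-window sketch above, here is a minimal sketch of the passive-discounting mechanism summarized in the Discounted Thompson Sampling entry: every past statistic is geometrically down-weighted by a factor $\gamma$ at each step, so the Gaussian posterior tracks a drifting mean. The class name, the default $\gamma$, and the posterior-width heuristic are illustrative assumptions, not the DS-TS authors' code.

```python
import math
import random

class DiscountedGaussianTS:
    """Thompson sampling with Gaussian priors and geometric discounting."""

    def __init__(self, n_arms: int, gamma: float = 0.99, sigma: float = 1.0):
        self.gamma = gamma            # discount factor in (0, 1)
        self.sigma = sigma            # assumed reward noise scale
        self.counts = [0.0] * n_arms  # discounted pull counts
        self.sums = [0.0] * n_arms    # discounted reward sums

    def select_arm(self) -> int:
        samples = []
        for c, s in zip(self.counts, self.sums):
            mean = s / c if c > 0 else 0.0
            # Posterior width shrinks with the discounted pull count;
            # the +1 plays the role of a unit-information prior.
            std = self.sigma / math.sqrt(c + 1.0)
            samples.append(random.gauss(mean, std))
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        # Passive adaptation: discount every arm's statistics, then add
        # the fresh observation at full weight.
        self.counts = [self.gamma * c for c in self.counts]
        self.sums = [self.gamma * s for s in self.sums]
        self.counts[arm] += 1.0
        self.sums[arm] += reward
```

Compared with a hard window, discounting forgets smoothly: an observation $k$ steps old still contributes with weight $\gamma^k$ rather than being dropped outright.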
This list is automatically generated from the titles and abstracts of the papers in this site.