Discounted Thompson Sampling for Non-Stationary Bandit Problems
- URL: http://arxiv.org/abs/2305.10718v2
- Date: Mon, 22 May 2023 07:36:52 GMT
- Title: Discounted Thompson Sampling for Non-Stationary Bandit Problems
- Authors: Han Qi, Yue Wang, Li Zhu
- Abstract summary: Non-stationary multi-armed bandit (NS-MAB) problems have recently received significant attention.
We propose Discounted Thompson Sampling (DS-TS) with Gaussian priors to address both non-stationary settings (abruptly changing and smoothly changing rewards).
Our algorithm passively adapts to changes by incorporating a discount factor into Thompson Sampling.
- Score: 13.656518163592349
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-stationary multi-armed bandit (NS-MAB) problems have recently received
significant attention. NS-MAB problems are typically modelled in two scenarios: abruptly
changing, where reward distributions remain constant for a certain period and
change at unknown time steps, and smoothly changing, where reward distributions
evolve smoothly based on unknown dynamics. In this paper, we propose Discounted
Thompson Sampling (DS-TS) with Gaussian priors to address both non-stationary
settings. Our algorithm passively adapts to changes by incorporating a
discount factor into Thompson Sampling. The DS-TS approach has previously been
validated experimentally, but an analysis of its regret upper bound has been lacking. Under
mild assumptions, we show that DS-TS with Gaussian priors can achieve nearly
optimal regret bound on the order of $\tilde{O}(\sqrt{TB_T})$ for abruptly
changing and $\tilde{O}(T^{\beta})$ for smoothly changing, where $T$ is the
number of time steps, $B_T$ is the number of breakpoints, $\beta$ is associated
with the smoothly changing environment and $\tilde{O}$ hides the parameters
independent of $T$ as well as logarithmic terms. Furthermore, empirical
comparisons between DS-TS and other non-stationary bandit algorithms
demonstrate its competitive performance. Specifically, when prior knowledge of
the maximum expected reward is available, DS-TS has the potential to outperform
state-of-the-art algorithms.
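For intuition, below is a minimal sketch of the discounting idea described in the abstract: per-arm statistics decay geometrically each round, and posterior samples are drawn from Gaussians centred on the discounted means. The class name, the $\sigma/\sqrt{N+1}$ posterior width, and the default parameters are illustrative assumptions, not the authors' exact update rule.

```python
import numpy as np

class DiscountedTS:
    """Sketch of discounted Thompson Sampling with Gaussian posteriors.

    gamma is the discount factor and sigma an assumed reward-noise scale;
    both are illustrative, not the paper's exact parameterization.
    """

    def __init__(self, n_arms, gamma=0.99, sigma=1.0, seed=0):
        self.gamma, self.sigma = gamma, sigma
        self.counts = np.zeros(n_arms)   # gamma-discounted pull counts
        self.sums = np.zeros(n_arms)     # gamma-discounted reward sums
        self.rng = np.random.default_rng(seed)

    def select_arm(self):
        means = self.sums / np.maximum(self.counts, 1e-12)
        stds = self.sigma / np.sqrt(self.counts + 1.0)
        return int(np.argmax(self.rng.normal(means, stds)))

    def update(self, arm, reward):
        # Passive adaptation: every statistic decays by gamma each round,
        # so observations made before a change point are gradually forgotten.
        self.counts *= self.gamma
        self.sums *= self.gamma
        self.counts[arm] += 1.0
        self.sums[arm] += reward
```

Because every statistic is multiplied by gamma each round, arms that have not been pulled recently regain a wide posterior, which is what lets the sampler re-explore after an abrupt or gradual change.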
Related papers
- Self-Ensembling Gaussian Splatting for Few-shot Novel View Synthesis [55.561961365113554]
3D Gaussian Splatting (3DGS) has demonstrated remarkable effectiveness for novel view synthesis (NVS).
However, the 3DGS model tends to overfit when trained with sparse posed views, limiting its generalization capacity for broader pose variations.
We introduce a self-ensembling Gaussian Splatting (SE-GS) approach to alleviate the overfitting problem.
arXiv Detail & Related papers (2024-10-31T18:43:48Z)
- Efficient and Adaptive Posterior Sampling Algorithms for Bandits [5.050520326139362]
We study Thompson Sampling-based algorithms for bandits with bounded rewards.
We propose two parameterized Thompson Sampling-based algorithms.
Both algorithms achieve an $O\left(K\ln^{\alpha+1}(T)/\Delta\right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $\Delta$ denotes the single-round performance loss when pulling a sub-optimal arm.
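As a point of reference, a generic Thompson Sampling baseline for bounded rewards (the standard Beta-Bernoulli scheme with binarized rewards, not the paper's two parameterized variants) might look like the sketch below; all names and defaults are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 5
alpha = np.ones(n_arms)   # Beta prior successes
beta = np.ones(n_arms)    # Beta prior failures

def ts_select_arm():
    # Draw one sample per arm from its Beta posterior; pull the best sample.
    samples = rng.beta(alpha, beta)
    return int(np.argmax(samples))

def ts_update(arm, reward):
    # Rewards in [0, 1] are binarized so the Beta update stays conjugate
    # (the usual Bernoulli trick); this is a baseline, not the paper's method.
    if rng.random() < reward:
        alpha[arm] += 1.0
    else:
        beta[arm] += 1.0
```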
arXiv Detail & Related papers (2024-05-02T05:24:28Z)
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted-$\text{OFUL}^{+}$ and Restarted $\text{SAVE}^{+}$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z)
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning [34.4255062106615]
Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance.
We propose batched $\textit{Langevin Thompson Sampling}$ algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches.
Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $\mathcal{O}(\log T)$ for MABs, and $\mathcal{O}(\sqrt{T})$ for RL.
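For illustration only, one way such an approximate posterior sample could be produced with unadjusted Langevin dynamics for a single arm (Gaussian prior and likelihood assumed here, with illustrative step size and iteration count; this is not the paper's batched algorithm) is:

```python
import numpy as np

def langevin_posterior_sample(rewards, prior_mean=0.0, prior_var=1.0,
                              noise_var=1.0, step=1e-2, n_steps=50, seed=0):
    """Approximate posterior sample of one arm's mean reward via unadjusted
    Langevin dynamics (Gaussian prior/likelihood chosen for illustration)."""
    rng = np.random.default_rng(seed)
    n, s = len(rewards), float(np.sum(rewards))
    theta = prior_mean
    for _ in range(n_steps):
        # Gradient of the log posterior of a Gaussian mean with known variance.
        grad = (s - n * theta) / noise_var - (theta - prior_mean) / prior_var
        theta += 0.5 * step * grad + np.sqrt(step) * rng.standard_normal()
    return theta
```

A TS-style agent would draw one such sample per arm and pull the argmax; running these MCMC updates only at batch boundaries is what the summary refers to as logarithmic communication cost.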
arXiv Detail & Related papers (2023-06-15T01:16:29Z)
- Double Thompson Sampling in Finite stochastic Games [10.559241770674724]
We consider the trade-off between exploration and exploitation in finite discounted Markov Decision Processes.
We propose a double Thompson sampling reinforcement learning algorithm (DTS) to solve this kind of problem.
arXiv Detail & Related papers (2022-02-21T06:11:51Z)
- Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_{\star}\sqrt{SAK})$, where $K$ is the number of episodes, $S$ is the number of states, and $A$ is the number of actions.
arXiv Detail & Related papers (2021-04-22T17:20:48Z)
- Stochastic Shortest Path with Adversarially Changing Costs [57.90236104782219]
Stochastic shortest path (SSP) is a well-known problem in planning and control.
We present the adversarial SSP model that also accounts for adversarial changes in the costs over time.
We are the first to consider this natural setting of adversarial SSP and obtain sub-linear regret for it.
arXiv Detail & Related papers (2020-06-20T12:10:35Z)
- An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic Gradient Descent and Thompson Sampling [83.48992319018147]
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward.
A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant.
In this work, we show that online SGD can be applied to the generalized linear bandit problem.
The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information, achieves $\tilde{O}(\sqrt{T})$ regret while keeping the per-step time and memory complexity constant.
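A rough sketch of this single-step-SGD-plus-sampling idea for a logistic (generalized linear) bandit is below; the perturbation scale, learning rate, and function names are assumptions for illustration, not the exact SGD-TS procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def select_action(theta, contexts, nu=0.5, rng=None):
    # Thompson-style exploration: perturb the current GLM estimate, then
    # score every candidate context and pick the highest.
    rng = rng or np.random.default_rng()
    theta_tilde = theta + nu * rng.standard_normal(theta.shape)
    return int(np.argmax(contexts @ theta_tilde))

def sgd_step(theta, x, reward, eta=0.1):
    # A single online SGD update on the logistic loss for the chosen
    # context x and observed reward, so the per-round cost stays constant.
    grad = (sigmoid(x @ theta) - reward) * x
    return theta - eta * grad
```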
arXiv Detail & Related papers (2020-06-07T01:12:39Z)