Discounted Thompson Sampling for Non-Stationary Bandit Problems
- URL: http://arxiv.org/abs/2305.10718v2
- Date: Mon, 22 May 2023 07:36:52 GMT
- Title: Discounted Thompson Sampling for Non-Stationary Bandit Problems
- Authors: Han Qi, Yue Wang, Li Zhu
- Abstract summary: Non-stationary multi-armed bandit (NS-MAB) problems have recently received significant attention.
We propose Discounted Thompson Sampling (DS-TS) with Gaussian priors to address both non-stationary settings.
Our algorithm passively adapts to changes by incorporating a discounted factor into Thompson Sampling.
- Score: 13.656518163592349
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-stationary multi-armed bandit (NS-MAB) problems have recently received
significant attention. NS-MAB are typically modelled in two scenarios: abruptly
changing, where reward distributions remain constant for a certain period and
change at unknown time steps, and smoothly changing, where reward distributions
evolve smoothly based on unknown dynamics. In this paper, we propose Discounted
Thompson Sampling (DS-TS) with Gaussian priors to address both non-stationary
settings. Our algorithm passively adapts to changes by incorporating a
discounted factor into Thompson Sampling. DS-TS method has been experimentally
validated, but analysis of the regret upper bound is currently lacking. Under
mild assumptions, we show that DS-TS with Gaussian priors can achieve nearly
optimal regret bound on the order of $\tilde{O}(\sqrt{TB_T})$ for abruptly
changing and $\tilde{O}(T^{\beta})$ for smoothly changing, where $T$ is the
number of time steps, $B_T$ is the number of breakpoints, $\beta$ is associated
with the smoothly changing environment and $\tilde{O}$ hides the parameters
independent of $T$ as well as logarithmic terms. Furthermore, empirical
comparisons between DS-TS and other non-stationary bandit algorithms
demonstrate its competitive performance. Specifically, when prior knowledge of
the maximum expected reward is available, DS-TS has the potential to outperform
state-of-the-art algorithms.
Related papers
- Efficient and Adaptive Posterior Sampling Algorithms for Bandits [5.050520326139362]
We study Thompson Sampling-based algorithms for bandits with bounded rewards.
We propose two parameterized Thompson Sampling-based algorithms.
Both algorithms achieve $O left(Klnalpha+1(T)/Delta right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $Delta$ denotes the single round performance loss when pulling a sub-optimal arm.
arXiv Detail & Related papers (2024-05-02T05:24:28Z) - Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as the $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$textOFUL+$ and Restarted $textSAVE+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with a environment and seeks to reach certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Langevin Thompson Sampling with Logarithmic Communication: Bandits and
Reinforcement Learning [34.4255062106615]
Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance.
We propose batched $textitLangevin Thompson Sampling$ algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches.
Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $mathcalO(log T)$ for MABs, and $mathcalO(sqrtT)$ for RL.
arXiv Detail & Related papers (2023-06-15T01:16:29Z) - Contextual Combinatorial Bandits with Probabilistically Triggered Arms [55.9237004478033]
We study contextual bandits with probabilistically triggered arms (C$2$MAB-T) under a variety of smoothness conditions.
Under the triggering modulated (TPM) condition, we devise the C$2$-UC-T algorithm and derive a regret bound $tildeO(dsqrtT)$.
arXiv Detail & Related papers (2023-03-30T02:51:00Z) - Double Thompson Sampling in Finite stochastic Games [10.559241770674724]
We consider the trade-off problem between exploration and exploitation under finite discounted Markov Decision Process.
We propose a double Thompson sampling reinforcement learning algorithm(DTS) to solve this kind of problem.
arXiv Detail & Related papers (2022-02-21T06:11:51Z) - Stochastic Shortest Path: Minimax, Parameter-Free and Towards
Horizon-Free Regret [144.6358229217845]
We study the problem of learning in the shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state.
We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus.
We prove that EB-SSP achieves the minimax regret rate $widetildeO(B_star sqrtS A K)$, where $K$ is the number of episodes, $S$ is the number of states, $A$
arXiv Detail & Related papers (2021-04-22T17:20:48Z) - Stochastic Shortest Path with Adversarially Changing Costs [57.90236104782219]
shortest path (SSP) is a well-known problem in planning and control.
We present the adversarial SSP model that also accounts for adversarial changes in the costs over time.
We are the first to consider this natural setting of adversarial SSP and obtain sub-linear regret for it.
arXiv Detail & Related papers (2020-06-20T12:10:35Z) - An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic
Gradient Descent and Thompson Sampling [83.48992319018147]
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward.
A natural way to resolve this problem is to apply online gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant.
In this work, we show that online SGD can be applied to the generalized linear bandit problem.
The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information, achieves $tildeO(sqrtT)$ regret with the total time complexity that
arXiv Detail & Related papers (2020-06-07T01:12:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.