An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic
Gradient Descent and Thompson Sampling
- URL: http://arxiv.org/abs/2006.04012v3
- Date: Sat, 5 Jun 2021 22:35:57 GMT
- Title: An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic
Gradient Descent and Thompson Sampling
- Authors: Qin Ding, Cho-Jui Hsieh, James Sharpnack
- Abstract summary: We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward.
A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant.
In this work, we show that online SGD can be applied to the generalized linear bandit problem.
The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information, achieves $\tilde{O}(\sqrt{T})$ regret with a total time complexity that scales linearly in $T$ and $d$.
- Score: 83.48992319018147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the contextual bandit problem, where a player sequentially makes
decisions based on past observations to maximize the cumulative reward.
Although many algorithms have been proposed for the contextual bandit problem, most of them
rely on finding the maximum likelihood estimator at each iteration, which
requires $O(t)$ time at the $t$-th iteration and is memory inefficient. A
natural way to resolve this problem is to apply online stochastic gradient
descent (SGD) so that the per-step time and memory complexity can be reduced to
constant with respect to $t$, but a contextual bandit policy based on online
SGD updates that balances exploration and exploitation has remained elusive. In
this work, we show that online SGD can be applied to the generalized linear
bandit problem. The proposed SGD-TS algorithm, which uses a single-step SGD
update to exploit past information and uses Thompson Sampling for exploration,
achieves $\tilde{O}(\sqrt{T})$ regret with the total time complexity that
scales linearly in $T$ and $d$, where $T$ is the total number of rounds and $d$
is the number of features. Experimental results show that SGD-TS consistently
outperforms existing algorithms on both synthetic and real datasets.
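To make the mechanism above concrete, here is a minimal sketch of an online-SGD-plus-Thompson-Sampling loop for a logistic (generalized linear) reward model. It is an illustration under simplifying assumptions, not the paper's exact SGD-TS procedure: the helper names `contexts_fn` and `reward_fn`, the step-size schedule `eta / sqrt(t)`, and the perturbation scale `nu` are hypothetical, and SGD-TS's initial exploration stage and covariance choices are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_ts_sketch(contexts_fn, reward_fn, T, d, eta=0.1, nu=0.1, seed=0):
    """Illustrative SGD + Thompson Sampling loop for a logistic bandit.

    contexts_fn(t) -> (K, d) array of arm features at round t (assumed).
    reward_fn(t, arm) -> observed reward in [0, 1] (assumed).
    """
    rng = np.random.default_rng(seed)
    theta_hat = np.zeros(d)                    # single running estimate
    total_reward = 0.0
    for t in range(1, T + 1):
        X = contexts_fn(t)
        # Thompson-style exploration: perturb the SGD iterate.
        theta_tilde = theta_hat + nu * rng.standard_normal(d)
        arm = int(np.argmax(X @ theta_tilde))
        r = reward_fn(t, arm)
        # Single-step online SGD update on the logistic log-likelihood;
        # only the current observation is touched, so memory stays constant.
        grad = (sigmoid(X[arm] @ theta_hat) - r) * X[arm]
        theta_hat -= (eta / np.sqrt(t)) * grad
        total_reward += r
    return theta_hat, total_reward
```

Each round touches only the current context and reward, so per-round time and memory are constant in $t$, consistent with the linear-in-$T$ total complexity claimed in the abstract.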
Related papers
- Indexed Minimum Empirical Divergence-Based Algorithms for Linear Bandits [55.938644481736446]
Indexed Minimum Empirical Divergence (IMED) is a highly effective approach to the multi-armed bandit problem.
It has been observed to empirically outperform UCB-based algorithms and Thompson Sampling.
We present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms.
arXiv Detail & Related papers (2024-05-24T04:11:58Z)
- Efficient and Adaptive Posterior Sampling Algorithms for Bandits [5.050520326139362]
We study Thompson Sampling-based algorithms for bandits with bounded rewards.
We propose two parameterized Thompson Sampling-based algorithms.
Both algorithms achieve an $O\left(K\ln^{\alpha+1}(T)/\Delta\right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $\Delta$ denotes the single-round performance loss when pulling a sub-optimal arm.
arXiv Detail & Related papers (2024-05-02T05:24:28Z)
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as the total variation budget $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z)
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed by Lu et al. (2021) under the Generalized Linear Model (GLM) framework.
To overcome the computational infeasibility and theoretical limitations of existing algorithms, we first propose the G-ESTT framework.
We show that G-ESTT can achieve a regret bound of $\tilde{O}(\sqrt{(d_1+d_2)^{3/2}Mr^{3/2}T})$, while G-ESTS attains a $\tilde{O}(\cdot)$ bound of its own.
arXiv Detail & Related papers (2024-01-14T14:14:19Z)
- Efficient Algorithms for Generalized Linear Bandits with Heavy-tailed Rewards [40.99322897009357]
We propose two novel algorithms based on truncation and mean of medians.
Our truncation-based algorithm supports online learning, distinguishing it from existing truncation-based approaches.
Our algorithms improve the regret bounds by a logarithmic factor compared to existing algorithms when $\epsilon=1$.
arXiv Detail & Related papers (2023-10-28T13:01:10Z)
- Discounted Thompson Sampling for Non-Stationary Bandit Problems [13.656518163592349]
Non-stationary multi-armed bandit (NS-MAB) problems have recently received significant attention.
We propose Discounted Thompson Sampling (DS-TS) with Gaussian priors to address both abruptly-changing and smoothly-changing non-stationary settings.
Our algorithm passively adapts to changes by incorporating a discount factor into Thompson Sampling.
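As a rough illustration of that idea (not the exact DS-TS algorithm: the priors, variance scaling, and the values of `gamma` and `sigma` below are assumptions), a Gaussian Thompson Sampling loop can geometrically down-weight past statistics before each update:

```python
import numpy as np

def discounted_gaussian_ts(K, T, reward_fn, gamma=0.95, sigma=1.0, seed=0):
    """Sketch of discounted Thompson Sampling with Gaussian posteriors.

    Past observations are down-weighted by gamma in (0, 1), so the
    per-arm posteriors can track rewards that drift over time.
    """
    rng = np.random.default_rng(seed)
    s = np.zeros(K)   # discounted sum of rewards per arm
    n = np.zeros(K)   # discounted pull counts per arm
    choices = []
    for t in range(T):
        mean = s / np.maximum(n, 1e-8)
        std = sigma / np.sqrt(np.maximum(n, 1e-8))
        arm = int(np.argmax(rng.normal(mean, std)))   # posterior sampling
        r = reward_fn(t, arm)
        s *= gamma                                    # discount the past...
        n *= gamma
        s[arm] += r                                   # ...then add the new sample
        n[arm] += 1.0
        choices.append(arm)
    return choices
```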
arXiv Detail & Related papers (2023-05-18T05:29:52Z)
- No Regrets for Learning the Prior in Bandits [30.478188057004175]
AdaTS is a Thompson sampling algorithm that adapts sequentially to bandit tasks.
AdaTS is a fully Bayesian algorithm that can be implemented efficiently in several classes of bandit problems.
arXiv Detail & Related papers (2021-07-13T15:51:32Z)
- Regret and Belief Complexity Trade-off in Gaussian Process Bandits via Information Thresholding [42.669970064867556]
We show how to characterize the trade-off between regret bounds of GP bandit algorithms and complexity of the posterior distributions.
We observe state-of-the-art accuracy and complexity trade-offs for GP bandit algorithms applied to global optimization.
arXiv Detail & Related papers (2020-03-23T21:05:15Z)
- MOTS: Minimax Optimal Thompson Sampling [89.2370817955411]
It has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems.
We propose a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step.
We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the optimal regret bound for Gaussian rewards when $T$ approaches infinity (see the sketch after this entry).
arXiv Detail & Related papers (2020-03-03T21:24:39Z)
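The clipping idea behind MOTS, described in the entry above, can be sketched as follows. This is a heavily simplified illustration: the threshold form and the constants `alpha` and `rho` are placeholders, not the exact quantities analyzed in the paper.

```python
import numpy as np

def clipped_ts_step(mu_hat, counts, T, rng, alpha=4.0, rho=0.5):
    """One arm-selection step of a clipped (MOTS-style) Thompson Sampling sketch.

    mu_hat: empirical mean reward per arm; counts: pulls per arm so far.
    Each Gaussian posterior sample is capped by a UCB-style threshold,
    which limits over-exploration of arms that are already well sampled.
    """
    K = len(mu_hat)
    n = np.maximum(counts, 1.0)
    samples = rng.normal(mu_hat, 1.0 / np.sqrt(rho * n))
    # log^+(x) = max(log(x), 0): the cap widens only for rarely pulled arms.
    log_plus = np.log(np.maximum(T / (K * n), 1.0))
    cap = mu_hat + np.sqrt((alpha / n) * log_plus)
    return int(np.argmax(np.minimum(samples, cap)))
```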
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.