Neural Thompson Sampling
- URL: http://arxiv.org/abs/2010.00827v2
- Date: Thu, 30 Dec 2021 09:21:04 GMT
- Title: Neural Thompson Sampling
- Authors: Weitong Zhang and Dongruo Zhou and Lihong Li and Quanquan Gu
- Abstract summary: We propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation.
At the core of our algorithm is a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network.
- Score: 94.82847209157494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thompson Sampling (TS) is one of the most effective algorithms for solving
contextual multi-armed bandit problems. In this paper, we propose a new
algorithm, called Neural Thompson Sampling, which adapts deep neural networks
for both exploration and exploitation. At the core of our algorithm is a novel
posterior distribution of the reward, where its mean is the neural network
approximator, and its variance is built upon the neural tangent features of the
corresponding neural network. We prove that, provided the underlying reward
function is bounded, the proposed algorithm is guaranteed to achieve a
cumulative regret of $\mathcal{O}(T^{1/2})$, which matches the regret of other
contextual bandit algorithms in terms of total round number $T$. Experimental
comparisons with other benchmark bandit algorithms on various data sets
corroborate our theory.
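To make the posterior construction in the abstract concrete, here is a minimal sketch of Neural Thompson Sampling in PyTorch. It is an illustration under simplifying assumptions (the width normalization and the network-retraining step are abbreviated), and names such as `NeuralTS`, `nu`, and `lam` are ours, not from the authors' code: the sampled reward for each arm is Gaussian with mean equal to the network output and variance built from the gradient (neural tangent) features.

```python
import torch
import torch.nn as nn


class NeuralTS:
    """Illustrative sketch of Neural Thompson Sampling (not the authors' reference code).

    Posterior over the reward of a context x:
        mean     = f(x; theta)                       -- the network output
        variance = lam * nu**2 * g(x)^T U^{-1} g(x)  -- from neural tangent features
    where g(x) = grad_theta f(x; theta), and U accumulates outer products of
    past gradients. The paper's 1/m width normalization is folded into lam here.
    """

    def __init__(self, net: nn.Module, nu: float = 0.1, lam: float = 1.0):
        self.net, self.nu, self.lam = net, nu, lam
        num_params = sum(p.numel() for p in net.parameters())
        self.U = lam * torch.eye(num_params)  # starts as lam * I

    def _grad_features(self, x: torch.Tensor) -> torch.Tensor:
        """Gradient of the scalar network output w.r.t. all parameters, flattened."""
        self.net.zero_grad()
        self.net(x).squeeze().backward()
        return torch.cat([p.grad.flatten() for p in self.net.parameters()])

    def select_arm(self, contexts: list) -> int:
        """Sample a reward for each candidate context and play the argmax."""
        sampled = []
        for x in contexts:
            g = self._grad_features(x)
            mean = self.net(x).item()
            var = self.lam * self.nu ** 2 * float(g @ torch.linalg.solve(self.U, g))
            sampled.append(mean + max(var, 0.0) ** 0.5 * torch.randn(()).item())
        return max(range(len(sampled)), key=lambda k: sampled[k])

    def update(self, x: torch.Tensor, reward: float):
        """Rank-one update of U; retraining the network on past data is omitted."""
        g = self._grad_features(x)
        self.U += torch.outer(g, g)
```

At each round one would call `select_arm` on the candidate contexts, observe the chosen arm's reward, call `update`, and then take a few gradient-descent steps on the squared-error loss over all observations so far, which is the part this sketch leaves out.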
Related papers
- Feel-Good Thompson Sampling for Contextual Dueling Bandits [49.450050682705026]
We propose a Thompson sampling algorithm, named FGTS.CDB, for linear contextual dueling bandits.
At the core of our algorithm is a new Feel-Good exploration term specifically tailored for dueling bandits.
Our algorithm achieves nearly minimax-optimal regret, i.e., $\tilde{\mathcal{O}}(d\sqrt{T})$, where $d$ is the model dimension and $T$ is the time horizon.
arXiv Detail & Related papers (2024-04-09T04:45:18Z)
- Neural Exploitation and Exploration of Contextual Bandits [51.25537742455235]
We study utilizing neural networks for the exploitation and exploration of contextual multi-armed bandits.
EE-Net is a novel neural-based exploitation and exploration strategy.
We show that EE-Net outperforms related linear and neural contextual bandit baselines on real-world datasets.
arXiv Detail & Related papers (2023-05-05T18:34:49Z)
- Learning Contextual Bandits Through Perturbed Rewards [107.6210145983805]
We show that a $\tilde{O}(\tilde{d}\sqrt{T})$ regret upper bound is still achievable under standard regularity conditions.
We perturb the rewards when updating the neural network to eliminate the need of explicit exploration.
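A rough sketch of the reward-perturbation idea in the entry above, assuming additive Gaussian noise on the training targets (the paper's exact perturbation scheme and analysis are more involved; all names below are illustrative):

```python
import torch
import torch.nn as nn


def greedy_arm_with_perturbed_rewards(net, past_contexts, past_rewards,
                                      candidate_contexts, sigma=1.0,
                                      steps=100, lr=1e-3):
    """Fit the reward network to noise-perturbed rewards, then act greedily.

    The Gaussian noise added to the training targets randomizes the fitted
    estimates from round to round, which plays the role of exploration, so no
    explicit exploration bonus or posterior sampling is used at selection time.
    """
    targets = past_rewards + sigma * torch.randn_like(past_rewards)  # perturbed rewards
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(net(past_contexts).squeeze(-1), targets)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return int(net(candidate_contexts).squeeze(-1).argmax())
```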
arXiv Detail & Related papers (2022-01-24T19:10:22Z)
- Efficiently Learning Any One Hidden Layer ReLU Network From Queries [27.428198343906352]
We give the first polynomial-time algorithm for learning arbitrary one-hidden-layer neural networks with ReLU activations, given black-box query access to the network.
Ours is the first algorithm with fully polynomial-time efficiency guarantees even for worst-case networks.
arXiv Detail & Related papers (2021-11-08T18:59:40Z)
- EE-Net: Exploitation-Exploration Neural Networks in Contextual Bandits [52.98326168071513]
"EE-Net" is a neural-based bandit approach with a novel exploration strategy.
We show that EE-Net achieves $\mathcal{O}(\sqrt{T\log T})$ regret, which is tighter than existing state-of-the-art neural bandit algorithms.
arXiv Detail & Related papers (2021-10-07T04:12:36Z)
- Neural Contextual Bandits without Regret [47.73483756447701]
We propose algorithms for contextual bandits harnessing neural networks to approximate the unknown reward function.
We show that our approach converges to the optimal policy at a $\tilde{\mathcal{O}}(T^{-1/(2d)})$ rate, where $d$ is the dimension of the context.
arXiv Detail & Related papers (2021-07-07T11:11:34Z)
- Thompson Sampling for Unimodal Bandits [21.514495320038712]
We propose a Thompson Sampling algorithm for unimodal bandits, where the expected reward is unimodal over the partially ordered arms.
For Gaussian rewards, the regret of our algorithm is $\mathcal{O}(\log T)$, which is far better than standard Thompson Sampling algorithms.
arXiv Detail & Related papers (2021-06-15T14:40:34Z)
- Neural Contextual Bandits with Deep Representation and Shallow Exploration [105.8099566651448]
We propose a novel learning algorithm that transforms the raw feature vector using the last hidden layer of a deep ReLU neural network.
Compared with existing neural contextual bandit algorithms, our approach is computationally much more efficient since it only needs to explore in the last layer of the deep neural network.
arXiv Detail & Related papers (2020-12-03T09:17:55Z)
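As a rough illustration of the "deep representation, shallow exploration" idea in the entry above: the last hidden layer of the deep ReLU network supplies the feature map, and exploration happens only over those last-layer features via a LinUCB-style rule. The class below is an assumption-laden sketch based on the summary, not the paper's exact algorithm.

```python
import torch
import torch.nn as nn


class ShallowExplorationBandit:
    """Sketch: deep representation + linear UCB exploration in the last layer only.

    phi(x) = last-hidden-layer output of a deep ReLU network (the "deep
    representation"); arm selection uses a LinUCB-style score on phi(x):
        score(x) = theta^T phi(x) + alpha * sqrt(phi(x)^T A^{-1} phi(x))
    so the confidence computation lives only in the last layer.
    """

    def __init__(self, representation: nn.Module, feat_dim: int, alpha: float = 1.0):
        self.rep = representation          # maps context -> last-layer features
        self.alpha = alpha
        self.A = torch.eye(feat_dim)       # regularized Gram matrix of features
        self.b = torch.zeros(feat_dim)     # feature-weighted reward sums

    def select_arm(self, contexts: torch.Tensor) -> int:   # contexts: (K, d)
        with torch.no_grad():
            phi = self.rep(contexts)                        # (K, feat_dim)
        theta = torch.linalg.solve(self.A, self.b)          # ridge estimate
        mean = phi @ theta
        bonus = self.alpha * torch.sqrt(
            (phi * torch.linalg.solve(self.A, phi.T).T).sum(dim=1)
        )
        return int((mean + bonus).argmax())

    def update(self, context: torch.Tensor, reward: float):
        with torch.no_grad():
            phi = self.rep(context.unsqueeze(0)).squeeze(0)
        self.A += torch.outer(phi, phi)
        self.b += reward * phi
        # The representation network itself is also periodically retrained on
        # the observed rewards in the paper; that step is omitted in this sketch.
```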