Learning Contextual Bandits Through Perturbed Rewards
- URL: http://arxiv.org/abs/2201.09910v1
- Date: Mon, 24 Jan 2022 19:10:22 GMT
- Title: Learning Contextual Bandits Through Perturbed Rewards
- Authors: Yiling Jia, Weitong Zhang, Dongruo Zhou, Quanquan Gu, Hongning Wang
- Abstract summary: We show that a $\tilde{O}(\tilde{d}\sqrt{T})$ regret upper bound is still achievable under standard regularity conditions.
We perturb the rewards when updating the neural network to eliminate the need for explicit exploration.
- Score: 107.6210145983805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thanks to the power of representation learning, neural contextual bandit
algorithms demonstrate remarkable performance improvement against their
classical counterparts. But because their exploration has to be performed in
the entire neural network parameter space to obtain nearly optimal regret, the
resulting computational cost is prohibitively high. We perturb the rewards when
updating the neural network to eliminate the need for explicit exploration and
the corresponding computational overhead. We prove that a
$\tilde{O}(\tilde{d}\sqrt{T})$ regret upper bound is still achievable under
standard regularity conditions, where $T$ is the number of rounds of
interactions and $\tilde{d}$ is the effective dimension of a neural tangent
kernel matrix. Extensive comparisons with several benchmark contextual bandit
algorithms, including two recent neural contextual bandit models, demonstrate
the effectiveness and computational efficiency of our proposed neural bandit
algorithm.
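To make the idea concrete, below is a minimal, illustrative sketch of a perturbed-reward update loop in PyTorch. It is not the paper's exact algorithm: the network architecture, the perturbation scale `nu`, and the plain SGD refitting schedule are assumptions made for this example. The point it illustrates is that arm selection is purely greedy, and the only source of exploration is the Gaussian noise added to the observed rewards before each refit.

```python
# Illustrative sketch only: a greedy neural bandit trained on Gaussian-perturbed
# rewards. `nu`, the network width, and the SGD schedule are assumptions for the
# example, not the paper's exact specification.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    def __init__(self, dim: int, width: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, width), nn.ReLU(), nn.Linear(width, 1))

    def forward(self, x):
        return self.body(x).squeeze(-1)

def run_perturbed_reward_bandit(contexts_fn, reward_fn, dim, T=500, nu=0.1,
                                inner_steps=20, lr=1e-2):
    """contexts_fn(t) -> (K, dim) tensor of candidate arm contexts;
    reward_fn(x) -> observed scalar reward for the chosen context x."""
    model = RewardNet(dim)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    xs, rs = [], []
    for t in range(T):
        X = contexts_fn(t)
        with torch.no_grad():
            a = int(model(X).argmax())          # act greedily: no explicit bonus
        xs.append(X[a])
        rs.append(float(reward_fn(X[a])))
        hist_x = torch.stack(xs)
        hist_r = torch.tensor(rs)
        # Exploration comes from refitting on a freshly perturbed reward history.
        perturbed_r = hist_r + nu * torch.randn_like(hist_r)
        for _ in range(inner_steps):
            loss = ((model(hist_x) - perturbed_r) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

For instance, `contexts_fn` could return a handful of random feature vectors per round and `reward_fn` a noisy nonlinear function of the chosen context; the refit-on-perturbed-history step replaces the confidence-bonus or posterior-sampling computation that would otherwise have to be carried out over the full parameter space.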
Related papers
- Combinatorial Neural Bandits [10.463365653675694]
We consider a contextual bandit problem where in each round a learning agent selects a subset of arms and receives feedback on the selected arms according to their scores.
We propose two algorithms: Combinatorial Neural UCB ($\texttt{CN-UCB}$) and Combinatorial Neural Thompson Sampling ($\texttt{CN-TS}$).
arXiv Detail & Related papers (2023-05-31T23:27:58Z) - Efficiently Learning Any One Hidden Layer ReLU Network From Queries [27.428198343906352]
We give the first polynomial-time algorithm for learning arbitrary one-hidden-layer neural networks with ReLU activations, given black-box query access to the network.
Ours is the first algorithm with fully polynomial-time guarantees of efficiency, even for worst-case networks.
arXiv Detail & Related papers (2021-11-08T18:59:40Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - Neural Contextual Bandits without Regret [47.73483756447701]
We propose algorithms for contextual bandits harnessing neural networks to approximate the unknown reward function.
We show that our approach converges to the optimal policy at a $\tilde{\mathcal{O}}(T^{-1/(2d)})$ rate, where $d$ is the dimension of the context.
arXiv Detail & Related papers (2021-07-07T11:11:34Z) - Neural Contextual Bandits with Deep Representation and Shallow Exploration [105.8099566651448]
We propose a novel learning algorithm that transforms the raw feature vector using the last hidden layer of a deep ReLU neural network.
Compared with existing neural contextual bandit algorithms, our approach is computationally much more efficient since it only needs to explore in the last layer of the deep neural network.
arXiv Detail & Related papers (2020-12-03T09:17:55Z) - Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z) - Neural Thompson Sampling [94.82847209157494]
We propose a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation.
At the core of our algorithm is a novel posterior distribution of the reward, whose mean is the neural network approximator and whose variance is built upon the neural tangent features of the corresponding neural network (see the sketch after this list).
arXiv Detail & Related papers (2020-10-02T07:44:09Z)
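The Neural Thompson Sampling entry above describes the posterior used for arm selection: the mean is the network's prediction and the variance is built from the network's gradient (neural tangent) features. The following is a minimal sketch of that selection step under simplifying assumptions, not the authors' reference implementation; `lam` and `nu` are illustrative hyperparameters, and keeping the full covariance over all parameters is only feasible for small networks (a diagonal approximation is the usual workaround at scale).

```python
# Illustrative sketch of a NeuralTS-style arm selection step. Mean = network
# prediction; variance from gradient ("neural tangent") features g = df(x;theta)/dtheta.
import torch
import torch.nn as nn

def grad_features(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Flattened gradient of the scalar output w.r.t. all network parameters."""
    model.zero_grad()
    model(x.unsqueeze(0)).squeeze().backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

def sherman_morrison_update(A_inv: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """Rank-one update of (A + g g^T)^{-1} after observing feature g."""
    Ag = A_inv @ g
    return A_inv - torch.outer(Ag, Ag) / (1.0 + g @ Ag)

def neural_ts_select(model, X, A_inv, lam=1.0, nu=0.1):
    """Sample one score per candidate context in X (shape (K, dim)); return argmax."""
    scores = []
    for x in X:
        g = grad_features(model, x)
        mean = model(x.unsqueeze(0)).item()
        std = (lam * nu**2 * (g @ A_inv @ g)).clamp_min(0.0).sqrt().item()
        scores.append(mean + std * torch.randn(()).item())
    return int(torch.tensor(scores).argmax())
```

In this sketch `A_inv` would start as `torch.eye(num_params) / lam` and be updated with `sherman_morrison_update` using the chosen arm's gradient feature after each round; the width-dependent scaling of the features used in the paper's analysis is glossed over here.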