Deep Upper Confidence Bound Algorithm for Contextual Bandit Ranking of
Information Selection
- URL: http://arxiv.org/abs/2110.04127v1
- Date: Fri, 8 Oct 2021 13:32:14 GMT
- Title: Deep Upper Confidence Bound Algorithm for Contextual Bandit Ranking of
Information Selection
- Authors: Michael Rawson, Jade Freeman
- Abstract summary: Contextual multi-armed bandits (CMAB) have been widely used for learning to filter and prioritize information according to a user's interest.
In this work, we analyze top-K ranking under the CMAB framework where the top-K arms are chosen iteratively to maximize a reward.
We introduce a novel algorithm called the Deep Upper Confidence Bound (UCB) algorithm.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Contextual multi-armed bandits (CMAB) have been widely used for learning to
filter and prioritize information according to a user's interest. In this work,
we analyze top-K ranking under the CMAB framework where the top-K arms are
chosen iteratively to maximize a reward. The context, which represents a set of
observable factors related to the user, is used to increase prediction accuracy
compared to a standard multi-armed bandit. Contextual bandit methods have
mostly been studied under strict linearity assumptions, but we drop that
assumption and learn non-linear stochastic reward functions with deep neural
networks. We introduce a novel algorithm called the Deep Upper Confidence Bound
(UCB) algorithm. Deep UCB balances exploration and exploitation with a separate
neural network to model the learning convergence. We compare the performance of
many bandit algorithms varying K over real-world data sets with
high-dimensional data and non-linear reward functions. Empirical results show
that Deep UCB often outperforms the other algorithms, though its performance is
sensitive to the problem and reward setup. Additionally, we prove theoretical
regret bounds for Deep UCB that guarantee convergence to optimality for a weak
class of CMAB problems.
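As a concrete illustration of the selection rule described in the abstract, the following is a minimal sketch (not the authors' implementation) of Deep-UCB-style top-K arm selection. It assumes a reward network and a separate uncertainty network have already been trained; the function names, the additive bonus form, and the beta weight are illustrative assumptions.

```python
# Hypothetical sketch of Deep-UCB-style top-K selection, not the paper's code.
# Assumes two trained models: one predicting mean reward, one predicting uncertainty.
import numpy as np

def select_top_k(contexts, reward_net, uncertainty_net, k, beta=1.0):
    """Pick the k arms with the highest UCB-style score.

    contexts        -- array of shape (n_arms, d), one context/feature vector per arm
    reward_net      -- callable mapping (n_arms, d) -> (n_arms,) predicted mean rewards
    uncertainty_net -- callable mapping (n_arms, d) -> (n_arms,) predicted uncertainties
    beta            -- exploration weight on the uncertainty bonus (assumed hyperparameter)
    """
    mean = reward_net(contexts)                  # exploitation: predicted reward
    bonus = uncertainty_net(contexts)            # exploration: modelled uncertainty
    scores = mean + beta * np.maximum(bonus, 0)  # optimistic (upper-confidence) score
    return np.argsort(scores)[::-1][:k]          # indices of the k highest-scoring arms
```

With the two callables replaced by the trained networks, the returned indices would be the K items shown to the user in that round; consistent with the abstract, the exploration bonus is expected to shrink as the second network learns that the reward predictions have converged.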
Related papers
- Continuous K-Max Bandits [54.21533414838677]
We study the $K$-Max multi-armed bandits problem with continuous outcome distributions and weak value-index feedback.
This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc.
Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds.
arXiv Detail & Related papers (2025-02-19T06:37:37Z)
- Online Clustering of Dueling Bandits [59.09590979404303]
We introduce the first "clustering of dueling bandit algorithms" to enable collaborative decision-making based on preference feedback.
We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB) which models the user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB) which uses a neural network to model complex, non-linear user reward functions.
arXiv Detail & Related papers (2025-02-04T07:55:41Z)
- UCB algorithms for multi-armed bandits: Precise regret and adaptive inference [6.349503549199403]
Upper Confidence Bound (UCB) algorithms are a widely-used class of sequential algorithms for the $K$-armed bandit problem (a minimal sketch of the classic UCB index rule appears after this list).
We show that the Lai-Robbins regret formula is exact if and only if the sub-optimality gaps exceed the order $\sigma\sqrt{K\log T/T}$.
We also show that its maximal regret deviates from the minimax regret by a logarithmic factor, thereby settling its strict minimax optimality in the negative.
arXiv Detail & Related papers (2024-12-09T01:14:02Z)
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- Online Clustering of Bandits with Misspecified User Models [42.56440072468658]
The contextual linear bandit is an online learning problem in which, given arm features, a learning agent selects an arm at each round to maximize cumulative reward in the long run.
A line of works, called clustering of bandits (CB), utilizes the collaborative effect over user preferences and has shown significant improvements over classic linear bandit algorithms.
In this paper, we are the first to present the important problem of clustering of bandits with misspecified user models (CBMUM).
We devise two robust CB algorithms, RCLUMB and RSCLUMB, that can accommodate the inaccurate user preference estimations and erroneous clustering caused by model misspecifications.
arXiv Detail & Related papers (2023-10-04T10:40:50Z)
- Neural Contextual Bandits without Regret [47.73483756447701]
We propose algorithms for contextual bandits harnessing neural networks to approximate the unknown reward function.
We show that our approach converges to the optimal policy at a $\tilde{\mathcal{O}}(T^{-1/(2d)})$ rate, where $d$ is the dimension of the context.
arXiv Detail & Related papers (2021-07-07T11:11:34Z)
- Bayesian decision-making under misspecified priors with applications to meta-learning [64.38020203019013]
Thompson sampling and other sequential decision-making algorithms are popular approaches to tackle explore/exploit trade-offs in contextual bandits.
We show that performance degrades gracefully with misspecified priors.
arXiv Detail & Related papers (2021-07-03T23:17:26Z)
- Neural Contextual Bandits with Deep Representation and Shallow Exploration [105.8099566651448]
We propose a novel learning algorithm that transforms the raw feature vector using the last hidden layer of a deep ReLU neural network.
Compared with existing neural contextual bandit algorithms, our approach is computationally much more efficient since it only needs to explore in the last layer of the deep neural network.
arXiv Detail & Related papers (2020-12-03T09:17:55Z)
- The Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms [10.662105162882526]
We investigate a $k$-armed bandit problem in the many-armed regime.
Our findings suggest a new form of free exploration beneficial to greedy algorithms in the many-armed context.
arXiv Detail & Related papers (2020-02-24T08:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
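For reference, and as a contrast to the contextual Deep UCB method above, here is a minimal sketch of the classic (non-contextual) UCB1 index rule that the UCB-based related papers analyse; this is standard textbook UCB1, not code from any paper listed here.

```python
# Textbook UCB1 index rule for a K-armed bandit (no context), for comparison only.
import math

def ucb1_index(mean_reward, pulls, t, c=2.0):
    """UCB1 score for one arm: empirical mean plus a confidence-width bonus."""
    if pulls == 0:
        return float("inf")  # force each arm to be tried at least once
    return mean_reward + math.sqrt(c * math.log(t) / pulls)

def choose_arm(means, pulls, t):
    """Select the arm with the largest UCB1 index at round t."""
    scores = [ucb1_index(m, n, t) for m, n in zip(means, pulls)]
    return max(range(len(scores)), key=scores.__getitem__)
```

The log(t)/pulls bonus plays the same role as the uncertainty network in Deep UCB: it inflates the score of arms that have been tried rarely and decays as evidence accumulates.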