Regularized OFU: an Efficient UCB Estimator for Non-linear Contextual Bandit
- URL: http://arxiv.org/abs/2106.15128v1
- Date: Tue, 29 Jun 2021 07:28:15 GMT
- Title: Regularized OFU: an Efficient UCB Estimator for Non-linear Contextual Bandit
- Authors: Yichi Zhou, Shihong Song, Huishuai Zhang, Jun Zhu, Wei Chen, Tie-Yan
Liu
- Abstract summary: Balancing exploration and exploitation (EE) is a fundamental problem in contextual bandit.
We propose a novel OFU algorithm named regularized OFU (ROFU).
- Score: 90.0208037317206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Balancing exploration and exploitation (EE) is a fundamental problem in
contextual bandit. One powerful principle for the EE trade-off is Optimism in Face
of Uncertainty (OFU), in which the agent takes the action according to an upper
confidence bound (UCB) of reward. OFU has achieved (near-)optimal regret bounds
for linear/kernel contextual bandits. However, it is in general unknown how to
derive efficient and effective EE trade-off methods for non-linear complex tasks,
such as contextual bandit with a deep neural network as the reward function. In
this paper, we propose a novel OFU algorithm named regularized OFU (ROFU). In ROFU,
we measure the uncertainty of the reward by a differentiable function
and compute the upper confidence bound by solving a regularized optimization
problem. We prove that, for multi-armed bandit, kernel contextual bandit and
neural tangent kernel bandit, ROFU achieves (near-)optimal regret bounds with a
certain uncertainty measure, which theoretically justifies its effectiveness on the
EE trade-off. Importantly, ROFU admits a very efficient implementation with a
gradient-based optimizer, which easily extends to general deep neural network
models beyond the neural tangent kernel, in sharp contrast with previous OFU
methods. The empirical evaluation demonstrates that ROFU works extremely well
for contextual bandits under various settings.
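The abstract's core recipe, computing an optimistic reward estimate by maximizing the predicted reward under a regularization penalty that keeps the perturbed parameters consistent with the fitted model, and solving that problem with a few gradient steps, can be illustrated with a short sketch. This is not the authors' code: the network architecture, the squared-loss regularizer, and all hyperparameters (eta, step count, learning rate) are assumptions made purely for illustration.

```python
# Hedged sketch of a regularized-optimization UCB estimate (not the paper's implementation).
import copy
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small reward model over context-action features (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x)

def regularized_ucb(model, x, hist_x, hist_r, eta=1.0, steps=20, lr=1e-2):
    """Optimistic reward estimate for one feature vector x (shape [dim]).

    Approximately solves  max_theta'  f_{theta'}(x) - eta * L(theta'; data)
    with a few gradient steps started from the fitted parameters, so the
    perturbed network stays consistent with the observed rewards.
    """
    probe = copy.deepcopy(model)                      # start from the fitted weights
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        optimistic = probe(x.unsqueeze(0)).squeeze()                 # f_{theta'}(x)
        data_loss = ((probe(hist_x).squeeze(-1) - hist_r) ** 2).mean()
        loss = -(optimistic - eta * data_loss)        # gradient ascent on the objective
        loss.backward()
        opt.step()
    with torch.no_grad():
        return probe(x.unsqueeze(0)).item()

if __name__ == "__main__":
    dim, n_hist, n_actions = 8, 128, 5
    model = RewardNet(dim)
    hist_x, hist_r = torch.randn(n_hist, dim), torch.randn(n_hist)
    candidates = torch.randn(n_actions, dim)          # context-action features this round
    scores = [regularized_ucb(model, a, hist_x, hist_r) for a in candidates]
    print("chosen action:", max(range(n_actions), key=lambda i: scores[i]))
```

In a bandit loop, the agent would play the action with the largest such score, observe the reward, refit the base model on the updated history, and repeat; only the small per-action optimization above is needed to obtain the optimistic estimate.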
Related papers
- Stochastic Bandits with ReLU Neural Networks [40.41457480347015]
We show that a $\tilde{O}(\sqrt{T})$ regret guarantee is achievable by considering bandits with one-layer ReLU neural networks.
We propose an OFU-ReLU algorithm that can achieve this upper bound.
arXiv Detail & Related papers (2024-05-12T16:54:57Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - $\alpha$-Fair Contextual Bandits [10.74025233418392]
Contextual bandit algorithms are at the core of many applications, including recommender systems, clinical trials, and optimal portfolio selection.
One of the most popular problems studied in the contextual bandit literature is to maximize the sum of the rewards in each round.
In this paper, we consider the $\alpha$-Fair Contextual Bandits problem, where the objective is to maximize the global $\alpha$-fair utility function.
arXiv Detail & Related papers (2023-10-22T03:42:59Z) - On High-dimensional and Low-rank Tensor Bandits [53.0829344775769]
This work studies a general tensor bandits model, where actions and system parameters are represented by tensors as opposed to vectors.
A novel bandit algorithm, coined TOFU (Tensor Optimism in the Face of Uncertainty), is developed.
Theoretical analyses show that TOFU improves the best-known regret upper bound by a multiplicative factor that grows exponentially in the system order.
arXiv Detail & Related papers (2023-05-06T00:43:36Z) - Anti-Exploration by Random Network Distillation [63.04360288089277]
We show that a naive choice of conditioning for the Random Network Distillation (RND) is not discriminative enough to be used as an uncertainty estimator.
We show that this limitation can be avoided with conditioning based on Feature-wise Linear Modulation (FiLM)
We evaluate it on the D4RL benchmark, showing that it is capable of achieving performance comparable to ensemble-based methods and outperforming ensemble-free approaches by a wide margin.
arXiv Detail & Related papers (2023-01-31T13:18:33Z) - Contextual bandits with concave rewards, and an application to fair
ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR)
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - Deep Upper Confidence Bound Algorithm for Contextual Bandit Ranking of
Information Selection [0.0]
Contextual multi-armed bandits (CMAB) have been widely used for learning to filter and prioritize information according to a user's interest.
In this work, we analyze top-K ranking under the CMAB framework where the top-K arms are chosen iteratively to maximize a reward.
We introduce a novel algorithm called the Deep Upper Confidence Bound (UCB) algorithm.
arXiv Detail & Related papers (2021-10-08T13:32:14Z)