Regularized OFU: an Efficient UCB Estimator for Non-linear Contextual Bandit
- URL: http://arxiv.org/abs/2106.15128v1
- Date: Tue, 29 Jun 2021 07:28:15 GMT
- Title: Regularized OFU: an Efficient UCB Estimator for Non-linear Contextual Bandit
- Authors: Yichi Zhou, Shihong Song, Huishuai Zhang, Jun Zhu, Wei Chen, Tie-Yan
Liu
- Abstract summary: Balancing exploration and exploitation (EE) is a fundamental problem in contextual bandit.
We propose a novel OFU algorithm named regularized OFU (ROFU).
- Score: 90.0208037317206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Balancing exploration and exploitation (EE) is a fundamental problem in
contextual bandit. One powerful principle for the EE trade-off is Optimism in Face
of Uncertainty (OFU), in which the agent takes the action according to an upper
confidence bound (UCB) of reward. OFU has achieved (near-)optimal regret bounds
for linear/kernel contextual bandits. However, it is in general unknown how to
derive efficient and effective EE trade-off methods for non-linear complex tasks,
such as contextual bandit with a deep neural network as the reward function. In
this paper, we propose a novel OFU algorithm named regularized OFU (ROFU). In ROFU,
we measure the uncertainty of the reward by a differentiable function
and compute the upper confidence bound by solving a regularized optimization
problem. We prove that, for multi-armed bandit, kernel contextual bandit and
neural tangent kernel bandit, ROFU achieves (near-)optimal regret bounds with a
certain uncertainty measure, which theoretically justifies its effectiveness on the
EE trade-off. Importantly, ROFU admits a very efficient implementation with a
gradient-based optimizer, which easily extends to general deep neural network
models beyond the neural tangent kernel, in sharp contrast with previous OFU
methods. The empirical evaluation demonstrates that ROFU works extremely well
for contextual bandits under various settings.
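The abstract's core recipe, computing an optimistic reward estimate by maximizing the predicted reward under a regularization penalty that keeps the perturbed parameters consistent with the fitted model, and solving that problem with a few gradient steps, can be illustrated with a short sketch. This is not the authors' code: the network architecture, the squared-loss regularizer, and all hyperparameters (eta, step count, learning rate) are assumptions made purely for illustration.

```python
# Hedged sketch of a regularized-optimization UCB estimate (not the paper's implementation).
import copy
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small reward model over context-action features (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x)

def regularized_ucb(model, x, hist_x, hist_r, eta=1.0, steps=20, lr=1e-2):
    """Optimistic reward estimate for one feature vector x (shape [dim]).

    Approximately solves  max_theta'  f_{theta'}(x) - eta * L(theta'; data)
    with a few gradient steps started from the fitted parameters, so the
    perturbed network stays consistent with the observed rewards.
    """
    probe = copy.deepcopy(model)                      # start from the fitted weights
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        optimistic = probe(x.unsqueeze(0)).squeeze()                 # f_{theta'}(x)
        data_loss = ((probe(hist_x).squeeze(-1) - hist_r) ** 2).mean()
        loss = -(optimistic - eta * data_loss)        # gradient ascent on the objective
        loss.backward()
        opt.step()
    with torch.no_grad():
        return probe(x.unsqueeze(0)).item()

if __name__ == "__main__":
    dim, n_hist, n_actions = 8, 128, 5
    model = RewardNet(dim)
    hist_x, hist_r = torch.randn(n_hist, dim), torch.randn(n_hist)
    candidates = torch.randn(n_actions, dim)          # context-action features this round
    scores = [regularized_ucb(model, a, hist_x, hist_r) for a in candidates]
    print("chosen action:", max(range(n_actions), key=lambda i: scores[i]))
```

In a bandit loop, the agent would play the action with the largest such score, observe the reward, refit the base model on the updated history, and repeat; only the small per-action optimization above is needed to obtain the optimistic estimate.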
Related papers
- Stochastic Bandits with ReLU Neural Networks [40.41457480347015]
We show that a $\tilde{O}(\sqrt{T})$ regret guarantee is achievable by considering bandits with one-layer ReLU neural networks.
We propose an OFU-ReLU algorithm that can achieve this upper bound.
arXiv Detail & Related papers (2024-05-12T16:54:57Z) - REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z) - $\alpha$-Fair Contextual Bandits [10.74025233418392]
Contextual bandit algorithms are at the core of many applications, including recommender systems, clinical trials, and optimal portfolio selection.
One of the most popular problems studied in the contextual bandit literature is to maximize the sum of the rewards in each round.
In this paper, we consider the $\alpha$-Fair Contextual Bandits problem, where the objective is to maximize the global $\alpha$-fair utility function.
arXiv Detail & Related papers (2023-10-22T03:42:59Z) - On High-dimensional and Low-rank Tensor Bandits [53.0829344775769]
This work studies a general tensor bandits model, where actions and system parameters are represented by tensors as opposed to vectors.
A novel bandit algorithm, coined TOFU (Tensor Optimism in the Face of Uncertainty), is developed.
Theoretical analyses show that TOFU improves the best-known regret upper bound by a multiplicative factor that grows exponentially in the system order.
arXiv Detail & Related papers (2023-05-06T00:43:36Z) - Anti-Exploration by Random Network Distillation [63.04360288089277]
We show that a naive choice of conditioning for the Random Network Distillation (RND) is not discriminative enough to be used as an uncertainty estimator.
We show that this limitation can be avoided with conditioning based on Feature-wise Linear Modulation (FiLM)
We evaluate it on the D4RL benchmark, showing that it is capable of achieving performance comparable to ensemble-based methods and outperforming ensemble-free approaches by a wide margin.
arXiv Detail & Related papers (2023-01-31T13:18:33Z) - Contextual bandits with concave rewards, and an application to fair
ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR)
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - Deep Upper Confidence Bound Algorithm for Contextual Bandit Ranking of
Information Selection [0.0]
Contextual multi-armed bandits (CMAB) have been widely used for learning to filter and prioritize information according to a user's interest.
In this work, we analyze top-K ranking under the CMAB framework where the top-K arms are chosen iteratively to maximize a reward.
We introduce a novel algorithm called the Deep Upper Confidence Bound (UCB) algorithm.
arXiv Detail & Related papers (2021-10-08T13:32:14Z)