Differentiable Linear Bandit Algorithm
- URL: http://arxiv.org/abs/2006.03000v1
- Date: Thu, 4 Jun 2020 16:43:55 GMT
- Title: Differentiable Linear Bandit Algorithm
- Authors: Kaige Yang and Laura Toni
- Abstract summary: Upper Confidence Bound is arguably the most commonly used method for linear multi-arm bandit problems.
We introduce a gradient estimator, which allows the confidence bound to be learned via gradient ascent.
We show that the proposed algorithm achieves a $\tilde{\mathcal{O}}(\hat{\beta}\sqrt{dT})$ upper bound of $T$-round regret, where $d$ is the dimension of arm features and $\hat{\beta}$ is the learned size of confidence bound.
- Score: 6.849358422233866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Upper Confidence Bound (UCB) is arguably the most commonly used method for
linear multi-arm bandit problems. While conceptually and computationally
simple, this method relies heavily on the confidence bounds, failing to strike
the optimal exploration-exploitation trade-off if these bounds are not properly set. In
the literature, confidence bounds are typically derived from concentration
inequalities based on assumptions on the reward distribution, e.g.,
sub-Gaussianity. The validity of these assumptions however is unknown in
practice. In this work, we aim at learning the confidence bound in a
data-driven fashion, making it adaptive to the actual problem structure.
Specifically, noting that existing UCB-typed algorithms are not differentiable
with respect to confidence bound, we first propose a novel differentiable
linear bandit algorithm. Then, we introduce a gradient estimator, which allows
the confidence bound to be learned via gradient ascent. Theoretically, we show
that the proposed algorithm achieves a
$\tilde{\mathcal{O}}(\hat{\beta}\sqrt{dT})$ upper bound of $T$-round regret,
where $d$ is the dimension of arm features and $\hat{\beta}$ is the learned
size of the confidence bound. Empirical results show that $\hat{\beta}$ is
significantly smaller than its theoretical upper bound and that the proposed
algorithm outperforms baselines on both simulated and real-world datasets.
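The core idea lends itself to a compact illustration. Below is a minimal sketch (Python/NumPy) of a LinUCB-style linear bandit in which the confidence-bound scale $\hat{\beta}$ is treated as a learnable parameter and updated by gradient ascent on observed reward. The softmax relaxation of the arm selection, the score-function gradient estimator, and all constants are illustrative assumptions, not the authors' exact construction.

```python
import numpy as np

# Minimal sketch (not the authors' exact algorithm): a LinUCB-style linear
# bandit where the confidence-bound scale beta is a learnable parameter.
# The argmax arm selection is relaxed to a softmax policy so that a
# score-function (REINFORCE-style) gradient estimator can update beta by
# gradient ascent on observed reward. The temperature tau and the softmax
# relaxation itself are illustrative assumptions.

rng = np.random.default_rng(0)
d, n_arms, T = 5, 20, 2000
theta_star = rng.normal(size=d) / np.sqrt(d)        # unknown reward parameter
arms = rng.normal(size=(n_arms, d)) / np.sqrt(d)    # fixed arm feature vectors

beta, lr, tau, lam = 1.0, 0.01, 0.1, 1.0            # learned bound size, step size, temperature, ridge
V = lam * np.eye(d)                                  # regularized design matrix
b = np.zeros(d)

for t in range(T):
    theta_hat = np.linalg.solve(V, b)                # ridge estimate of theta_star
    V_inv = np.linalg.inv(V)
    width = np.sqrt(np.einsum("ad,de,ae->a", arms, V_inv, arms))
    ucb = arms @ theta_hat + beta * width            # UCB index for each arm

    probs = np.exp((ucb - ucb.max()) / tau)          # softmax relaxation of the argmax
    probs /= probs.sum()
    a = rng.choice(n_arms, p=probs)

    x = arms[a]
    r = x @ theta_star + 0.1 * rng.normal()          # noisy linear reward

    # Score-function gradient of expected reward w.r.t. beta through the softmax:
    # d log pi(a) / d beta = (width[a] - sum_k pi(k) * width[k]) / tau
    grad_log_pi = (width[a] - probs @ width) / tau
    beta = max(0.0, beta + lr * r * grad_log_pi)     # gradient ascent on the bound size

    V += np.outer(x, x)                              # rank-one statistics update
    b += r * x
```

As the temperature shrinks, the relaxed policy approaches the standard argmax-UCB rule with a fixed $\beta$; the relaxation is only there to make the selection step differentiable so that $\beta$ can be adapted from data.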
Related papers
- Noise-Adaptive Confidence Sets for Linear Bandits and Application to Bayesian Optimization [15.275864909088577]
Adapting to a priori unknown noise level is a very important but challenging problem in sequential decision-making.
We propose a novel confidence set that is 'semi-adaptive' to the unknown sub-Gaussian parameter $\sigma_*^2$.
For bounded rewards, we propose a novel variance-adaptive confidence set that has much improved numerical performance upon prior art.
arXiv Detail & Related papers (2024-02-12T00:19:09Z) - Estimating Optimal Policy Value in General Linear Contextual Bandits [50.008542459050155]
In many bandit problems, the maximal reward achievable by a policy is often unknown in advance.
We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable.
We present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$.
arXiv Detail & Related papers (2023-02-19T01:09:24Z) - Pseudonorm Approachability and Applications to Regret Minimization [73.54127663296906]
We convert high-dimensional $\ell_\infty$-approachability problems to low-dimensional pseudonorm approachability problems.
We develop an algorithmic theory of pseudonorm approachability, analogous to previous work on approachability for $\ell_2$ and other norms.
arXiv Detail & Related papers (2023-02-03T03:19:14Z) - Optimal Regret Is Achievable with Bounded Approximate Inference Error:
An Enhanced Bayesian Upper Confidence Bound Framework [22.846260353176614]
We propose an Enhanced Bayesian Upper Confidence Bound (EBUCB) framework that can efficiently accommodate bandit problems with approximate inference.
We show that EBUCB can achieve the optimal regret order $O(\log T)$ if the inference error measured by two different $\alpha$-divergences is less than a constant.
Our study provides the first theoretical regret bound that is better than $o(T)$ in the setting of constant approximate inference error.
arXiv Detail & Related papers (2022-01-31T01:36:15Z) - Misspecified Gaussian Process Bandit Optimization [59.30399661155574]
Kernelized bandit algorithms have shown strong empirical and theoretical performance for this problem.
We introduce a misspecified kernelized bandit setting where the unknown function can be $\epsilon$-uniformly approximated by a function with a bounded norm in some Reproducing Kernel Hilbert Space (RKHS).
We show that our algorithm achieves optimal dependence on $\epsilon$ with no prior knowledge of misspecification.
arXiv Detail & Related papers (2021-11-09T09:00:02Z) - Dealing With Misspecification In Fixed-Confidence Linear Top-m
Identification [0.0]
We study the problem of identifying the $m$ arms with largest means under a fixed error rate $\delta$ (fixed-confidence Top-$m$ identification).
This problem is motivated by practical applications, especially in medicine and recommendation systems.
arXiv Detail & Related papers (2021-11-02T10:27:17Z) - Regret Lower Bound and Optimal Algorithm for High-Dimensional Contextual
Linear Bandit [10.604939762790517]
We prove a minimax lower bound, $\mathcal{O}\big((\log d)^{\frac{\alpha+1}{2}} T^{\frac{1-\alpha}{2}} + \log T\big)$, for the cumulative regret.
Second, we propose a simple and computationally efficient algorithm inspired by the general Upper Confidence Bound (UCB) strategy.
arXiv Detail & Related papers (2021-09-23T19:35:38Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises (a minimal sketch of this idea appears after this list).
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z) - Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of the training set size and number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9-36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
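As referenced in the randomized-exploration entry above, the perturbed-data idea can be illustrated with a short sketch. The plain ridge-regression setting, the noise scale `sigma_p`, and the helper name are assumptions for illustration, not the paper's exact RLSVI construction.

```python
import numpy as np

# Minimal sketch of exploration via perturbed training data (illustrative only):
# instead of adding explicit bonuses, refit a least-squares estimate on targets
# perturbed with i.i.d. Gaussian noise, then act greedily on that fit.

def perturbed_ls_action(X, y, arms, sigma_p=1.0, lam=1.0, rng=None):
    """Pick the greedy arm under a ridge fit of noise-perturbed targets."""
    rng = rng or np.random.default_rng()
    y_tilde = y + sigma_p * rng.normal(size=y.shape)       # i.i.d. scalar perturbations
    d = X.shape[1]
    theta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y_tilde)
    return int(np.argmax(arms @ theta))                    # act greedily on the perturbed fit
```

Refitting on freshly perturbed targets each round injects randomness into the fitted parameter that is larger along poorly explored directions, which is what drives exploration without explicit optimism bonuses.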
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.