Norm-Agnostic Linear Bandits
- URL: http://arxiv.org/abs/2205.01257v1
- Date: Tue, 3 May 2022 00:55:42 GMT
- Title: Norm-Agnostic Linear Bandits
- Authors: Spencer (Brady) Gales, Sunder Sethuraman, Kwang-Sung Jun
- Abstract summary: We propose novel algorithms that do not require such knowledge for the first time.
We analyze their regret bounds: one for the changing arm set setting and the other for the fixed arm set setting.
Our numerical experiments show standard algorithms assuming knowledge of $S$ can fail catastrophically when $\|\theta^*\|\le S$ is not true, whereas our algorithms enjoy low regret.
- Score: 9.700635713828698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linear bandits have a wide variety of applications including recommendation
systems yet they make one strong assumption: the algorithms must know an upper
bound $S$ on the norm of the unknown parameter $\theta^*$ that governs the
reward generation. Such an assumption forces the practitioner to guess $S$
involved in the confidence bound, leaving no choice but to wish that
$\|\theta^*\|\le S$ is true to guarantee that the regret will be low. In this
paper, we propose novel algorithms that do not require such knowledge for the
first time. Specifically, we propose two algorithms and analyze their regret
bounds: one for the changing arm set setting and the other for the fixed arm
set setting. Our regret bound for the former shows that the price of not
knowing $S$ does not affect the leading term in the regret bound and inflates
only the lower order term. For the latter, we do not pay any price in the
regret for not knowing $S$. Our numerical experiments show standard algorithms
assuming knowledge of $S$ can fail catastrophically when $\|\theta^*\|\le S$ is
not true whereas our algorithms enjoy low regret.
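For context, the norm bound $S$ enters standard algorithms such as OFUL/LinUCB only through the radius of the confidence ellipsoid around the ridge estimate of $\theta^*$. The NumPy sketch below is a minimal illustration of that dependence, not the paper's method; the helper names (`oful_radius`, `ucb_arm`) and the toy numbers are hypothetical, and $\lambda$, $\sigma$, $L$ denote the usual regularizer, noise level, and arm-norm bound. If the value passed as $S$ is smaller than $\|\theta^*\|$, the ellipsoid may fail to contain $\theta^*$, which is exactly the failure mode described in the abstract.
```python
import numpy as np

def oful_radius(t, d, S, sigma=1.0, lam=1.0, delta=0.05, L=1.0):
    """Width of the OFUL/LinUCB confidence ellipsoid (Abbasi-Yadkori et al., 2011).

    The assumed norm bound S enters only through the additive sqrt(lam) * S term:
    guessing S too small can exclude theta* from the confidence set (breaking the
    regret guarantee), while guessing it too large inflates exploration.
    """
    return (sigma * np.sqrt(2 * np.log(1.0 / delta)
                            + d * np.log(1.0 + t * L**2 / (lam * d)))
            + np.sqrt(lam) * S)

def ucb_arm(arms, theta_hat, V, beta):
    """Return the index of the arm with the largest optimistic reward estimate."""
    V_inv = np.linalg.inv(V)
    scores = [a @ theta_hat + beta * np.sqrt(a @ V_inv @ a) for a in arms]
    return int(np.argmax(scores))

# Toy usage (hypothetical numbers): d = 2, three fixed arms.
d, t = 2, 10
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
V = np.eye(d) + 0.1 * t * np.eye(d)   # stand-in for lam*I + sum_s a_s a_s^T
theta_hat = np.array([0.3, 0.5])      # stand-in for the ridge estimate of theta*
beta = oful_radius(t, d, S=1.0)       # requires a guess for S
print(ucb_arm(arms, theta_hat, V, beta))
```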
Related papers
- Near-optimal Per-Action Regret Bounds for Sleeping Bandits [8.261117235807607]
We derive near-optimal per-action regret bounds for sleeping bandits.
In a setting with $K$ total arms and at most $A$ available arms in each round over $T$ rounds, the best known upper bound is $O(K\sqrt{TA\ln K})$.
We extend our results to the setting of bandits with advice from sleeping experts, generalizing EXP4 along the way.
arXiv Detail & Related papers (2024-03-02T21:22:46Z) - Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery [45.601316850669406]
We present efficient algorithms for policy evaluation, best policy identification and regret minimization.
For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal.
All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix.
arXiv Detail & Related papers (2024-02-24T06:36:08Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2|\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Private Online Prediction from Experts: Separations and Faster Rates [74.52487417350221]
Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints.
We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries.
arXiv Detail & Related papers (2022-10-24T18:40:19Z) - Variance-Aware Sparse Linear Bandits [64.70681598741417]
Worst-case minimax regret for sparse linear bandits is $\widetilde{\Theta}(\sqrt{dT})$.
In the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve an $\widetilde{\mathcal{O}}(1)$ regret.
We develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits.
arXiv Detail & Related papers (2022-05-26T15:55:44Z) - Corralling a Larger Band of Bandits: A Case Study on Switching Regret
for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$ as long
arXiv Detail & Related papers (2022-02-12T21:55:44Z) - Non-Stationary Dueling Bandits [8.298716599039501]
We study the non-stationary dueling bandits problem with $K$ arms, where the time horizon $T$ consists of $M$ stationary segments.
To minimize the accumulated regret, the learner needs to pick the Condorcet winner of each stationary segment as often as possible.
We show a regret bound for the non-stationary case, without requiring knowledge of $M$ or $T$.
arXiv Detail & Related papers (2022-02-02T09:57:35Z) - Logarithmic Regret from Sublinear Hints [76.87432703516942]
We show that an algorithm can obtain $O(\log T)$ regret with just $O(\sqrt{T})$ hints under a natural query model.
We also show that $o(\sqrt{T})$ hints cannot guarantee better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2021-11-09T16:50:18Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic-pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Low-Rank Generalized Linear Bandit Problems [19.052901817304743]
In a low-rank linear bandit problem, the reward of an action is the inner product between the action and an unknown low-rank matrix $\Theta^*$.
We propose an algorithm based on a novel combination of online-to-confidence-set conversion (Abbasi-Yadkori et al., 2012) and the exponentially weighted average forecaster constructed by a covering of low-rank matrices.
arXiv Detail & Related papers (2020-06-04T15:30:14Z) - Problem-Complexity Adaptive Model Selection for Stochastic Linear
Bandits [20.207989166682832]
We consider the problem of model selection for two popular linear bandit settings.
In the first setting, we consider the $K$ armed mixture bandits, where the mean reward of arm $i \in [K]$ is $\mu_i + \langle \alpha_{i,t}, \theta^* \rangle$.
We show that ALB achieves regret scaling of $O(\|\theta^*\|\sqrt{T})$, where $\|\theta^*\|$ is a priori unknown.
arXiv Detail & Related papers (2020-06-04T02:19:00Z)