Leveraging Initial Hints for Free in Stochastic Linear Bandits
- URL: http://arxiv.org/abs/2203.04274v1
- Date: Tue, 8 Mar 2022 18:48:55 GMT
- Title: Leveraging Initial Hints for Free in Stochastic Linear Bandits
- Authors: Ashok Cutkosky, Chris Dann, Abhimanyu Das, Qiuyi (Richard) Zhang
- Abstract summary: We study the setting of optimizing with bandit feedback with additional prior knowledge provided to a learner in the form of an initial hint of the optimal action.
We present a novel algorithm for stochastic linear bandits that uses this hint to improve its regret to $\tilde O(\sqrt{T})$ when the hint is accurate.
Perhaps surprisingly, our work shows that leveraging a hint yields provable gains without sacrificing worst-case performance.
- Score: 38.58449930961883
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the setting of optimizing with bandit feedback with additional prior
knowledge provided to the learner in the form of an initial hint of the optimal
action. We present a novel algorithm for stochastic linear bandits that uses
this hint to improve its regret to $\tilde O(\sqrt{T})$ when the hint is
accurate, while maintaining a minimax-optimal $\tilde O(d\sqrt{T})$ regret
independent of the quality of the hint. Furthermore, we provide a Pareto
frontier of tight tradeoffs between best-case and worst-case regret, with
matching lower bounds. Perhaps surprisingly, our work shows that leveraging a
hint yields provable gains without sacrificing worst-case performance, implying
that our algorithm adapts to the quality of the hint for free. We also provide
an extension of our algorithm to the case of $m$ initial hints, showing that we
can achieve a $\tilde O(m^{2/3}\sqrt{T})$ regret.
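To give a rough, non-authoritative feel for how an initial hint can be folded into a standard linear-bandit learner, the toy sketch below biases a ridge-regression estimate toward a hint vector instead of toward zero: an accurate hint makes the estimate useful from the very first rounds, while the observed rewards eventually override an inaccurate one. The names and choices here (`theta_hint`, `lam`, the greedy action rule, the finite action set) are illustrative assumptions for the sketch, not the algorithm or notation of the paper.

```python
import numpy as np

# Toy simulation (not the paper's algorithm): a stochastic linear bandit whose
# ridge-regression estimate is shrunk toward a hint vector rather than toward zero.
rng = np.random.default_rng(0)
d, T, lam, noise = 5, 2000, 1.0, 0.1

theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)              # unknown true parameter
theta_hint = theta_star + 0.05 * rng.normal(size=d)   # an accurate hint; enlarge the perturbation to degrade it

actions = rng.normal(size=(50, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)  # fixed finite action set on the unit sphere
best_reward = (actions @ theta_star).max()

A = lam * np.eye(d)        # regularized Gram matrix
b = lam * theta_hint       # biases the ridge solution toward the hint
regret = 0.0
for t in range(T):
    theta_hat = np.linalg.solve(A, b)              # hint-biased ridge estimate
    x = actions[np.argmax(actions @ theta_hat)]    # act greedily w.r.t. the estimate
    reward = x @ theta_star + noise * rng.normal() # noisy linear reward
    A += np.outer(x, x)
    b += reward * x
    regret += best_reward - x @ theta_star

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

With an accurate hint the greedy choice locks onto a near-optimal arm almost immediately, which is the intuition behind a best-case $\tilde O(\sqrt{T})$ regret; the paper's contribution is showing that such gains can be obtained while simultaneously retaining the minimax-optimal $\tilde O(d\sqrt{T})$ worst-case guarantee.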
Related papers
- Incentive-compatible Bandits: Importance Weighting No More [14.344759978208957]
We study the problem of incentive-compatible online learning with bandit feedback.
In this work we propose the first incentive-compatible algorithms that enjoy $O(\sqrt{KT})$ regret bounds.
arXiv Detail & Related papers (2024-05-10T13:57:13Z)
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits [45.43968161616453]
We study the optimal batch-regret tradeoff for batch linear contextual bandits.
We prove a regret guarantee for the proposed algorithm, which features a two-phase expression as the time horizon $T$ grows.
We also prove a new matrix concentration inequality with dependence on the matrices' dynamic upper bounds.
arXiv Detail & Related papers (2021-10-15T12:32:33Z)
- Randomized Exploration for Reinforcement Learning with General Value Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
- Multinomial Logit Contextual Bandits: Provable Optimality and Practicality [15.533842336139063]
We consider a sequential assortment selection problem where the user choice is given by a multinomial logit (MNL) choice model whose parameters are unknown.
We propose upper confidence bound based algorithms for this MNL contextual bandit.
We show that a simple variant of the algorithm achieves the optimal regret for a broad class of important applications.
arXiv Detail & Related papers (2021-03-25T15:42:25Z)
- Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition [37.6975819766632]
We study the stochastic shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $\widetilde{O}(\sqrt{DT_\star K})$ for the full-information setting and $\widetilde{O}(\sqrt{DT_\star SAK})$ for the bandit feedback setting.
arXiv Detail & Related papers (2020-12-07T20:55:28Z)
- Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support.
We establish non-asymptotic regret upper bounds scaling as $\mathcal{O}(\log d + \sqrt{T})$ in general, and as $\mathcal{O}(\log d + \log T)$ under the so-called margin condition.
arXiv Detail & Related papers (2020-10-22T19:14:37Z)
- Efficient Optimistic Exploration in Linear-Quadratic Regulators via Lagrangian Relaxation [107.06364966905821]
We study the exploration-exploitation dilemma in the linear quadratic regulator (LQR) setting.
Inspired by the extended value iteration algorithm used in optimistic algorithms for finite MDPs, we propose to relax the optimistic optimization of OFU-LQ.
We show that an $\epsilon$-optimistic controller can be computed efficiently by solving at most $O\big(\log(1/\epsilon)\big)$ Riccati equations.
arXiv Detail & Related papers (2020-07-13T16:30:47Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)