Bypassing the Monster: A Faster and Simpler Optimal Algorithm for
Contextual Bandits under Realizability
- URL: http://arxiv.org/abs/2003.12699v5
- Date: Sat, 10 Jul 2021 04:40:21 GMT
- Title: Bypassing the Monster: A Faster and Simpler Optimal Algorithm for
Contextual Bandits under Realizability
- Authors: David Simchi-Levi, Yunzong Xu
- Abstract summary: We design a fast and simple algorithm that achieves the statistically optimal regret with only $O(log T)$ calls to an offline regression oracle.
Our results provide the first universal and optimal reduction from contextual bandits to offline regression.
- Score: 18.45278329799526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the general (stochastic) contextual bandit problem under the
realizability assumption, i.e., the expected reward, as a function of contexts
and actions, belongs to a general function class $\mathcal{F}$. We design a
fast and simple algorithm that achieves the statistically optimal regret with
only ${O}(\log T)$ calls to an offline regression oracle across all $T$ rounds.
The number of oracle calls can be further reduced to $O(\log\log T)$ if $T$ is
known in advance. Our results provide the first universal and optimal reduction
from contextual bandits to offline regression, solving an important open
problem in the contextual bandit literature. A direct consequence of our
results is that any advances in offline regression immediately translate to
contextual bandits, statistically and computationally. This leads to faster
algorithms and improved regret guarantees for broader classes of contextual
bandit problems.
Related papers
- Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear
Bandit Algorithms [39.70492757288025]
We address the contextual linear bandit problem, where a decision maker is provided a context.
We show that the contextual problem can be solved as a linear bandit problem.
Our results imply a $O(dsqrtTlog T)$ high-probability regret bound for contextual linear bandits.
arXiv Detail & Related papers (2022-11-08T22:18:53Z) - Optimal Contextual Bandits with Knapsacks under Realizibility via
Regression Oracles [14.634964681825197]
We study the contextual bandit with knapsacks (CBwK) problem, where each action leads to a random reward but also costs a random resource consumption in a vector form.
We propose the first universal and optimal algorithmic framework for CBwK by reducing it to online regression.
arXiv Detail & Related papers (2022-10-21T09:28:53Z) - Improved High-Probability Regret for Adversarial Bandits with
Time-Varying Feedback Graphs [62.52390282012508]
We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds.
We develop an algorithm that achieves the optimal regret $widetildemathcalO((sum_t=1Talpha_t)1/2+max_tin[T]alpha_t]$ with high probability.
We also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs.
arXiv Detail & Related papers (2022-10-04T04:36:15Z) - Corralling a Larger Band of Bandits: A Case Study on Switching Regret
for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL of Agarwal et al. achieves this goal with a regret overhead of order $widetildeO(sqrtd S T)$ where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only emphlogarithmic dependence on $M$ as long
arXiv Detail & Related papers (2022-02-12T21:55:44Z) - Efficient and Optimal Algorithms for Contextual Dueling Bandits under
Realizability [59.81339109121384]
We study the $K$ contextual dueling bandit problem, a sequential decision making setting in which the learner uses contextual information to make two decisions, but only observes emphpreference-based feedback suggesting that one decision was better than the other.
We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.
arXiv Detail & Related papers (2021-11-24T07:14:57Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB)
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic
Gradient Descent and Thompson Sampling [83.48992319018147]
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward.
A natural way to resolve this problem is to apply online gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant.
In this work, we show that online SGD can be applied to the generalized linear bandit problem.
The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information, achieves $tildeO(sqrtT)$ regret with the total time complexity that
arXiv Detail & Related papers (2020-06-07T01:12:39Z) - Beyond UCB: Optimal and Efficient Contextual Bandits with Regression
Oracles [112.89548995091182]
We provide the first universal and optimal reduction from contextual bandits to online regression.
Our algorithm requires no distributional assumptions beyond realizability, and works even when contexts are chosen adversarially.
arXiv Detail & Related papers (2020-02-12T11:33:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.