Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits
- URL: http://arxiv.org/abs/2110.08057v1
- Date: Fri, 15 Oct 2021 12:32:33 GMT
- Title: Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits
- Authors: Zihan Zhang, Xiangyang Ji, Yuan Zhou
- Abstract summary: We study the optimal batch-regret tradeoff for batch linear contextual bandits.
We prove its regret guarantee, which features a two-phase expression as the time horizon $T$ grows.
We also prove a new matrix concentration inequality with dependence on the matrices' dynamic upper bounds.
- Score: 45.43968161616453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the optimal batch-regret tradeoff for batch linear contextual
bandits. For any batch number $M$, number of actions $K$, time horizon $T$, and
dimension $d$, we provide an algorithm and prove its regret guarantee, which,
due to technical reasons, features a two-phase expression as the time horizon
$T$ grows. We also prove a lower bound theorem that surprisingly shows the
optimality of our two-phase regret upper bound (up to logarithmic factors) in
the \emph{full range} of the problem parameters, therefore establishing the
exact batch-regret tradeoff.
Compared to the recent work \citep{ruan2020linear}, which showed that $M =
O(\log \log T)$ batches suffice to achieve the asymptotically minimax-optimal
regret without the batch constraints, our algorithm is simpler and easier to
implement in practice. Furthermore, our algorithm achieves the optimal regret
for all $T \geq d$, while \citep{ruan2020linear} requires $T$ to be greater
than an unrealistically large polynomial of $d$.
In the course of our analysis, we also prove a new concentration inequality
for matrices, with dependence on their dynamic upper bounds, which, to the
best of our knowledge, is the first of its kind in the literature and may be
of independent interest.
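To make the batch constraint concrete: in a batched linear contextual bandit, data may be collected every round, but the policy can only be recomputed at $M$ pre-chosen boundaries. The sketch below is a minimal, hedged caricature of that protocol with a LinUCB-style score, not the authors' algorithm; the geometric grid and the `contexts`/`reward` callables are illustrative assumptions.

```python
import numpy as np

def batched_linucb(contexts, reward, T, d, M, lam=1.0, beta=1.0):
    """Illustrative batched LinUCB-style loop (a sketch, NOT the paper's
    algorithm): the ridge estimate theta_hat and the confidence matrix used
    by the policy may change only at M pre-chosen batch boundaries."""
    # Geometric batch grid, a common choice in the batched-bandit literature.
    grid = {int(x) for x in np.round(T ** (np.arange(1, M + 1) / M))}
    V = lam * np.eye(d)             # running regularized design matrix
    b = np.zeros(d)                 # running sum of x_t * r_t
    V_inv_frozen = np.eye(d) / lam  # covariance snapshot used by the policy
    theta_hat = np.zeros(d)
    for t in range(1, T + 1):
        X = contexts(t)             # shape (K, d): one feature row per action
        # Scores use only information frozen at the last batch boundary.
        width = np.sqrt(np.einsum('kd,de,ke->k', X, V_inv_frozen, X))
        a = int(np.argmax(X @ theta_hat + beta * width))
        r = reward(t, a)
        V += np.outer(X[a], X[a])   # data still accumulates every round...
        b += r * X[a]
        if t in grid:               # ...but the policy updates only here
            V_inv_frozen = np.linalg.inv(V)
            theta_hat = np.linalg.solve(V, b)
    return theta_hat
```

The point of the sketch is purely structural: `theta_hat` and `V_inv_frozen` change at most $M$ times, which is exactly the resource the batch-regret tradeoff prices.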
Related papers
- Optimal Batched Linear Bandits [9.795531604467406]
We introduce the E$^4$ algorithm for the batched linear bandit problem, incorporating an Explore-Estimate-Exploit framework.
With a proper choice of exploration rate, we prove E$^4$ achieves finite-time minimax optimal regret with only $O(\log\log T)$ batches.
We show that with another choice of exploration rate E$^4$ achieves an instance-dependent regret bound requiring at most $O(\log T)$ batches.
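As a rough illustration of the Explore-Estimate-Exploit pattern named above, here is a hedged two-batch sketch (not the E$^4$ algorithm itself, whose exploration uses an optimal design and whose batch lengths are tuned; `pull` and `n0` are illustrative names):

```python
import numpy as np

def explore_estimate_exploit(arms, pull, T, n0):
    """Minimal sketch of an Explore-Estimate-Exploit schedule (illustrative
    only). arms: (K, d) fixed action features; pull(a) returns a reward."""
    K, d = arms.shape
    # --- Explore (first batch): pull every arm n0 times. ---
    G, b, t = np.zeros((d, d)), np.zeros(d), 0
    for a in range(K):
        for _ in range(n0):
            b += pull(a) * arms[a]
            t += 1
        G += n0 * np.outer(arms[a], arms[a])
    # --- Estimate: least-squares estimate of the reward parameter. ---
    theta = np.linalg.solve(G + 1e-6 * np.eye(d), b)
    # --- Exploit (second batch): commit to the empirically best arm. ---
    best = int(np.argmax(arms @ theta))
    return sum(pull(best) for _ in range(T - t))
```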
arXiv Detail & Related papers (2024-06-06T14:57:52Z)
- LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits [38.41164102066483]
This study considers the linear contextual bandit problem with independent and identically distributed contexts.
Our proposed algorithm is based on Follow-The-Regularized-Leader (FTRL) with the Tsallis entropy and is referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.
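The Tsallis-entropy FTRL step underlying Tsallis-INF-style methods has a convenient closed form for $\alpha = 1/2$: the arm probabilities are $p_i = 4/(\eta (L_i - x))^2$ for a normalizer $x < \min_i L_i$ chosen so the probabilities sum to one. The sketch below computes that generic step only (the paper's LC-Tsallis-INF layers linear contexts on top; the bisection bracketing constants are illustrative):

```python
import numpy as np

def tsallis_inf_distribution(L, eta, iters=60):
    """Generic alpha = 1/2 Tsallis-entropy FTRL step (a hedged sketch of the
    ingredient behind Tsallis-INF-style algorithms, not LC-Tsallis-INF
    itself). L: cumulative estimated losses; eta: learning rate."""
    L = np.asarray(L, dtype=float)
    K = len(L)
    # Bracket the normalizer x: sum(p) <= 1 at lo and sum(p) >= 1 at hi.
    lo = L.min() - 2.0 * np.sqrt(K) / eta
    hi = L.min() - 2.0 / eta
    for _ in range(iters):          # bisection on the scalar normalizer
        x = 0.5 * (lo + hi)
        if np.sum(4.0 / (eta * (L - x)) ** 2) < 1.0:
            lo = x
        else:
            hi = x
    p = 4.0 / (eta * (L - x)) ** 2
    return p / p.sum()              # tiny renormalization for safety
```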
arXiv Detail & Related papers (2024-03-05T18:59:47Z)
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- Parameter-free Regret in High Probability with Heavy Tails [45.11667443216861]
We present new algorithms for online convex optimization over unbounded domains.
We obtain parameter-free regret in high probability, given access only to potentially heavy-tailed subgradient estimates.
arXiv Detail & Related papers (2022-10-25T21:43:44Z)
- Nearly Minimax Algorithms for Linear Bandits with Shared Representation [86.79657561369397]
We consider the setting where we play $M$ linear bandits with dimension $d$, each for $T$ rounds, and these $M$ bandit tasks share a common $k$ ($\ll d$) dimensional linear representation.
We come up with novel algorithms that achieve $\widetilde{O}\left(d\sqrt{kMT} + kM\sqrt{T}\right)$ regret bounds, which match the known minimax regret lower bound up to logarithmic factors.
arXiv Detail & Related papers (2022-03-29T15:27:13Z)
- Gaussian Process Bandit Optimization with Few Batches [49.896920704012395]
We introduce a batch algorithm inspired by finite-arm bandit algorithms.
We show that it achieves the cumulative regret upper bound $O^\ast(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$.
In addition, we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches.
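A hedged sketch of the elimination-style batching idea for GP bandits follows (not the paper's exact algorithm; its batch lengths and confidence widths are chosen far more carefully). It exploits a standard structural fact: the GP posterior variance depends only on where queries were made, not on the observed rewards, so uncertainty keeps shrinking inside a batch even though rewards are consumed only at batch boundaries.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def batched_gp_elimination(cands, f, T, batch_ends, beta=2.0):
    """Hedged sketch of a batched GP bandit with elimination. cands: (N, d)
    candidate points; f: black-box objective; batch_ends: set of rounds at
    which reward feedback may be used. Kernel and constants are illustrative."""
    cands = np.asarray(cands, dtype=float)
    live = np.arange(len(cands))               # not-yet-eliminated points
    sd = np.ones(len(cands))
    X_obs, y_obs = [], []
    for t in range(1, T + 1):
        i = live[int(np.argmax(sd[live]))]     # max-uncertainty query
        X_obs.append(cands[i]); y_obs.append(f(cands[i]))
        # Variance-only update: posterior std ignores the targets, so fit on
        # dummy zeros -- no reward feedback is actually "used" here.
        gp = GaussianProcessRegressor(kernel=RBF(1.0), alpha=1e-2)
        gp.fit(np.array(X_obs), np.zeros(len(X_obs)))
        _, sd = gp.predict(cands, return_std=True)
        if t in batch_ends:                    # batch boundary: use rewards
            gp.fit(np.array(X_obs), np.array(y_obs))
            mu, sd = gp.predict(cands, return_std=True)
            ucb = mu + np.sqrt(beta) * sd
            lcb = mu - np.sqrt(beta) * sd
            live = live[ucb[live] >= lcb[live].max()]  # keep plausible winners
    return cands[live]
```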
arXiv Detail & Related papers (2021-10-15T00:54:04Z)
- Regret Lower Bound and Optimal Algorithm for High-Dimensional Contextual Linear Bandit [10.604939762790517]
We prove a minimax lower bound, $\mathcal{O}\big((\log d)^{\frac{\alpha+1}{2}}T^{\frac{1-\alpha}{2}}+\log T\big)$, for the cumulative regret.
Second, we propose a simple and computationally efficient algorithm inspired by the general Upper Confidence Bound (UCB) strategy.
arXiv Detail & Related papers (2021-09-23T19:35:38Z)
- Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support.
We establish non-asymptotic regret upper bounds scaling as $\mathcal{O}(\log d + \sqrt{T})$ in general, and as $\mathcal{O}(\log d + \log T)$ under the so-called margin condition.
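The estimation step is concrete enough to sketch. Below is a hedged illustration of a thresholded-Lasso estimate with a least-squares refit on the estimated support; `lam` and `tau` are illustrative placeholders for the paper's regularization and threshold schedules.

```python
import numpy as np
from sklearn.linear_model import Lasso

def thresholded_lasso_estimate(X, y, lam, tau):
    """Hedged sketch of thresholded-Lasso estimation: (1) a Lasso pass for a
    rough estimate, (2) hard-threshold to recover the sparse support,
    (3) an ordinary least-squares refit restricted to that support."""
    theta = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    support = np.flatnonzero(np.abs(theta) > tau)   # estimated support
    refit = np.zeros(X.shape[1])
    if support.size:
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        refit[support] = sol                        # OLS refit on support
    return refit, support
```

A bandit built on this estimate pulls the action maximizing its features' inner product with `refit`; the ambient dimension then enters the regret only through the sparse-estimation error, consistent with the $\log d$ dependence of the bounds above.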
arXiv Detail & Related papers (2020-10-22T19:14:37Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies whose expected cumulative reward is maximal while the expected cost of each policy stays below a given threshold.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
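The "optimistic pessimistic" idea admits a compact caricature: be optimistic about rewards, pessimistic about costs, and optimize over the actions that still look safe. The selection rule below is an illustrative sketch, not OPLB's actual construction (the paper works with policies, and its confidence radii differ); all names are assumptions.

```python
import numpy as np

def optimistic_pessimistic_action(X, theta_r, theta_c, V_inv, tau, beta=1.0):
    """Hedged sketch of optimistic-pessimistic selection for a linear bandit
    with a cost constraint. X: (K, d) action features; theta_r / theta_c:
    reward / cost parameter estimates; tau: cost budget."""
    width = np.sqrt(np.einsum('kd,de,ke->k', X, V_inv, X))
    opt_reward = X @ theta_r + beta * width   # optimism about reward
    pes_cost = X @ theta_c + beta * width     # pessimism about cost
    safe = np.flatnonzero(pes_cost <= tau)    # plausibly feasible actions
    if safe.size == 0:                        # fallback: least risky action
        return int(np.argmin(pes_cost))
    return int(safe[np.argmax(opt_reward[safe])])
```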
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.