Provably Efficient High-Dimensional Bandit Learning with Batched
Feedbacks
- URL: http://arxiv.org/abs/2311.13180v2
- Date: Fri, 24 Nov 2023 18:31:21 GMT
- Title: Provably Efficient High-Dimensional Bandit Learning with Batched
Feedbacks
- Authors: Jianqing Fan, Zhaoran Wang, Zhuoran Yang, and Chenlu Ye
- Abstract summary: We study high-dimensional multi-armed contextual bandits with batched feedback, where the $T$ steps of online interactions are divided into $L$ batches.
Specifically, each batch collects data according to a policy that depends on previous batches, and the rewards are revealed only at the end of the batch.
Our algorithm achieves regret bounds comparable to those in the fully sequential setting with only $\mathcal{O}(\log T)$ batches.
- Score: 93.00280593719513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study high-dimensional multi-armed contextual bandits with batched
feedback, where the $T$ steps of online interactions are divided into $L$
batches. Specifically, each batch collects data according to a policy that
depends on previous batches, and the rewards are revealed only at the end of the
batch. Such a feedback structure is popular in applications such as
personalized medicine and online advertising, where the online data often do
not arrive in a fully serial manner. We consider high-dimensional and linear
settings in which the reward function of the bandit model admits either a sparse
or low-rank structure, and we ask how small a number of batches is needed for
performance comparable to that with fully dynamic data, in which $L = T$. For these
settings, we design a provably sample-efficient algorithm that achieves a
$\tilde{\mathcal{O}}(s_0^2 \log^2 T)$ regret in the sparse case and a
$\tilde{\mathcal{O}}(r^2 \log^2 T)$ regret in the low-rank case, using only
$L = \mathcal{O}(\log T)$ batches. Here $s_0$ and $r$ are the sparsity and rank
of the reward parameter in the sparse and low-rank cases, respectively, and
$\tilde{\mathcal{O}}(\cdot)$ omits logarithmic factors involving the feature
dimensions. In other words, our algorithm achieves regret bounds comparable to
those in the fully sequential setting with only $\mathcal{O}(\log T)$ batches. Our
algorithm features a novel batch allocation method that adjusts the batch sizes
according to the estimation accuracy within each batch and the cumulative regret.
Furthermore, we conduct experiments with synthetic and real-world data to
validate our theory.
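The batched-feedback protocol described above can be sketched in a toy simulation. This is a hedged illustration, not the paper's algorithm: a simple doubling schedule stands in for the accuracy-driven batch allocation, and a ridge estimate stands in for the sparse and low-rank estimators; all function and variable names are illustrative.

```python
import numpy as np

def batched_linear_bandit(get_arms, theta_true, T, lam=1.0, noise=0.1, seed=0):
    """Toy batched linear bandit: act greedily within a batch, reveal rewards
    and refit only at batch ends, and double the batch size so that only
    O(log T) batches cover all T rounds. Ridge regression is a stand-in for
    the paper's sparse/low-rank estimators."""
    rng = np.random.default_rng(seed)
    d = theta_true.shape[0]
    A, b = lam * np.eye(d), np.zeros(d)    # ridge statistics
    w_hat = np.zeros(d)
    t, batch_size, regret, n_batches = 0, 2, 0.0, 0
    while t < T:
        pending = []                        # feedback withheld until batch end
        for _ in range(min(batch_size, T - t)):
            arms = get_arms(rng)            # K x d context matrix for this round
            a = int(np.argmax(arms @ w_hat))
            r = arms[a] @ theta_true + noise * rng.standard_normal()
            regret += np.max(arms @ theta_true) - arms[a] @ theta_true
            pending.append((arms[a], r))
            t += 1
        for x, r in pending:                # rewards revealed at batch end
            A += np.outer(x, x)
            b += r * x
        w_hat = np.linalg.solve(A, b)       # refit once per batch
        batch_size *= 2                     # geometric schedule -> O(log T) batches
        n_batches += 1
    return w_hat, regret, n_batches
```

With the doubling schedule, the number of batches grows logarithmically in the horizon, matching the $L = \mathcal{O}(\log T)$ regime the abstract refers to.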
Related papers
- Fast Rates for Bandit PAC Multiclass Classification [73.17969992976501]
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct.
Our main contribution is in designing a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem.
arXiv Detail & Related papers (2024-06-18T08:54:04Z) - Cooperative Thresholded Lasso for Sparse Linear Bandit [6.52540785559241]
We present a novel approach to address the multi-agent sparse contextual linear bandit problem.
It is the first algorithm that tackles row-wise distributed data in sparse linear bandits.
It is widely applicable to high-dimensional multi-agent problems where efficient feature extraction is critical for minimizing regret.
arXiv Detail & Related papers (2023-05-30T16:05:44Z) - Efficient List-Decodable Regression using Batches [33.300073775123835]
We begin the study of list-decodable linear regression using batches.
In this setting, only an $\alpha \in (0,1]$ fraction of the batches are genuine.
Each genuine batch contains $\ge n$ i.i.d. samples from a common unknown distribution, and the remaining batches may contain arbitrary or even adversarial samples.
arXiv Detail & Related papers (2022-11-23T07:08:00Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm that achieves a near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits [45.43968161616453]
We study the optimal batch-regret tradeoff for batch linear contextual bandits.
We prove its regret guarantee, which features a two-phase expression as the time horizon $T$ grows.
We also prove a new matrix inequality concentration with dependence on their dynamic upper bounds.
arXiv Detail & Related papers (2021-10-15T12:32:33Z) - Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support.
We establish non-asymptotic regret upper bounds scaling as $\mathcal{O}(\log d + \sqrt{T})$ in general, and as $\mathcal{O}(\log d + \log T)$ under the so-called margin condition.
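The estimate-then-threshold idea behind this family of methods can be illustrated with a small sketch. This is a generic two-step recipe under stated assumptions, not the paper's exact procedure: a Lasso fit via plain ISTA, hard-thresholding to estimate the sparse support, then least squares restricted to that support; all names are illustrative.

```python
import numpy as np

def thresholded_lasso_support(X, y, lam, tau, n_iter=300):
    """Sketch of support estimation for a sparse linear model:
    1) solve min_w 0.5/n ||y - Xw||^2 + lam ||w||_1 by ISTA,
    2) hard-threshold at tau to estimate the support,
    3) refit ordinary least squares on the estimated support."""
    n, d = X.shape
    w = np.zeros(d)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + 1e-12)  # 1/L for the smooth part
    for _ in range(n_iter):
        w -= step * (X.T @ (X @ w - y) / n)               # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold
    support = np.flatnonzero(np.abs(w) > tau)             # hard threshold
    w_refit = np.zeros(d)
    if support.size:
        w_refit[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return support, w_refit
```

The refit step removes the shrinkage bias of the Lasso on the recovered coordinates, which is why two-step estimators of this kind tend to track the true parameter more tightly than the raw Lasso fit.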
arXiv Detail & Related papers (2020-10-22T19:14:37Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - An Efficient Algorithm For Generalized Linear Bandit: Online Stochastic
Gradient Descent and Thompson Sampling [83.48992319018147]
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward.
A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant.
In this work, we show that online SGD can be applied to the generalized linear bandit problem.
The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information, achieves $\tilde{O}(\sqrt{T})$ regret with a total time complexity that
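The single-step SGD update mentioned in the summary can be sketched for the special case of a linear reward (the paper treats generalized linear models). This is a hedged toy, not the SGD-TS algorithm itself: a Gaussian perturbation of the running iterate plays the role of Thompson sampling, and one squared-loss gradient step replaces a full refit; all names are illustrative.

```python
import numpy as np

def sgd_ts_round(w, arms, theta_true, lr, sigma, rng, noise=0.1):
    """One round of a Thompson-sampling-style bandit with a single SGD step:
    perturb the current iterate, act greedily on the sample, observe a noisy
    linear reward, then take one gradient step on 0.5 * (x.w - r)^2."""
    w_sample = w + sigma * rng.standard_normal(w.shape)  # posterior-like sample
    a = int(np.argmax(arms @ w_sample))                  # greedy on the sample
    r = arms[a] @ theta_true + noise * rng.standard_normal()
    grad = (arms[a] @ w - r) * arms[a]                   # gradient of squared loss
    return w - lr * grad, a, r
```

Because each round touches only one context vector, the per-step time and memory cost is $O(d)$ regardless of how many rounds have elapsed, which is the constant-complexity point the summary makes.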
arXiv Detail & Related papers (2020-06-07T01:12:39Z) - Optimal Contextual Pricing and Extensions [32.152826900874075]
We give a poly-time algorithm for contextual pricing with $O(d \log \log T)$ regret, which matches the $\Omega(d \log \log T)$ lower bound up to the $d \log d$ additive factor.
These algorithms are based on a novel technique of bounding the value of the Steiner polynomial of a convex region at various scales.
arXiv Detail & Related papers (2020-03-03T18:46:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.