Context-lumpable stochastic bandits
- URL: http://arxiv.org/abs/2306.13053v2
- Date: Tue, 28 Nov 2023 00:53:55 GMT
- Title: Context-lumpable stochastic bandits
- Authors: Chung-Wei Lee, Qinghua Liu, Yasin Abbasi-Yadkori, Chi Jin, Tor Lattimore, Csaba Szepesvári
- Abstract summary: We consider a contextual bandit problem with $S$ contexts and $K$ actions.
We give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde O(r(S+K)/\epsilon^2)$ samples.
In the regret setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde O(\sqrt{r^3(S+K)T})$.
- Score: 49.024050919419366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider a contextual bandit problem with $S$ contexts and $K$ actions. In
each round $t=1,2,\dots$, the learner observes a random context and chooses an
action based on its past experience. The learner then observes a random reward
whose mean is a function of the context and the action for the round. Under the
assumption that the contexts can be lumped into $r\le \min\{S,K\}$ groups such
that the mean reward for the various actions is the same for any two contexts
that are in the same group, we give an algorithm that outputs an
$\epsilon$-optimal policy after using at most $\widetilde O(r(S+K)/\epsilon^2)$ samples with high probability and provide a matching
$\Omega(r(S+K)/\epsilon^2)$ lower bound. In the regret minimization setting, we
give an algorithm whose cumulative regret up to time $T$ is bounded by
$\widetilde O(\sqrt{r^3(S+K)T})$. To the best of our knowledge, we are the
first to show the near-optimal sample complexity in the PAC setting and
$\widetilde O(\sqrt{\mathrm{poly}(r)(S+K)T})$ minimax regret in the online setting for
this problem. We also show our algorithms can be applied to more general
low-rank bandits and get improved regret bounds in some scenarios.
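To make the lumpability assumption concrete, the following is a minimal sketch of such an environment in Python: an $S \times K$ mean-reward matrix with only $r$ distinct rows, one per context group. The uniform group assignment, Gaussian reward noise, and all names are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Illustrative context-lumpable bandit environment (a sketch, not the paper's
# construction): S contexts, K actions, and r context groups; any two contexts
# in the same group share the same mean-reward vector over actions.
S, K, r = 100, 20, 3
rng = np.random.default_rng(0)

group_of = rng.integers(r, size=S)        # lumping: context -> one of r groups (assumed random)
group_means = rng.uniform(size=(r, K))    # one mean-reward row per group
mu = group_means[group_of]                # S x K matrix with only r distinct rows

def pull(context: int, action: int) -> float:
    """Noisy reward whose mean depends only on the context's group (Gaussian noise assumed)."""
    return float(mu[context, action] + rng.normal(scale=0.1))

def suboptimality(policy: np.ndarray) -> float:
    """Average gap to the per-context optimal action; a policy is epsilon-optimal if this is <= epsilon."""
    return float(np.mean(mu.max(axis=1) - mu[np.arange(S), policy]))

print(suboptimality(mu.argmax(axis=1)))   # the fully informed greedy policy has gap 0.0
```

Treating every context separately would cost on the order of $SK/\epsilon^2$ samples, which is what the $\widetilde O(r(S+K)/\epsilon^2)$ bound improves upon when $r \ll \min\{S,K\}$.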
Related papers
- Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery [45.601316850669406]
We present efficient algorithms for policy evaluation, best policy identification and regret minimization.
For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal.
All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix (a generic sketch of this spectral step follows this entry).
arXiv Detail & Related papers (2024-02-24T06:36:08Z)
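As a rough illustration of the spectral first phase mentioned in the entry above, the sketch below estimates the left and right singular subspaces of a noisy low-rank reward matrix with a plain truncated SVD. It is a generic sketch under assumed dimensions and noise, not that paper's estimator or its two-to-infinity analysis.

```python
import numpy as np

# Generic spectral phase: recover the rank-r singular subspaces of a low-rank
# reward matrix from a noisy estimate (e.g., empirical means after uniform
# exploration). Dimensions and noise level below are assumptions.
rng = np.random.default_rng(1)
S, K, r = 50, 30, 3
M = rng.standard_normal((S, r)) @ rng.standard_normal((r, K))   # true rank-r reward matrix
M_hat = M + 0.1 * rng.standard_normal((S, K))                    # noisy entrywise estimate

U, _, Vt = np.linalg.svd(M_hat, full_matrices=False)
U_r, V_r = U[:, :r], Vt[:r, :].T          # estimated left / right singular subspaces

# Sanity check: compare projections onto the estimated and true left subspaces.
U_true = np.linalg.svd(M, full_matrices=False)[0][:, :r]
err = np.linalg.norm(U_r @ U_r.T - U_true @ U_true.T, 2)
print(f"left-subspace projection error (spectral norm): {err:.3f}")
```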
- First- and Second-Order Bounds for Adversarial Linear Contextual Bandits [22.367921675238318]
We consider the adversarial linear contextual bandit setting, which allows for the loss functions associated with each of the $K$ arms to change over time without restriction.
Since $V_T$ or $L_T^*$ may be significantly smaller than $T$, these improve over the worst-case regret whenever the environment is relatively benign.
arXiv Detail & Related papers (2023-05-01T14:00:15Z)
- Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde O(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z)
- Instance-optimal PAC Algorithms for Contextual Bandits [20.176752818200438]
In this work, we focus on the bandit problem in the $(\epsilon,\delta)$-PAC setting.
We show that no algorithm can simultaneously achieve minimax-optimal regret minimization and instance-dependent PAC guarantees for best-arm identification.
arXiv Detail & Related papers (2022-07-05T23:19:43Z)
- Corralling a Larger Band of Bandits: A Case Study on Switching Regret for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde O(\sqrt{MT})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$, as long as certain conditions are satisfied.
arXiv Detail & Related papers (2022-02-12T21:55:44Z)
- On Submodular Contextual Bandits [92.45432756301231]
We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function.
We show that our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy (a generic sketch of this weighting rule follows this entry).
arXiv Detail & Related papers (2021-12-03T21:42:33Z)
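For reference, the Inverse Gap Weighting rule mentioned in the entry above can be sketched as follows: each non-greedy action is played with probability inversely proportional to its estimated reward gap from the greedy action. The estimated rewards and the exploration parameter gamma below are placeholders; this is the standard IGW rule in its basic form, not that paper's full submodular algorithm.

```python
import numpy as np

# Standard Inverse Gap Weighting (IGW) over K actions: the greedy action keeps
# the leftover mass, and every other action is played with probability
# 1 / (K + gamma * gap), where gap is its estimated suboptimality.
def igw_distribution(f_hat: np.ndarray, gamma: float) -> np.ndarray:
    best = int(np.argmax(f_hat))
    K = len(f_hat)
    p = 1.0 / (K + gamma * (f_hat[best] - f_hat))   # small gap -> larger probability
    p[best] = 0.0
    p[best] = 1.0 - p.sum()                          # remaining mass goes to the greedy action
    return p

f_hat = np.array([0.9, 0.7, 0.2, 0.1])   # placeholder estimated rewards
print(igw_distribution(f_hat, gamma=10.0))
```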
- Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde O(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)