Non-stationary Bandits and Meta-Learning with a Small Set of Optimal
Arms
- URL: http://arxiv.org/abs/2202.13001v1
- Date: Fri, 25 Feb 2022 22:28:01 GMT
- Title: Non-stationary Bandits and Meta-Learning with a Small Set of Optimal
Arms
- Authors: MohammadJavad Azizi, Thang Duong, Yasin Abbasi-Yadkori, András
György, Claire Vernade, Mohammad Ghavamzadeh
- Abstract summary: We study a sequential decision problem where a learner faces a sequence of $K$-armed bandit tasks.
The adversary is constrained to choose the optimal arm of each task in a smaller (but unknown) subset of $M$ arms.
The task boundaries might be known (the bandit meta-learning setting) or unknown (the non-stationary bandit setting).
- Score: 30.024167992890916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a sequential decision problem where the learner faces a sequence of
$K$-armed stochastic bandit tasks. The tasks may be designed by an adversary,
but the adversary is constrained to choose the optimal arm of each task in a
smaller (but unknown) subset of $M$ arms. The task boundaries might be known
(the bandit meta-learning setting), or unknown (the non-stationary bandit
setting), and the number of tasks $N$ as well as the total number of rounds $T$
are known ($N$ could be unknown in the meta-learning setting). We design an
algorithm based on a reduction to bandit submodular maximization, and show that
its regret in both settings is smaller than the simple baseline of
$\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms
designed for non-stationary bandit problems. For the bandit meta-learning
problem with fixed task length $\tau$, we show that the regret of the algorithm
is bounded as $\tilde{O}(N\sqrt{M \tau}+N^{2/3})$. Under additional assumptions
on the identifiability of the optimal arms in each task, we show a bandit
meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M \tau}+N^{1/2})$
regret.
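To make the setting concrete, below is a minimal simulation sketch of the problem described in the abstract: a piecewise-stationary $K$-armed Bernoulli bandit whose per-task optimal arm is drawn from a hidden subset of $M$ arms, together with a naive baseline that simply restarts UCB1 at each known task boundary (the fixed-task-length meta-learning setting). This is not the paper's algorithm or its reduction to bandit submodular maximization; the constants, reward values, and the restart-UCB1 baseline are illustrative assumptions.

```python
import numpy as np

# Sketch only (assumed setup, not the paper's method): N tasks of length tau,
# K arms, and a hidden subset of M << K arms that contains every task's
# optimal arm. The baseline restarts UCB1 at each task boundary.

rng = np.random.default_rng(0)

K, M, N, tau = 20, 3, 50, 500                      # arms, hidden set size, tasks, task length
hidden_set = rng.choice(K, size=M, replace=False)  # unknown to the learner

def make_task():
    """Mean rewards for one task: the optimal arm lies in the hidden subset."""
    means = rng.uniform(0.0, 0.4, size=K)
    means[rng.choice(hidden_set)] = 0.8            # plant the optimal arm
    return means

def ucb1(means, horizon):
    """Run UCB1 from scratch for `horizon` rounds; return the pseudo-regret."""
    counts = np.zeros(K)
    sums = np.zeros(K)
    regret = 0.0
    for t in range(horizon):
        if t < K:                                  # pull each arm once first
            arm = t
        else:
            bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        reward = rng.binomial(1, means[arm])
        counts[arm] += 1
        sums[arm] += reward
        regret += means.max() - means[arm]
    return regret

total_regret = sum(ucb1(make_task(), tau) for _ in range(N))
print(f"restart-UCB1 baseline regret over N*tau = {N * tau} rounds: {total_regret:.1f}")
```

For fixed task length $\tau$ (so $T = N\tau$), the baseline's $\tilde{O}(\sqrt{KNT})$ bound equals $\tilde{O}(N\sqrt{K\tau})$; the paper's $\tilde{O}(N\sqrt{M\tau}+N^{2/3})$ bound improves on it by exploiting the fact that only $M \ll K$ arms can ever be optimal, replacing the $\sqrt{K}$ dependence with $\sqrt{M}$ up to lower-order terms.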
Related papers
- Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery [45.601316850669406]
We present efficient algorithms for policy evaluation, best policy identification and regret minimization.
For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal.
All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix.
arXiv Detail & Related papers (2024-02-24T06:36:08Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Nearly Minimax Algorithms for Linear Bandits with Shared Representation [86.79657561369397]
We consider the setting where we play $M$ linear bandits with dimension $d$, each for $T$ rounds, and these $M$ bandit tasks share a common $k (\ll d)$ dimensional linear representation.
We come up with novel algorithms that achieve $\widetilde{O}\left(d\sqrt{kMT} + kM\sqrt{T}\right)$ regret bounds, which match the known minimax regret lower bound up to logarithmic factors.
arXiv Detail & Related papers (2022-03-29T15:27:13Z) - Meta-Learning for Simple Regret Minimization [34.75895755834204]
We develop a meta-learning framework for simple regret minimization in bandits.
We propose the first Bayesian and frequentist meta-learning algorithms for this setting.
arXiv Detail & Related papers (2022-02-25T18:56:54Z) - Corralling a Larger Band of Bandits: A Case Study on Switching Regret
for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{dST})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$ as long
arXiv Detail & Related papers (2022-02-12T21:55:44Z) - No Regrets for Learning the Prior in Bandits [30.478188057004175]
AdaTS is a Thompson sampling algorithm that adapts sequentially to bandit tasks.
AdaTS is a fully-Bayesian algorithm that can be implemented efficiently in several classes of bandit problems.
arXiv Detail & Related papers (2021-07-13T15:51:32Z) - Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z) - Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z) - Model Selection in Contextual Stochastic Bandit Problems [51.94632035240787]
We develop a meta-algorithm that selects between base algorithms.
We show through a lower bound that even when one of the base algorithms has $O(\sqrt{T})$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2020-03-03T18:46:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.