Meta-Learning for Simple Regret Minimization
- URL: http://arxiv.org/abs/2202.12888v2
- Date: Tue, 4 Jul 2023 20:01:11 GMT
- Title: Meta-Learning for Simple Regret Minimization
- Authors: Mohammadjavad Azizi, Branislav Kveton, Mohammad Ghavamzadeh, Sumeet
Katariya
- Abstract summary: We develop a meta-learning framework for simple regret in bandits.
We propose the first Bayesian and frequentist meta-learning algorithms for this setting.
- Score: 34.75895755834204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop a meta-learning framework for simple regret minimization in
bandits. In this framework, a learning agent interacts with a sequence of
bandit tasks, which are sampled i.i.d. from an unknown prior distribution, and
learns its meta-parameters to perform better on future tasks. We propose the
first Bayesian and frequentist meta-learning algorithms for this setting. The
Bayesian algorithm has access to a prior distribution over the meta-parameters
and its meta simple regret over $m$ bandit tasks with horizon $n$ is mere
$\tilde{O}(m / \sqrt{n})$. On the other hand, the meta simple regret of the
frequentist algorithm is $\tilde{O}(\sqrt{m} n + m/ \sqrt{n})$. While its
regret is worse, the frequentist algorithm is more general because it does not
need a prior distribution over the meta-parameters. It can also be analyzed in
more settings. We instantiate our algorithms for several classes of bandit
problems. Our algorithms are general and we complement our theory by evaluating
them empirically in several environments.
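For illustration, the following is a minimal sketch of the interaction protocol described in the abstract: an agent faces a sequence of bandit tasks drawn i.i.d. from an unknown prior, carries meta-level statistics across tasks, and is judged by simple regret (the gap between the best arm and the recommended arm). The Gaussian task prior, round-robin exploration, and running-average meta-update below are illustrative assumptions, not the paper's Bayesian or frequentist algorithms.

```python
# Minimal sketch of meta-learning for simple regret minimization (assumptions:
# K-armed Gaussian bandits whose arm means are drawn i.i.d. from
# N(mu_star, sigma_0^2) with an unknown meta-parameter mu_star, unit reward noise).
import numpy as np

rng = np.random.default_rng(0)
K, m, n = 5, 200, 100                       # arms, tasks, per-task horizon
sigma_0 = 1.0
mu_star = rng.normal(0.0, 1.0, K)           # hidden meta-parameter (per-arm prior mean)

meta_mean, meta_tasks = np.zeros(K), 0      # running estimate of mu_star
simple_regret = []

for task in range(m):
    theta = rng.normal(mu_star, sigma_0)    # task sampled i.i.d. from the prior
    counts, sums = np.zeros(K), np.zeros(K)
    for t in range(n):                      # explore with round-robin pulls
        a = t % K
        counts[a] += 1
        sums[a] += rng.normal(theta[a], 1.0)
    # Gaussian posterior mean per arm: learned prior (meta_mean, sigma_0^2)
    # combined with in-task evidence under unit observation noise.
    post = (meta_mean / sigma_0**2 + sums) / (1.0 / sigma_0**2 + counts)
    rec = int(np.argmax(post))              # recommended arm after the horizon
    simple_regret.append(theta.max() - theta[rec])
    # Meta-update: fold this task's empirical arm means into the prior estimate.
    meta_tasks += 1
    meta_mean += (sums / counts - meta_mean) / meta_tasks

print("average simple regret over tasks:", float(np.mean(simple_regret)))
```

As the learned prior approaches the hidden meta-parameter, later tasks tend to incur smaller simple regret, which is the effect the meta simple regret bounds quantify.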
Related papers
- Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery [45.601316850669406]
We present efficient algorithms for policy evaluation, best policy identification and regret minimization.
For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal.
All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix.
arXiv Detail & Related papers (2024-02-24T06:36:08Z) - Learning Arithmetic Formulas in the Presence of Noise: A General
Framework and Applications to Unsupervised Learning [4.10375234787249]
We present a framework for designing efficient algorithms for unsupervised learning problems.
Our framework is based on a meta algorithm that learns arithmetic circuits in the presence of noise.
arXiv Detail & Related papers (2023-11-13T12:26:25Z) - Revisiting Weighted Strategy for Non-stationary Parametric Bandits [82.1942459195896]
This paper revisits the weighted strategy for non-stationary parametric bandits.
We propose a refined analysis framework, which produces a simpler weight-based algorithm.
Our new framework can be used to improve regret bounds of other parametric bandits.
arXiv Detail & Related papers (2023-03-05T15:11:14Z) - Non-stationary Bandits and Meta-Learning with a Small Set of Optimal
Arms [30.024167992890916]
We study a decision problem where a learner faces a sequence of $K$-armed bandit tasks.
The adversary is constrained to choose the optimal arm of each task in a smaller (but unknown) subset of $M$ arms.
The task boundaries might be known (the bandit meta-learning setting) or unknown (the non-stationary bandit setting).
arXiv Detail & Related papers (2022-02-25T22:28:01Z) - Corralling a Larger Band of Bandits: A Case Study on Switching Regret
for Linear Bandits [99.86860277006318]
We consider the problem of combining and learning over a set of adversarial algorithms with the goal of adaptively tracking the best one on the fly.
The CORRAL algorithm of Agarwal et al. achieves this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$, where $M$ is the number of base algorithms and $T$ is the time horizon.
Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only logarithmic dependence on $M$, as long as certain conditions are satisfied.
arXiv Detail & Related papers (2022-02-12T21:55:44Z) - Gaussian Process Bandit Optimization with Few Batches [49.896920704012395]
We introduce a batch algorithm inspired by finite-arm bandit algorithms.
We show that it achieves the cumulative regret upper bound $O^\ast(\sqrt{T\gamma_T})$ using $O(\log\log T)$ batches within time horizon $T$.
In addition, we propose a modified version of our algorithm, and characterize how the regret is impacted by the number of batches.
arXiv Detail & Related papers (2021-10-15T00:54:04Z) - No Regrets for Learning the Prior in Bandits [30.478188057004175]
${\tt AdaTS}$ is a Thompson sampling algorithm that adapts sequentially to bandit tasks.
${\tt AdaTS}$ is a fully-Bayesian algorithm that can be implemented efficiently in several classes of bandit problems.
arXiv Detail & Related papers (2021-07-13T15:51:32Z) - Upper Confidence Bounds for Combining Stochastic Bandits [52.10197476419621]
We provide a simple method to combine bandit algorithms.
Our approach is based on a "meta-UCB" procedure that treats each of $N$ individual bandit algorithms as arms in a higher-level $N$-armed bandit problem; a minimal sketch of this idea appears after this list.
arXiv Detail & Related papers (2020-12-24T05:36:29Z) - Model Selection in Contextual Stochastic Bandit Problems [51.94632035240787]
We develop a meta-algorithm that selects between base algorithms.
We show through a lower bound that even when one of the base algorithms has $O(\sqrt{T})$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2020-03-03T18:46:34Z)
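To make the "meta-UCB" idea from "Upper Confidence Bounds for Combining Stochastic Bandits" concrete, here is a minimal sketch in which a UCB1 master treats $N$ base bandit algorithms as arms of a higher-level bandit. The epsilon-greedy base learners and all constants are illustrative assumptions, not the cited paper's construction.

```python
# Minimal sketch: a UCB1 "master" chooses among N base bandit algorithms,
# delegating each round to the chosen base and scoring bases by observed reward.
import math
import numpy as np

rng = np.random.default_rng(1)

class EpsGreedy:
    """A simple base bandit algorithm over K arms (illustrative stand-in)."""
    def __init__(self, K, eps):
        self.eps, self.counts, self.sums = eps, np.zeros(K), np.zeros(K)
    def act(self):
        if rng.random() < self.eps or self.counts.min() == 0:
            return int(rng.integers(len(self.counts)))
        return int(np.argmax(self.sums / self.counts))
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

K, T = 10, 5000
true_means = rng.uniform(0, 1, K)
bases = [EpsGreedy(K, eps) for eps in (0.01, 0.1, 0.3)]   # the N "arms" of the master
N = len(bases)
meta_counts, meta_sums = np.zeros(N), np.zeros(N)

for t in range(1, T + 1):
    if t <= N:                                 # play each base once, then use UCB1
        i = t - 1
    else:
        ucb = meta_sums / meta_counts + np.sqrt(2 * math.log(t) / meta_counts)
        i = int(np.argmax(ucb))
    arm = bases[i].act()                       # delegate the round to that base
    reward = float(rng.random() < true_means[arm])   # Bernoulli reward
    bases[i].update(arm, reward)
    meta_counts[i] += 1
    meta_sums[i] += reward                     # master scores bases by their rewards

print("meta-level pulls per base:", meta_counts)
```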