Regret Bound Balancing and Elimination for Model Selection in Bandits
and RL
- URL: http://arxiv.org/abs/2012.13045v1
- Date: Thu, 24 Dec 2020 00:53:42 GMT
- Title: Regret Bound Balancing and Elimination for Model Selection in Bandits
and RL
- Authors: Aldo Pacchiano, Christoph Dann, Claudio Gentile, Peter Bartlett
- Abstract summary: We propose a simple model selection approach for algorithms in bandit and reinforcement learning problems.
We prove that the total regret of this approach is bounded by the best valid candidate regret bound times a multiplicative factor.
Unlike recent efforts in model selection for linear bandits, our approach is versatile enough to also cover cases where the context information is generated by an adversarial environment.
- Score: 34.15978290083909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a simple model selection approach for algorithms in stochastic
bandit and reinforcement learning problems. As opposed to prior work that
(implicitly) assumes knowledge of the optimal regret, we only require that each
base algorithm comes with a candidate regret bound that may or may not hold
during all rounds. In each round, our approach plays a base algorithm to keep
the candidate regret bounds of all remaining base algorithms balanced, and
eliminates algorithms that violate their candidate bound. We prove that the
total regret of this approach is bounded by the best valid candidate regret
bound times a multiplicative factor. This factor is reasonably small in several
applications, including linear bandits and MDPs with nested function classes,
linear bandits with unknown misspecification, and LinUCB applied to linear
bandits with different confidence parameters. We further show that, under a
suitable gap-assumption, this factor only scales with the number of base
algorithms and not their complexity when the number of rounds is large enough.
Finally, unlike recent efforts in model selection for linear stochastic
bandits, our approach is versatile enough to also cover cases where the context
information is generated by an adversarial environment, rather than a
stochastic one.
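As a rough illustration of the balancing-and-elimination loop described in the abstract, here is a minimal Python sketch. The `BaseAlgorithm` interface, the square-root form of the candidate bounds, and the Hoeffding-style confidence radius are assumptions made for this example; the sketch is not the paper's exact algorithm or analysis.

```python
# Minimal sketch of regret-bound balancing and elimination (illustrative only).
# Assumptions: rewards in [0, 1], candidate bounds of the form R_i(n) = scale_i * sqrt(n),
# and a hypothetical BaseAlgorithm interface; none of this is taken from the paper verbatim.
import math


class BaseAlgorithm:
    """Hypothetical base learner carrying a candidate regret bound."""

    def __init__(self, scale):
        self.scale = scale

    def candidate_bound(self, n_plays):
        # Putative regret after n_plays rounds of being played (assumed sqrt form).
        return self.scale * math.sqrt(n_plays)

    def select_action(self):
        return 0  # placeholder: a real learner would pick an arm / context action

    def update(self, action, reward):
        pass  # placeholder: a real learner would update its internal state


def balance_and_eliminate(base_algs, env_step, horizon, delta=0.05):
    """Play for `horizon` rounds; env_step(action) returns a reward in [0, 1]."""
    active = set(range(len(base_algs)))
    n = [0] * len(base_algs)       # number of rounds each base learner was played
    cum = [0.0] * len(base_algs)   # cumulative reward each base learner collected

    def radius(i):
        # Hoeffding-style confidence radius (an assumed choice for this sketch).
        return math.sqrt(2.0 * math.log(len(base_algs) * horizon / delta) / max(n[i], 1))

    for _ in range(horizon):
        # Balancing: play the active learner whose accrued candidate bound is smallest,
        # keeping the candidate regret bounds of the remaining learners roughly equal.
        i = min(active, key=lambda j: base_algs[j].candidate_bound(n[j]))
        action = base_algs[i].select_action()
        reward = env_step(action)
        base_algs[i].update(action, reward)
        n[i] += 1
        cum[i] += reward

        # Elimination: a learner whose empirical mean reward, even after adding its
        # per-round candidate regret and a confidence radius, falls below the best
        # pessimistic mean of another learner has violated its candidate bound.
        played = [j for j in active if n[j] > 0]
        if len(played) > 1:
            best_lower = max(cum[j] / n[j] - radius(j) for j in played)
            for j in list(active):
                if n[j] == 0:
                    continue
                upper = cum[j] / n[j] + base_algs[j].candidate_bound(n[j]) / n[j] + radius(j)
                if upper < best_lower:
                    active.discard(j)
    return active, n, cum
```

For example, running this loop with base_algs = [BaseAlgorithm(1.0), BaseAlgorithm(4.0)] and an env_step returning Bernoulli rewards would concentrate plays on whichever learner's (smaller) candidate bound stays consistent with its observed rewards, which is the behaviour the regret guarantee in the abstract refers to.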
Related papers
- Semi-Bandit Learning for Monotone Stochastic Optimization [20.776114616154242]
We provide a generic online learning algorithm for a class of "monotone" problems.
Our framework applies to several fundamental problems in optimization, such as the prophet inequality, Pandora's box, knapsack, matchings, and submodular optimization.
arXiv Detail & Related papers (2023-12-24T07:46:37Z)
- Universal and data-adaptive algorithms for model selection in linear contextual bandits [52.47796554359261]
We consider the simplest non-trivial instance of model-selection: distinguishing a simple multi-armed bandit problem from a linear contextual bandit problem.
We introduce new algorithms that explore in a data-adaptive manner and provide guarantees of the form $\mathcal{O}(d^{\alpha} T^{1-\alpha})$.
Our approach extends to model selection among nested linear contextual bandits under some additional assumptions.
arXiv Detail & Related papers (2021-11-08T18:05:35Z)
- Optimal Gradient-based Algorithms for Non-concave Bandit Optimization [76.57464214864756]
This work considers a large family of bandit problems where the unknown underlying reward function is non-concave.
Our algorithms are based on a unified zeroth-order optimization paradigm that applies in great generality.
We show that the standard optimistic algorithms are sub-optimal by dimension factors.
arXiv Detail & Related papers (2021-07-09T16:04:24Z)
- Upper Confidence Bounds for Combining Stochastic Bandits [52.10197476419621]
We provide a simple method to combine bandit algorithms.
Our approach is based on a "meta-UCB" procedure that treats each of $N$ individual bandit algorithms as arms in a higher-level $N$-armed bandit problem.
arXiv Detail & Related papers (2020-12-24T05:36:29Z)
- Online Model Selection for Reinforcement Learning with Function Approximation [50.008542459050155]
We present a meta-algorithm that adapts to the optimal complexity with $\tilde{O}(L^{5/6} T^{2/3})$ regret.
We also show that the meta-algorithm automatically admits significantly improved instance-dependent regret bounds.
arXiv Detail & Related papers (2020-11-19T10:00:54Z)
- Model Selection in Contextual Stochastic Bandit Problems [51.94632035240787]
We develop a meta-algorithm that selects between base algorithms.
We show through a lower bound that even when one of the base algorithms has $O(\sqrt{T})$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2020-03-03T18:46:34Z)
- Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z)
- Bandit algorithms to emulate human decision making using probabilistic distortions [20.422725678982726]
We formulate two stochastic multi-armed bandit problems with distorted probabilities on the reward distributions.
We consider these problems in both the regret minimization and the best-arm identification frameworks for multi-armed bandits.
arXiv Detail & Related papers (2016-11-30T17:37:51Z)