Finite-Time Analysis of Round-Robin Kullback-Leibler Upper Confidence
Bounds for Optimal Adaptive Allocation with Multiple Plays and Markovian
Rewards
- URL: http://arxiv.org/abs/2001.11201v2
- Date: Mon, 13 Jul 2020 18:28:37 GMT
- Title: Finite-Time Analysis of Round-Robin Kullback-Leibler Upper Confidence
Bounds for Optimal Adaptive Allocation with Multiple Plays and Markovian
Rewards
- Authors: Vrettos Moulos
- Abstract summary: We study an extension of the classic multi-armed bandit problem which involves multiple plays and Markovian rewards in the rested bandits setting.
In order to tackle this problem we consider an adaptive allocation rule which at each stage combines the information from the sample means of all the arms, with the Kullback-Leibler upper confidence bound of a single arm which is selected in round-robin way.
- Score: 10.66048003460524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study an extension of the classic stochastic multi-armed bandit problem
which involves multiple plays and Markovian rewards in the rested bandits
setting. In order to tackle this problem we consider an adaptive allocation
rule which at each stage combines the information from the sample means of all
the arms, with the Kullback-Leibler upper confidence bound of a single arm
which is selected in round-robin way. For rewards generated from a
one-parameter exponential family of Markov chains, we provide a finite-time
upper bound for the regret incurred from this adaptive allocation rule, which
reveals the logarithmic dependence of the regret on the time horizon, and which
is asymptotically optimal. For our analysis we devise several concentration
results for Markov chains, including a maximal inequality for Markov chains,
that may be of interest in their own right. As a byproduct of our analysis we
also establish asymptotically optimal, finite-time guarantees for the case of
multiple plays, and i.i.d. rewards drawn from a one-parameter exponential
family of probability densities. Additionally, we provide simulation results
that illustrate that calculating Kullback-Leibler upper confidence bounds in a
round-robin way, is significantly more efficient than calculating them for
every arm at each round, and that the expected regrets of those two approaches
behave similarly.
Related papers
- Best Arm Identification with Minimal Regret [55.831935724659175]
Best arm identification problem elegantly amalgamates regret minimization and BAI.
Agent's goal is to identify the best arm with a prescribed confidence level.
Double KL-UCB algorithm achieves optimality as the confidence level tends to zero.
arXiv Detail & Related papers (2024-09-27T16:46:02Z) - Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms in it any round based on the observed empirical gaps between the rewards of various arms.
In particular, we present sred, a truly adaptive algorithm that can reject arms in it any round based on the observed empirical gaps between the rewards of various arms.
arXiv Detail & Related papers (2023-12-19T13:17:43Z) - Stochastic-Constrained Stochastic Optimization with Markovian Data [2.1756081703276]
We study the setting where data samples are drawn from a Markov chain and thus are not independent and identically distributed.
We propose two variants of drift-plus-penalty; one is for the case when the mixing time of the underlying Markov chain is known.
Our algorithms apply to a more general setting of constrained online convex optimization where the sequence of constraint functions follows a Markov chain.
arXiv Detail & Related papers (2023-12-07T14:09:27Z) - Pure Exploration under Mediators' Feedback [63.56002444692792]
Multi-armed bandits are a sequential-decision-making framework, where, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a and possibly unknown policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
arXiv Detail & Related papers (2023-08-29T18:18:21Z) - Beyond the Best: Estimating Distribution Functionals in Infinite-Armed
Bandits [40.71199236098642]
In the infinite-armed bandit problem, each arm's average reward is sampled from an unknown distribution.
We consider a general class of distribution functionals beyond the maximum, and propose unified meta algorithms for both the offline and online settings.
arXiv Detail & Related papers (2022-11-01T18:20:10Z) - Finite-Time Regret of Thompson Sampling Algorithms for Exponential
Family Multi-Armed Bandits [88.21288104408556]
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family.
We propose a Thompson sampling, termed Expulli, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm.
arXiv Detail & Related papers (2022-06-07T18:08:21Z) - From Optimality to Robustness: Dirichlet Sampling Strategies in
Stochastic Bandits [0.0]
We study a generic Dirichlet Sampling (DS) algorithm, based on pairwise comparisons of empirical indices computed with re-sampling of the arms' observations.
We show that different variants of this strategy achieve provably optimal regret guarantees when the distributions are bounded and logarithmic regret for semi-bounded distributions with a mild quantile condition.
arXiv Detail & Related papers (2021-11-18T14:34:21Z) - Mean-based Best Arm Identification in Stochastic Bandits under Reward
Contamination [80.53485617514707]
This paper proposes two algorithms, a gap-based algorithm and one based on the successive elimination, for best arm identification in sub-Gaussian bandits.
Specifically, for the gap-based algorithm, the sample complexity is optimal up to constant factors, while for the successive elimination, it is optimal up to logarithmic factors.
arXiv Detail & Related papers (2021-11-14T21:49:58Z) - Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
Agent sequentially selects actions and, from the resulting system trajectory, aims at the best as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z) - Combinatorial Blocking Bandits with Stochastic Delays [33.65025386998747]
Recent work has considered natural variations of the multi-armed bandit problem, where the reward of each arm is a special function of the time passed since its last pulling.
In this work, we extend the above model in two directions: (i) We consider the general setting where more than one arms can be played at each round, subject to feasibility constraints.
We provide a tight analysis of the approximation of a natural greedy subset that always plays the maximum expected reward feasible among the available (non-blocked) arms.
When the arms' expected rewards are unknown, we adapt the above algorithm into a bandit, based on
arXiv Detail & Related papers (2021-05-22T02:46:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.