Finite-Time Analysis of Round-Robin Kullback-Leibler Upper Confidence
Bounds for Optimal Adaptive Allocation with Multiple Plays and Markovian
Rewards
- URL: http://arxiv.org/abs/2001.11201v2
- Date: Mon, 13 Jul 2020 18:28:37 GMT
- Title: Finite-Time Analysis of Round-Robin Kullback-Leibler Upper Confidence
Bounds for Optimal Adaptive Allocation with Multiple Plays and Markovian
Rewards
- Authors: Vrettos Moulos
- Abstract summary: We study an extension of the classic multi-armed bandit problem which involves multiple plays and Markovian rewards in the rested bandits setting.
In order to tackle this problem we consider an adaptive allocation rule which at each stage combines the information from the sample means of all the arms with the Kullback-Leibler upper confidence bound of a single arm, which is selected in a round-robin way.
- Score: 10.66048003460524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study an extension of the classic stochastic multi-armed bandit problem
which involves multiple plays and Markovian rewards in the rested bandits
setting. In order to tackle this problem we consider an adaptive allocation
rule which at each stage combines the information from the sample means of all
the arms, with the Kullback-Leibler upper confidence bound of a single arm
which is selected in a round-robin way. For rewards generated from a
one-parameter exponential family of Markov chains, we provide a finite-time
upper bound for the regret incurred from this adaptive allocation rule, which
reveals the logarithmic dependence of the regret on the time horizon, and which
is asymptotically optimal. For our analysis we devise several concentration
results for Markov chains, including a maximal inequality for Markov chains,
that may be of interest in their own right. As a byproduct of our analysis we
also establish asymptotically optimal, finite-time guarantees for the case of
multiple plays, and i.i.d. rewards drawn from a one-parameter exponential
family of probability densities. Additionally, we provide simulation results
that illustrate that calculating Kullback-Leibler upper confidence bounds in a
round-robin way is significantly more efficient than calculating them for
every arm at each round, and that the expected regrets of those two approaches
behave similarly.
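The allocation rule is only described in prose above, so the following is a minimal, hypothetical Python sketch of the round-robin idea. It assumes Bernoulli rewards (a special case of a one-parameter exponential family, i.i.d. rather than Markovian), uses the standard KL-UCB index U_a(t) = max{ q : N_a(t) KL(mean_a, q) <= log t } computed by bisection, and picks one plausible way of combining the single refreshed index with the sample means of the other arms: the round-robin arm displaces the weakest of the m empirical leaders when its index is larger. The names kl_ucb_index and round_robin_kl_ucb, the exploration threshold, and the displacement rule are illustrative assumptions, not the paper's exact specification; the point is that only one bisection is performed per round, which is the source of the computational savings mentioned in the abstract.

```python
import math
import random

def kl_bernoulli(p, q, eps=1e-12):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, tol=1e-6):
    """Largest q >= mean with pulls * kl(mean, q) <= log(t), found by bisection."""
    if pulls == 0:
        return 1.0
    threshold = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

def round_robin_kl_ucb(arms, m, horizon, seed=0):
    """Hypothetical round-robin KL-UCB with m plays per round: only the arm
    whose turn it is (cyclically) has its index recomputed; it displaces the
    weakest empirical leader whenever its index exceeds that leader's mean."""
    rng = random.Random(seed)
    K = len(arms)
    pulls, means = [0] * K, [0.0] * K
    regret = 0.0
    best_m_mean = sum(sorted(arms, reverse=True)[:m])

    def pull(a):
        r = 1.0 if rng.random() < arms[a] else 0.0
        pulls[a] += 1
        means[a] += (r - means[a]) / pulls[a]

    for a in range(K):          # initialization: one sample per arm
        pull(a)

    for t in range(K, horizon):
        challenger = t % K      # single KL-UCB computation per round
        ucb = kl_ucb_index(means[challenger], pulls[challenger], t)
        leaders = sorted(range(K), key=lambda a: means[a], reverse=True)[:m]
        if challenger not in leaders and ucb > means[leaders[-1]]:
            leaders[-1] = challenger
        for a in leaders:
            pull(a)
        regret += best_m_mean - sum(arms[a] for a in leaders)
    return regret

# Example: five Bernoulli arms, m = 2 plays per round.
print(round_robin_kl_ucb([0.1, 0.3, 0.5, 0.7, 0.9], m=2, horizon=5000))
```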
Related papers
- Continuous K-Max Bandits [54.21533414838677]
We study the $K$-Max multi-armed bandits problem with continuous outcome distributions and weak value-index feedback.
This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc.
Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds.
arXiv Detail & Related papers (2025-02-19T06:37:37Z)
- Optimal Multi-Objective Best Arm Identification with Fixed Confidence [62.36929749450298]
We consider a multi-armed bandit setting in which each arm yields an $M$-dimensional vector reward upon selection.
The end goal is to identify the best arm of every objective in the shortest (expected) time subject to an upper bound on the probability of error.
We propose an algorithm that uses the novel idea of surrogate proportions to sample the arms at each time step, eliminating the need to solve the max-min optimisation problem at each step.
arXiv Detail & Related papers (2025-01-23T12:28:09Z)
- Best Arm Identification with Minimal Regret [55.831935724659175]
The problem elegantly amalgamates regret minimization and best arm identification (BAI).
The agent's goal is to identify the best arm with a prescribed confidence level.
The Double KL-UCB algorithm achieves optimality as the confidence level tends to zero.
arXiv Detail & Related papers (2024-09-27T16:46:02Z)
- Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of various arms.
arXiv Detail & Related papers (2023-12-19T13:17:43Z)
- Stochastic-Constrained Stochastic Optimization with Markovian Data [2.1756081703276]
We study the setting where data samples are drawn from a Markov chain and thus are not independent and identically distributed.
We propose two variants of drift-plus-penalty; one is for the case when the mixing time of the underlying Markov chain is known.
Our algorithms apply to a more general setting of constrained online convex optimization where the sequence of constraint functions follows a Markov chain.
arXiv Detail & Related papers (2023-12-07T14:09:27Z)
- Beyond the Best: Estimating Distribution Functionals in Infinite-Armed Bandits [40.71199236098642]
In the infinite-armed bandit problem, each arm's average reward is sampled from an unknown distribution.
We consider a general class of distribution functionals beyond the maximum, and propose unified meta algorithms for both the offline and online settings.
arXiv Detail & Related papers (2022-11-01T18:20:10Z)
- Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits [88.21288104408556]
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family.
We propose a Thompson sampling algorithm, termed Expulli, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm.
arXiv Detail & Related papers (2022-06-07T18:08:21Z)
- From Optimality to Robustness: Dirichlet Sampling Strategies in Stochastic Bandits [0.0]
We study a generic Dirichlet Sampling (DS) algorithm, based on pairwise comparisons of empirical indices computed with re-sampling of the arms' observations.
We show that different variants of this strategy achieve provably optimal regret guarantees when the distributions are bounded and logarithmic regret for semi-bounded distributions with a mild quantile condition.
arXiv Detail & Related papers (2021-11-18T14:34:21Z)
- Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.