Combinatorial Bandits without Total Order for Arms
- URL: http://arxiv.org/abs/2103.02741v1
- Date: Wed, 3 Mar 2021 23:08:59 GMT
- Title: Combinatorial Bandits without Total Order for Arms
- Authors: Shuo Yang, Tongzheng Ren, Inderjit S. Dhillon, Sujay Sanghavi
- Abstract summary: We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$ gap-independent regret bound.
- Score: 52.93972547896022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the combinatorial bandits problem, where at each time step, the
online learner selects a size-$k$ subset $s$ from the arms set $\mathcal{A}$,
where $\left|\mathcal{A}\right| = n$, and observes a stochastic reward of each
arm in the selected set $s$. The goal of the online learner is to minimize the
regret, induced by not selecting $s^*$ which maximizes the expected total
reward. Specifically, we focus on a challenging setting where 1) the reward
distribution of an arm depends on the set $s$ it is part of, and crucially 2)
there is \textit{no total order} for the arms in $\mathcal{A}$.
In this paper, we formally present a reward model that captures set-dependent
reward distribution and assumes no total order for arms. Correspondingly, we
propose an Upper Confidence Bound (UCB) algorithm that maintains UCB for each
individual arm and selects the arms with top-$k$ UCB. We develop a novel regret
analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$
gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$
gap-independent regret bound. We also provide a lower bound for the proposed
reward model, which shows our proposed algorithm is near-optimal for any
constant $k$. Empirical results on various reward models demonstrate the broad
applicability of our algorithm.
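The algorithm described above is simple to state: maintain a UCB index for each individual arm and play the size-$k$ set with the top-$k$ indices. Below is a minimal Python sketch of that selection rule under semi-bandit feedback; the function name `topk_ucb` and the confidence radius $\sqrt{2\log t / n_i}$ are illustrative assumptions, since the abstract does not pin down the exact constants.

```python
import numpy as np

def topk_ucb(sample_rewards, n_arms, k, horizon):
    """Top-k UCB sketch: keep a UCB index per arm, play the set with the
    top-k indices, and observe per-arm (semi-bandit) rewards.

    The radius sqrt(2*log(t)/n_i) is a standard choice and an assumption
    here; the abstract does not specify the exact confidence constant.
    """
    counts = np.zeros(n_arms)   # number of times each arm was observed
    means = np.zeros(n_arms)    # empirical mean reward of each arm
    for t in range(1, horizon + 1):
        if counts.min() == 0:
            # initialization: prioritize arms that have never been observed
            chosen = np.argsort(counts)[:k]
        else:
            ucb = means + np.sqrt(2.0 * np.log(t) / counts)
            chosen = np.argsort(ucb)[-k:]   # arms with the top-k UCB indices
        rewards = np.asarray(sample_rewards(chosen))  # one reward per chosen arm
        counts[chosen] += 1
        means[chosen] += (rewards - means[chosen]) / counts[chosen]
    return means, counts
```

For example, `sample_rewards = lambda s: np.random.binomial(1, p[s])` simulates independent Bernoulli arms with mean vector `p`; because the callback receives the entire chosen set, a set-dependent reward distribution, as in the paper's model, can be plugged in just as easily.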
Related papers
- A General Framework for Clustering and Distribution Matching with Bandit Feedback [81.50716021326194]
We develop a general framework for clustering and distribution matching problems with bandit feedback.
We derive a non-asymptotic lower bound on the average number of arm pulls for any online algorithm with an error probability not exceeding $\delta$.
arXiv Detail & Related papers (2024-09-08T12:19:12Z) - Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel Stochastic-Greedy Bandit (SGB) algorithm for combinatorial multi-armed bandit problems where no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Best-Arm Identification in Correlated Multi-Armed Bandits [9.180690233707645]
We propose a novel correlated bandit framework that captures domain knowledge about correlation between arms.
We show that the total samples obtained by C-LUCB are of the form $\mathcal{O}\left(\sum_{k \in \mathcal{C}} \log\left(\frac{1}{\delta}\right)\right)$, where $\mathcal{C}$ is the set of competitive arms, as opposed to the typical $\mathcal{O}\left(\sum_{k \in \mathcal{A}} \log\left(\frac{1}{\delta}\right)\right)$ samples required over the full arm set in the independent reward setting.
arXiv Detail & Related papers (2021-09-10T15:41:07Z) - Variance-Dependent Best Arm Identification [12.058652922228918]
We study the problem of identifying the best arm in a multi-armed bandit game.
We propose an adaptive algorithm which explores the gaps and variances of the rewards of the arms.
We show that our algorithm is optimal up to doubly logarithmic terms.
arXiv Detail & Related papers (2021-06-19T04:13:54Z) - Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z) - Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy.
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log(|\mathcal{F}|T)})$.
arXiv Detail & Related papers (2021-02-15T19:10:52Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z) - Blocking Bandits [33.14975454724348]
We consider a novel multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter.
We show that with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm.
We design a UCB-based algorithm which is shown to have $c \log T + o(\log T)$ cumulative regret against the greedy algorithm; an illustrative sketch of such a blocking-aware UCB rule appears after this list.
arXiv Detail & Related papers (2019-07-27T20:42:01Z)
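To make the blocking constraint in the last entry above concrete, here is a minimal, hypothetical Python sketch of a blocking-aware UCB rule: each round, play the available arm with the highest UCB index, then block it for a fixed delay. The function `blocking_ucb`, the `delays` parameter, and the confidence radius are assumptions for illustration, not the cited paper's exact algorithm.

```python
import numpy as np

def blocking_ucb(sample_reward, delays, horizon):
    """Illustrative UCB rule under blocking: after arm i is played it is
    unavailable for delays[i] subsequent rounds, so each round we play the
    available arm with the highest UCB index. A sketch only; the cited
    paper's exact algorithm and guarantees may differ."""
    n = len(delays)
    counts = np.zeros(n)
    means = np.zeros(n)
    available_at = np.ones(n, dtype=int)  # earliest round each arm may be played
    total_reward = 0.0
    for t in range(1, horizon + 1):
        avail = np.flatnonzero(available_at <= t)
        if avail.size == 0:
            continue  # every arm is blocked this round
        never_pulled = avail[counts[avail] == 0]
        if never_pulled.size > 0:
            i = int(never_pulled[0])  # initialize unobserved arms first
        else:
            ucb = means[avail] + np.sqrt(2.0 * np.log(t) / counts[avail])
            i = int(avail[np.argmax(ucb)])
        r = sample_reward(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
        available_at[i] = t + delays[i] + 1  # blocked for delays[i] rounds
        total_reward += r
    return total_reward
```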
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.