Combinatorial Bandits without Total Order for Arms
- URL: http://arxiv.org/abs/2103.02741v1
- Date: Wed, 3 Mar 2021 23:08:59 GMT
- Title: Combinatorial Bandits without Total Order for Arms
- Authors: Shuo Yang, Tongzheng Ren, Inderjit S. Dhillon, Sujay Sanghavi
- Abstract summary: We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$ gap-independent regret bound.
- Score: 52.93972547896022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the combinatorial bandits problem, where at each time step, the
online learner selects a size-$k$ subset $s$ from the arms set $\mathcal{A}$,
where $\left|\mathcal{A}\right| = n$, and observes a stochastic reward of each
arm in the selected set $s$. The goal of the online learner is to minimize the
regret, induced by not selecting $s^*$ which maximizes the expected total
reward. Specifically, we focus on a challenging setting where 1) the reward
distribution of an arm depends on the set $s$ it is part of, and crucially 2)
there is \textit{no total order} for the arms in $\mathcal{A}$.
In this paper, we formally present a reward model that captures set-dependent
reward distribution and assumes no total order for arms. Correspondingly, we
propose an Upper Confidence Bound (UCB) algorithm that maintains UCB for each
individual arm and selects the arms with top-$k$ UCB. We develop a novel regret
analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$
gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$
gap-independent regret bound. We also provide a lower bound for the proposed
reward model, which shows our proposed algorithm is near-optimal for any
constant $k$. Empirical results on various reward models demonstrate the broad
applicability of our algorithm.
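The algorithm described above is simple to state: maintain a UCB index for each individual arm and play the size-$k$ set with the top-$k$ indices. Below is a minimal Python sketch of that selection rule under semi-bandit feedback; the function name `topk_ucb` and the confidence radius $\sqrt{2\log t / n_i}$ are illustrative assumptions, since the abstract does not pin down the exact constants.

```python
import numpy as np

def topk_ucb(sample_rewards, n_arms, k, horizon):
    """Top-k UCB sketch: keep a UCB index per arm, play the set with the
    top-k indices, and observe per-arm (semi-bandit) rewards.

    The radius sqrt(2*log(t)/n_i) is a standard choice and an assumption
    here; the abstract does not specify the exact confidence constant.
    """
    counts = np.zeros(n_arms)   # number of times each arm was observed
    means = np.zeros(n_arms)    # empirical mean reward of each arm
    for t in range(1, horizon + 1):
        if counts.min() == 0:
            # initialization: prioritize arms that have never been observed
            chosen = np.argsort(counts)[:k]
        else:
            ucb = means + np.sqrt(2.0 * np.log(t) / counts)
            chosen = np.argsort(ucb)[-k:]   # arms with the top-k UCB indices
        rewards = np.asarray(sample_rewards(chosen))  # one reward per chosen arm
        counts[chosen] += 1
        means[chosen] += (rewards - means[chosen]) / counts[chosen]
    return means, counts
```

For example, `sample_rewards = lambda s: np.random.binomial(1, p[s])` simulates independent Bernoulli arms with mean vector `p`; because the callback receives the entire chosen set, a set-dependent reward distribution, as in the paper's model, can be plugged in just as easily.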
Related papers
- A General Framework for Clustering and Distribution Matching with Bandit Feedback [81.50716021326194]
We develop a general framework for clustering and distribution matching problems with bandit feedback.
We derive a non-asymptotic lower bound on the average number of arm pulls for any online algorithm with an error probability not exceeding $\delta$.
arXiv Detail & Related papers (2024-09-08T12:19:12Z) - Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel Stochastic-Greedy Bandit (SGB) algorithm for combinatorial multi-armed bandit problems where no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Best-Arm Identification in Correlated Multi-Armed Bandits [9.180690233707645]
We propose a novel correlated bandit framework that captures domain knowledge about correlation between arms.
We show that the total samples obtained by C-LUCB are of the form $\mathcal{O}\left(\sum_{k \in \mathcal{C}} \log\left(\frac{1}{\delta}\right)\right)$, where $\mathcal{C}$ is the set of competitive arms, as opposed to the typical $\mathcal{O}\left(\sum_{k \in \mathcal{A}} \log\left(\frac{1}{\delta}\right)\right)$ samples required over the full arm set in the independent reward setting.
arXiv Detail & Related papers (2021-09-10T15:41:07Z) - Variance-Dependent Best Arm Identification [12.058652922228918]
We study the problem of identifying the best arm in a multi-armed bandit game.
We propose an adaptive algorithm which explores the gaps and variances of the rewards of the arms.
We show that our algorithm is optimal up to doubly logarithmic terms.
arXiv Detail & Related papers (2021-06-19T04:13:54Z) - Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z) - Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy.
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log(|\mathcal{F}|T)})$.
arXiv Detail & Related papers (2021-02-15T19:10:52Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z) - Blocking Bandits [33.14975454724348]
We consider a novel multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter.
We show that with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm.
We design a UCB-based algorithm which is shown to have $c \log T + o(\log T)$ cumulative regret against the greedy algorithm; an illustrative sketch of such a blocking-aware UCB rule appears after this list.
arXiv Detail & Related papers (2019-07-27T20:42:01Z)
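To make the blocking constraint in the last entry above concrete, here is a minimal, hypothetical Python sketch of a blocking-aware UCB rule: each round, play the available arm with the highest UCB index, then block it for a fixed delay. The function `blocking_ucb`, the `delays` parameter, and the confidence radius are assumptions for illustration, not the cited paper's exact algorithm.

```python
import numpy as np

def blocking_ucb(sample_reward, delays, horizon):
    """Illustrative UCB rule under blocking: after arm i is played it is
    unavailable for delays[i] subsequent rounds, so each round we play the
    available arm with the highest UCB index. A sketch only; the cited
    paper's exact algorithm and guarantees may differ."""
    n = len(delays)
    counts = np.zeros(n)
    means = np.zeros(n)
    available_at = np.ones(n, dtype=int)  # earliest round each arm may be played
    total_reward = 0.0
    for t in range(1, horizon + 1):
        avail = np.flatnonzero(available_at <= t)
        if avail.size == 0:
            continue  # every arm is blocked this round
        never_pulled = avail[counts[avail] == 0]
        if never_pulled.size > 0:
            i = int(never_pulled[0])  # initialize unobserved arms first
        else:
            ucb = means[avail] + np.sqrt(2.0 * np.log(t) / counts[avail])
            i = int(avail[np.argmax(ucb)])
        r = sample_reward(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
        available_at[i] = t + delays[i] + 1  # blocked for delays[i] rounds
        total_reward += r
    return total_reward
```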
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.