DART: aDaptive Accept RejecT for non-linear top-K subset identification
- URL: http://arxiv.org/abs/2011.07687v1
- Date: Mon, 16 Nov 2020 02:10:06 GMT
- Title: DART: aDaptive Accept RejecT for non-linear top-K subset identification
- Authors: Mridul Agarwal, Vaneet Aggarwal, Christopher J. Quinn, Abhishek
Umrawal
- Abstract summary: We consider the bandit problem of selecting $K$ out of $N$ arms at each time step.
We present a novel algorithm for the setting without using individual arm feedback or requiring linearity of the reward function.
Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds.
- Score: 34.68931784199599
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the bandit problem of selecting $K$ out of $N$ arms at each time
step. The reward can be a non-linear function of the rewards of the selected
individual arms. The direct use of a multi-armed bandit algorithm requires
choosing among $\binom{N}{K}$ options, making the action space large. To
simplify the problem, existing works on combinatorial bandits typically
assume feedback as a linear function of individual rewards. In this paper, we
prove the lower bound for top-$K$ subset selection under bandit feedback with
possibly correlated rewards. We present a novel algorithm for the combinatorial
setting without using individual arm feedback or requiring linearity of the
reward function. Additionally, our algorithm works on correlated rewards of
individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially
finds good arms and eliminates bad arms based on confidence bounds. DART is
computationally efficient and uses storage linear in $N$. Further, DART
achieves a regret bound of $\tilde{\mathcal{O}}(K\sqrt{KNT})$ for a time
horizon $T$, which matches the lower bound in bandit feedback up to a factor of
$\sqrt{\log{2NT}}$. When applied to the problem of cross-selling optimization
and maximizing the mean of individual rewards, the performance of the proposed
algorithm surpasses that of state-of-the-art algorithms. We also show that DART
significantly outperforms existing methods for both linear and non-linear joint
reward environments.
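To make the accept/reject mechanism concrete, here is a minimal Python sketch of the idea (not the authors' exact DART pseudocode): each active arm's quality is estimated by averaging the joint rewards of the random $K$-subsets it appears in, a simplifying credit-assignment assumption that leans on the paper's conditions on the joint reward, and an arm is accepted or rejected once its confidence interval separates from the competition. The `pull` oracle and all constants are illustrative.

```python
import numpy as np

def dart_style_select(pull, N, K, T, seed=0):
    """Accept/reject sketch in the spirit of DART; `pull(S)` is an assumed
    oracle returning a noisy joint reward for the K-subset S."""
    rng = np.random.default_rng(seed)
    accepted, active = [], list(range(N))
    est = np.zeros(N)  # running mean of joint rewards of subsets containing arm i
    cnt = np.zeros(N)  # number of subsets arm i has appeared in

    for t in range(T):
        need = K - len(accepted)              # slots still to fill
        if need == 0 or len(active) == need:
            break                             # the selection is decided
        # Play the accepted arms plus a random fill from the active set.
        fill = list(rng.choice(active, size=need, replace=False))
        r = pull(accepted + fill)
        for i in fill:                        # credit the joint reward to active members
            cnt[i] += 1
            est[i] += (r - est[i]) / cnt[i]
        # Confidence radius shrinking as sqrt(log(2NT)/n); infinite until played.
        rad = np.where(cnt > 0,
                       np.sqrt(np.log(2 * N * T) / np.maximum(cnt, 1)), np.inf)
        order = sorted(active, key=lambda i: est[i], reverse=True)
        top, rest = order[:need], order[need:]
        for i in top:                         # accept: LCB beats every competitor's UCB
            if est[i] - rad[i] > max(est[j] + rad[j] for j in rest):
                accepted.append(i); active.remove(i)
        for j in rest:                        # reject: UCB below every kept arm's LCB
            if est[j] + rad[j] < min(est[i] - rad[i] for i in top):
                active.remove(j)
    order = sorted(active, key=lambda i: est[i], reverse=True)
    return accepted + order[:K - len(accepted)]   # fill leftover slots greedily
```

Storage is one mean and one counter per arm, linear in $N$ as the abstract claims; a stand-in oracle such as `pull = lambda S: sum(mu[i] for i in S) / len(S) + noise()` is enough to exercise the sketch.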
Related papers
- Stochastic Bandits Robust to Adversarial Attacks [33.278131584647745]
This paper investigates multi-armed bandit algorithms that are robust to adversarial attacks.
We study two cases of this model, with or without the knowledge of an attack budget $C$.
We devise two types of algorithms with regret bounds having additive or multiplicative $C$ dependence terms.
arXiv Detail & Related papers (2024-08-16T17:41:35Z)
- Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
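As a rough illustration of the greedy selection step behind such an approach, the sketch below applies the standard stochastic-greedy sampling rule under an assumed marginal-gain estimate `gain(S, a)` (in the bandit setting this would itself be estimated from joint-reward pulls); all names are hypothetical, not SGB's actual interface.

```python
import math, random

def stochastic_greedy(gain, ground, n, eps=0.1, seed=0):
    """Stochastic-greedy subset selection: at each of n steps, score only a
    random candidate batch instead of every remaining arm."""
    rng = random.Random(seed)
    N = len(ground)
    batch = max(1, math.ceil((N / n) * math.log(1 / eps)))  # standard batch size
    S, remaining = [], list(ground)
    for _ in range(n):
        cand = rng.sample(remaining, min(batch, len(remaining)))
        best = max(cand, key=lambda a: gain(S, a))  # greedy step on the batch
        S.append(best)
        remaining.remove(best)
    return S
```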
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
- PopArt: Efficient Sparse Regression and Experimental Design for Optimal Sparse Linear Bandits [29.097522376094624]
We propose a simple and computationally efficient sparse linear estimation method called PopArt.
We derive sparse linear bandit algorithms that enjoy improved regret upper bounds upon the state of the art.
arXiv Detail & Related papers (2022-10-25T19:13:20Z)
- Nearly Minimax Algorithms for Linear Bandits with Shared Representation [86.79657561369397]
We consider the setting where we play $M$ linear bandits with dimension $d$, each for $T$ rounds, and these $M$ bandit tasks share a common $k (\ll d)$ dimensional linear representation.
We come up with novel algorithms that achieve $\widetilde{O}\left(d\sqrt{kMT} + kM\sqrt{T}\right)$ regret bounds, which matches the known minimax regret lower bound up to logarithmic factors.
arXiv Detail & Related papers (2022-03-29T15:27:13Z)
- Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2 \sqrt{n T \log T}\right)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z)
- Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy.
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log(|\mathcal{F}|T)})$.
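Since the summary names the Inverse Gap Weighting strategy, a compact sketch of its standard single-action form may be useful (the paper extends it to top-$k$ selection over extreme action spaces); `y_hat` and `gamma` are illustrative names, not the paper's API.

```python
import numpy as np

def inverse_gap_weighting(y_hat, gamma):
    """Actions with a larger estimated gap to the leader receive
    proportionally smaller sampling probability."""
    y = np.asarray(y_hat, dtype=float)
    A = len(y)
    best = int(np.argmax(y))
    p = 1.0 / (A + gamma * (y[best] - y))  # gap-weighted probabilities
    p[best] = 0.0
    p[best] = 1.0 - p.sum()                # leader absorbs the remaining mass
    return p

# Example: sample one of three actions from the induced distribution.
rng = np.random.default_rng(0)
action = rng.choice(3, p=inverse_gap_weighting([0.2, 0.5, 0.1], gamma=10.0))
```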
arXiv Detail & Related papers (2021-02-15T19:10:52Z)
- Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
- Adversarial Combinatorial Bandits with General Non-linear Reward Functions [29.775097827244853]
We study the adversarial combinatorial bandit with a known non-linear reward function, extending existing work on the adversarial linear bandit.
We show that with $N$ arms and subsets of $K$ arms being chosen at each of $T$ time periods, the minimax optimal regret is $\widetilde{\Theta}_d(\sqrt{N^d T})$ if the reward function is a $d$-degree polynomial with $d < K$, and $\Theta_K(\sqrt{N^K T})$ if the reward function is not a low-degree polynomial.
arXiv Detail & Related papers (2021-01-05T00:56:27Z)
- Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
- Blocking Bandits [33.14975454724348]
We consider a novel multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter.
We show that with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm.
We design a UCB-based algorithm which is shown to have $c \log T + o(\log T)$ cumulative regret against the greedy algorithm.
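A minimal UCB-with-availability sketch of this setting (not the paper's exact algorithm, whose regret is measured against the greedy baseline): each arm carries a blocking delay, and each round the available arm with the highest index is played. `pull` and `delays` are assumed inputs.

```python
import numpy as np

def blocking_ucb(pull, delays, T):
    """UCB sketch for blocking bandits: play the available arm with the
    highest index; the played arm is then blocked for its delay."""
    N = len(delays)
    mean, cnt = np.zeros(N), np.zeros(N)
    free_at = np.zeros(N)                     # round each arm becomes available
    for t in range(T):
        avail = [i for i in range(N) if free_at[i] <= t]
        if not avail:
            continue                          # every arm is blocked this round
        ucb = [mean[i] + np.sqrt(2 * np.log(t + 1) / cnt[i]) if cnt[i] > 0
               else np.inf for i in avail]
        i = avail[int(np.argmax(ucb))]
        r = pull(i)
        cnt[i] += 1
        mean[i] += (r - mean[i]) / cnt[i]
        free_at[i] = t + delays[i] + 1        # unavailable for delays[i] slots
    return mean
```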
arXiv Detail & Related papers (2019-07-27T20:42:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.