Combinatorial Bandits for Maximum Value Reward Function under Max
Value-Index Feedback
- URL: http://arxiv.org/abs/2305.16074v1
- Date: Thu, 25 May 2023 14:02:12 GMT
- Title: Combinatorial Bandits for Maximum Value Reward Function under Max
Value-Index Feedback
- Authors: Yiliu Wang, Wei Chen, and Milan Vojnović
- Abstract summary: We consider a multi-armed bandit problem for maximum value reward function under maximum value and index feedback.
We propose an algorithm and provide a regret bound for problem instances with arm outcomes according to arbitrary distributions with finite supports.
Our algorithm achieves a $O((k/\Delta)\log(T))$ distribution-dependent and a $\tilde{O}(\sqrt{T})$ distribution-independent regret.
- Score: 9.771002043127728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a combinatorial multi-armed bandit problem for maximum value
reward function under maximum value and index feedback. This is a new feedback
structure that lies in between commonly studied semi-bandit and full-bandit
feedback structures. We propose an algorithm and provide a regret bound for
problem instances with stochastic arm outcomes according to arbitrary
distributions with finite supports. The regret analysis rests on considering an
extended set of arms, associated with values and probabilities of arm outcomes,
and applying a smoothness condition. Our algorithm achieves a
$O((k/\Delta)\log(T))$ distribution-dependent and a $\tilde{O}(\sqrt{T})$
distribution-independent regret where $k$ is the number of arms selected in
each round, $\Delta$ is a distribution-dependent reward gap and $T$ is the
horizon time. Perhaps surprisingly, the regret bound is comparable to the
previously known bound under the more informative semi-bandit feedback. We
demonstrate the effectiveness of our algorithm through experimental results.
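To make the feedback structure concrete, the following is a minimal Python sketch of the interaction protocol: the learner plays a set of $k$ arms whose outcomes follow finite-support distributions, and observes only the maximum realized value together with the index of an arm attaining it. The class name and parameters are illustrative assumptions, not the authors' implementation; in particular, the paper's algorithm additionally works with an extended set of (arm, value) pairs, which this sketch does not reproduce.

```python
import random

class MaxValueIndexBandit:
    """Toy environment for max value-index feedback (illustrative only)."""

    def __init__(self, supports, probs, seed=0):
        self.supports = supports  # supports[i]: finite support of arm i
        self.probs = probs        # probs[i]: matching outcome probabilities
        self.rng = random.Random(seed)

    def play(self, S):
        # Draw an outcome for every selected arm; the individual draws
        # stay hidden from the learner under this feedback structure.
        draws = {i: self.rng.choices(self.supports[i], self.probs[i])[0]
                 for i in S}
        i_star = max(draws, key=draws.get)
        return draws[i_star], i_star  # (max value, index of a maximizer)

# Example: two arms with binary supports; play both, observe one pair.
env = MaxValueIndexBandit(supports=[[0.0, 1.0], [0.0, 0.5]],
                          probs=[[0.7, 0.3], [0.2, 0.8]])
value, index = env.play({0, 1})
```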
Related papers
- Continuous K-Max Bandits [54.21533414838677]
We study the $K$-Max multi-armed bandits problem with continuous outcome distributions and weak value-index feedback.
This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc.
Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds.
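The adaptive discretization the summary mentions can be previewed with a deliberately simplified, non-adaptive sketch: map continuous outcomes in $[0,1]$ onto a uniform $\epsilon$-grid so that finite-support K-Max machinery applies. The helper name and the uniform grid are hypothetical; DCK-UCB's adaptive grid and bias-corrected confidence bounds are not reproduced here.

```python
import math

def discretize(x, eps):
    """Map an outcome x in [0, 1] down to a uniform eps-grid (hypothetical
    helper; DCK-UCB adapts the grid and corrects the resulting estimation
    bias, both of which this sketch omits)."""
    return min(math.floor(x / eps) * eps, 1.0)

# Example: with eps = 0.1, the outcome 0.5321 is treated as the discrete
# value 0.5, at the cost of an O(eps) discretization error per arm.
assert abs(discretize(0.5321, 0.1) - 0.5) < 1e-9
```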
arXiv Detail & Related papers (2025-02-19T06:37:37Z)
- A General Framework for Clustering and Distribution Matching with Bandit Feedback [81.50716021326194]
We develop a general framework for clustering and distribution matching problems with bandit feedback.
We derive a non-asymptotic lower bound on the average number of arm pulls for any online algorithm with an error probability not exceeding $\delta$.
Our refined analysis uncovers a novel bound on the speed at which the average number of arm pulls of our algorithm converges to the fundamental limit as $\delta$ vanishes.
arXiv Detail & Related papers (2024-09-08T12:19:12Z)
- Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems where no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
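To illustrate why a stochastic-greedy selection step suits a large set of base arms, here is a sketch of just that subroutine: each greedy addition scans only a small random sample of the remaining arms instead of all of them. The value estimates and sample size are placeholders; the SGB algorithm's explore-then-commit schedule and estimators are specified in the paper and not reproduced here.

```python
import random

def stochastic_greedy(values, n, sample_size, rng=None):
    """Pick an n-subset greedily, scanning only a random sample of the
    remaining base arms at each step (cheap for large arm sets). `values`
    stands in for whatever marginal-gain estimates an SGB-style algorithm
    maintains; this is an illustrative placeholder."""
    rng = rng or random.Random(0)
    remaining, chosen = set(values), []
    for _ in range(n):
        sample = rng.sample(sorted(remaining), min(sample_size, len(remaining)))
        best = max(sample, key=values.get)
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Example: choose 3 of 1000 arms while scanning 32 candidates per step.
values = {i: ((i * 37) % 1000) / 1000 for i in range(1000)}
print(stochastic_greedy(values, n=3, sample_size=32))
```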
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
- Contextual Combinatorial Volatile Bandits via Gaussian Processes [10.312968200748116]
We consider a contextual bandit problem with a set of available base arms and their contexts.
We propose an algorithm called Optimistic Combinatorial Learning and Optimization with Kernel Upper Confidence Bounds (O'CLOK-UCB).
We experimentally show that both algorithms vastly outperform the previous state-of-the-art UCB-based algorithms in realistic setups.
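For intuition about the kernelized upper confidence bounds behind O'CLOK-UCB, the sketch below computes a generic kernel-UCB index for one candidate context: posterior mean plus a multiple of posterior standard deviation under kernel ridge regression with an RBF kernel. The kernel choice, regularizer, and beta are illustrative defaults, not the paper's exact index.

```python
import numpy as np

def kernel_ucb_index(X, y, x_new, beta=2.0, lam=1.0):
    """Generic kernel-UCB index: posterior mean + beta * posterior std
    under kernel ridge regression (illustrative defaults, not the exact
    O'CLOK-UCB index)."""
    def k(a, b):
        return np.exp(-0.5 * np.sum((a - b) ** 2))  # RBF kernel
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    kv = np.array([k(xi, x_new) for xi in X])
    A = K + lam * np.eye(len(X))
    mean = kv @ np.linalg.solve(A, y)
    var = k(x_new, x_new) - kv @ np.linalg.solve(A, kv)
    return mean + beta * np.sqrt(max(var, 0.0))

# Example: three observed (context, reward) pairs, one candidate context.
X = np.array([[0.1], [0.4], [0.9]])
y = np.array([0.2, 0.5, 0.1])
print(kernel_ucb_index(X, y, np.array([0.3])))
```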
arXiv Detail & Related papers (2021-10-05T18:02:10Z)
- Combinatorial Bandits without Total Order for Arms [52.93972547896022]
We present a reward model that captures set-dependent reward distribution and assumes no total order for arms.
We develop a novel regret analysis and show an $O\left(\frac{k^2 n \log T}{\epsilon}\right)$ gap-dependent regret bound as well as an $O\left(k^2\sqrt{n T \log T}\right)$ gap-independent regret bound.
arXiv Detail & Related papers (2021-03-03T23:08:59Z)
- Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy.
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log(|\mathcal{F}|T)})$.
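The Inverse Gap Weighting strategy named above has a compact closed form; the sketch below gives the generic finite-action version, not the paper's extreme-scale, top-$k$ variant: each suboptimal action receives probability inversely proportional to its estimated gap to the best action, scaled by a parameter gamma.

```python
def inverse_gap_weighting(est_rewards, gamma):
    """Generic Inverse Gap Weighting rule: action a != best gets
    probability 1 / (k + gamma * gap(a)), and the empirically best
    action absorbs the remaining mass; larger gamma means more
    exploitation."""
    k = len(est_rewards)
    best = max(range(k), key=est_rewards.__getitem__)
    probs = [0.0] * k
    for a in range(k):
        if a != best:
            probs[a] = 1.0 / (k + gamma * (est_rewards[best] - est_rewards[a]))
    probs[best] = 1.0 - sum(probs)
    return probs

# Example: gamma = 10 concentrates most of the mass on action 1.
print(inverse_gap_weighting([0.2, 0.9, 0.5, 0.1], gamma=10.0))
```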
arXiv Detail & Related papers (2021-02-15T19:10:52Z)
- Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
- Blocking Bandits [33.14975454724348]
We consider a novel multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter.
We show that with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm.
We design a UCB-based algorithm which is shown to have $c \log T + o(\log T)$ cumulative regret against the greedy algorithm.
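A minimal sketch of the kind of UCB-based rule described above, under stated assumptions: in each round the learner plays the available arm with the largest UCB index, and the played arm then becomes unavailable for its delay. The function name, index form, and bookkeeping are hypothetical, not the paper's exact algorithm.

```python
import math

def ucb_blocking_step(t, counts, means, blocked_until, delays):
    """One round of a UCB-style greedy rule for blocking bandits
    (hypothetical sketch): play the best available arm by UCB index,
    then block it for its delay."""
    available = [i for i, b in enumerate(blocked_until) if b <= t]
    if not available:
        return None  # every arm is still blocked in this time slot

    def ucb(i):
        if counts[i] == 0:
            return float("inf")  # force an initial pull of each arm
        return means[i] + math.sqrt(2.0 * math.log(t + 1) / counts[i])

    arm = max(available, key=ucb)
    blocked_until[arm] = t + delays[arm] + 1  # unavailable for delays[arm] slots
    return arm  # caller observes the reward and updates counts/means
```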
arXiv Detail & Related papers (2019-07-27T20:42:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.