Multiple-play Stochastic Bandits with Prioritized Arm Capacity Sharing
- URL: http://arxiv.org/abs/2512.21626v1
- Date: Thu, 25 Dec 2025 11:19:09 GMT
- Title: Multiple-play Stochastic Bandits with Prioritized Arm Capacity Sharing
- Authors: Hong Xie, Haoran Gu, Yanying Huang, Tao Tan, Defu Lian
- Abstract summary: The model is composed of $M$ arms and $K$ plays. Each arm has a stochastic number of capacity units, and each unit of capacity is associated with a reward function. When multiple plays compete for an arm's capacity, the capacity is allocated to plays with larger priority weights first.
- Score: 52.124267908936396
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a variant of multiple-play stochastic bandits tailored to resource allocation problems arising from LLM applications, edge intelligence, etc. The model is composed of $M$ arms and $K$ plays. Each arm has a stochastic number of capacities, and each unit of capacity is associated with a reward function. Each play is associated with a priority weight. When multiple plays compete for the arm capacity, the arm capacity is allocated in a larger priority weight first manner. Instance independent and instance dependent regret lower bounds of $\Omega(\alpha_1 \sigma \sqrt{KMT})$ and $\Omega(\alpha_1 \sigma^2 \frac{M}{\Delta} \ln T)$ are proved, where $\alpha_1$ is the largest priority weight and $\sigma$ characterizes the reward tail. When model parameters are given, we design an algorithm named \texttt{MSB-PRS-OffOpt} to locate the optimal play allocation policy with a computational complexity of $O(MK^3)$. Utilizing \texttt{MSB-PRS-OffOpt} as a subroutine, an approximate upper confidence bound (UCB) based algorithm is designed, which has instance independent and instance dependent regret upper bounds matching the corresponding lower bounds up to factors of $\sqrt{K \ln KT}$ and $\alpha_1 K^2$ respectively. To this end, we address nontrivial technical challenges arising from optimizing and learning under a special nonlinear combinatorial utility function induced by the prioritized resource sharing mechanism.
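The "larger priority weight first" sharing rule in the abstract can be sketched as a simple greedy allocation: plays assigned to the same arm consume the arm's realized capacity in decreasing order of priority weight. This is a minimal illustrative sketch; the function name, signature, and one-unit-per-play assumption are our own, not taken from the paper.

```python
def allocate_capacity(capacity, plays):
    """Illustrative priority-first capacity sharing.

    capacity: realized number of capacity units of the arm (int).
    plays: list of (play_id, priority_weight) pairs competing for this arm.
    Returns: dict mapping play_id -> units received (0 or 1 here,
    assuming each play requests one unit).
    """
    allocation = {pid: 0 for pid, _ in plays}
    # Serve plays in decreasing order of priority weight.
    for pid, _weight in sorted(plays, key=lambda p: p[1], reverse=True):
        if capacity <= 0:
            break  # arm capacity exhausted; remaining plays get nothing
        allocation[pid] = 1
        capacity -= 1
    return allocation
```

For example, with capacity 2 and plays weighted 0.9, 0.7, and 0.5, the two highest-priority plays are served and the lowest-priority play receives nothing.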
Related papers
- One Good Source is All You Need: Near-Optimal Regret for Bandits under Heterogeneous Noise [49.12618706309658]
Source-Optimistic Adaptive Regret minimization (SOAR) is a novel algorithm that prunes high-variance sources using sharp variance-concentration bounds. We show it achieves the optimal instance-dependent regret of standard single-source MAB with variance $\sigma_*^2$. Our theoretical bounds represent a significant improvement over some proposed baselines.
arXiv Detail & Related papers (2026-02-16T05:25:06Z) - Learning to Allocate Resources with Censored Feedback [37.26445474395888]
We study the online resource allocation problem in which a budget $B$ must be allocated across $K$ arms under censored feedback. We propose RA-UCB, an optimistic algorithm that leverages non-trivial parameter estimation and confidence bounds. We then validate our theoretical results through experiments on real-world datasets.
arXiv Detail & Related papers (2026-02-06T10:04:54Z) - Batched Stochastic Matching Bandits [43.651070266360954]
We introduce a novel bandit framework for matching based on the Multinomial Logit (MNL) choice model. In our setting, $N$ agents on one side are assigned to $K$ arms on the other side, where each arm selects an agent from its assigned pool according to an unknown preference. The objective is to minimize regret by maximizing the cumulative revenue from successful matches across all agents.
arXiv Detail & Related papers (2025-09-04T13:16:32Z) - Combinatorial Logistic Bandits [30.829239785016934]
We introduce a novel framework called combinatorial logistic bandits (CLogB). In each round, a subset of base arms (called the super arm) is selected, with the outcome of each base arm being binary. Experiments on real-world datasets demonstrate the superior performance of our algorithms compared to benchmark algorithms.
arXiv Detail & Related papers (2024-10-22T14:52:46Z) - Optimal level set estimation for non-parametric tournament and crowdsourcing problems [49.75262185577198]
Motivated by crowdsourcing, we consider a problem where we partially observe the correctness of the answers of $n$ experts on $d$ questions.
In this paper, we assume that the matrix $M$ containing the probability that expert $i$ answers question $j$ correctly is bi-isotonic up to a permutation of its rows and columns.
We construct an efficient-time algorithm that turns out to be minimax optimal for this classification problem.
arXiv Detail & Related papers (2024-08-27T18:28:31Z) - Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Contextual Combinatorial Bandits with Probabilistically Triggered Arms [55.9237004478033]
We study contextual combinatorial bandits with probabilistically triggered arms (C$^2$MAB-T) under a variety of smoothness conditions.
Under the triggering probability modulated (TPM) condition, we devise the C$^2$-UCB-T algorithm and derive a regret bound of $\tilde{O}(d\sqrt{T})$.
arXiv Detail & Related papers (2023-03-30T02:51:00Z) - Complexity Analysis of a Countable-armed Bandit Problem [9.163501953373068]
We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$.
We propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}(\log n)$ when $K=2$.
While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter.
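The $\mathcal{O}(\log n)$ instance-dependent rate above is the classical benchmark achieved by index policies such as UCB1. As a generic point of reference (not the paper's own algorithm), a minimal UCB1 sketch looks like this; the `pull` callback and function names are illustrative assumptions:

```python
import math

def ucb1(pull, K, horizon):
    """Generic UCB1 for K arms with rewards in [0, 1].

    pull(a) returns the observed reward of arm a.
    Returns the pull counts per arm after `horizon` rounds.
    """
    counts = [0] * K
    means = [0.0] * K
    for a in range(K):  # initialize: pull each arm once
        means[a] = pull(a)
        counts[a] = 1
    for t in range(K, horizon):
        # Index = empirical mean + confidence radius; pick the largest.
        a = max(range(K),
                key=lambda i: means[i] + math.sqrt(2 * math.log(t + 1) / counts[i]))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
    return counts
```

With deterministic rewards 0.9 and 0.1 on two arms, the better arm quickly dominates the pull counts while the confidence radius keeps occasional exploration of the worse arm, yielding the logarithmic regret growth.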
arXiv Detail & Related papers (2023-01-18T00:53:46Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Contextual Combinatorial Bandits with Changing Action Sets via Gaussian Processes [8.919345630832366]
We consider a contextual combinatorial bandit problem with a changing action set and time-varying base arm availability. We propose an algorithm called Optimistic Combinatorial Learning and Optimization with Kernel Upper Confidence Bounds (O'CLOK-UCB). To dramatically speed up the algorithm, we also propose a variant of O'CLOK-UCB that uses sparse GPs.
arXiv Detail & Related papers (2021-10-05T18:02:10Z) - Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy.
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log(|\mathcal{F}|T)})$.
arXiv Detail & Related papers (2021-02-15T19:10:52Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.