Multiple-play Stochastic Bandits with Prioritized Arm Capacity Sharing
- URL: http://arxiv.org/abs/2512.21626v1
- Date: Thu, 25 Dec 2025 11:19:09 GMT
- Title: Multiple-play Stochastic Bandits with Prioritized Arm Capacity Sharing
- Authors: Hong Xie, Haoran Gu, Yanying Huang, Tao Tan, Defu Lian
- Abstract summary: The model is composed of $M$ arms and $K$ plays. Each arm has a stochastic number of capacity units, and each unit of capacity is associated with a reward function. When multiple plays compete for an arm's capacity, the capacity is allocated to plays with larger priority weights first.
- Score: 52.124267908936396
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a variant of multiple-play stochastic bandits tailored to resource allocation problems arising from LLM applications, edge intelligence, etc. The model is composed of $M$ arms and $K$ plays. Each arm has a stochastic number of capacities, and each unit of capacity is associated with a reward function. Each play is associated with a priority weight. When multiple plays compete for the arm capacity, the arm capacity is allocated in a larger priority weight first manner. Instance independent and instance dependent regret lower bounds of $\Omega(\alpha_1 \sigma \sqrt{KMT})$ and $\Omega(\alpha_1 \sigma^2 \frac{M}{\Delta} \ln T)$ are proved, where $\alpha_1$ is the largest priority weight and $\sigma$ characterizes the reward tail. When model parameters are given, we design an algorithm named \texttt{MSB-PRS-OffOpt} to locate the optimal play allocation policy with a computational complexity of $O(MK^3)$. Utilizing \texttt{MSB-PRS-OffOpt} as a subroutine, an approximate upper confidence bound (UCB) based algorithm is designed, which has instance independent and instance dependent regret upper bounds matching the corresponding lower bounds up to factors of $\sqrt{K \ln KT}$ and $\alpha_1 K^2$ respectively. To this end, we address nontrivial technical challenges arising from optimizing and learning under a special nonlinear combinatorial utility function induced by the prioritized resource sharing mechanism.
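The "larger priority weight first" sharing rule in the abstract can be sketched as a simple greedy allocation: plays assigned to the same arm consume the arm's realized capacity in decreasing order of priority weight. This is a minimal illustrative sketch; the function name, signature, and one-unit-per-play assumption are our own, not taken from the paper.

```python
def allocate_capacity(capacity, plays):
    """Illustrative priority-first capacity sharing.

    capacity: realized number of capacity units of the arm (int).
    plays: list of (play_id, priority_weight) pairs competing for this arm.
    Returns: dict mapping play_id -> units received (0 or 1 here,
    assuming each play requests one unit).
    """
    allocation = {pid: 0 for pid, _ in plays}
    # Serve plays in decreasing order of priority weight.
    for pid, _weight in sorted(plays, key=lambda p: p[1], reverse=True):
        if capacity <= 0:
            break  # arm capacity exhausted; remaining plays get nothing
        allocation[pid] = 1
        capacity -= 1
    return allocation
```

For example, with capacity 2 and plays weighted 0.9, 0.7, and 0.5, the two highest-priority plays are served and the lowest-priority play receives nothing.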
Related papers
- One Good Source is All You Need: Near-Optimal Regret for Bandits under Heterogeneous Noise [49.12618706309658]
Source-Optimistic Adaptive Regret minimization (SOAR) is a novel algorithm that prunes high-variance sources using sharp variance-concentration bounds. We show it achieves the optimal instance-dependent regret of standard single-source MAB with variance $\sigma_*^2$. Our theoretical bounds represent a significant improvement over some proposed baselines.
arXiv Detail & Related papers (2026-02-16T05:25:06Z) - Learning to Allocate Resources with Censored Feedback [37.26445474395888]
We study the online resource allocation problem in which a budget $B$ must be allocated across $K$ arms under censored feedback. We propose RA-UCB, an optimistic algorithm that leverages non-trivial parameter estimation and confidence bounds. We then validate our theoretical results through experiments on real-world datasets.
arXiv Detail & Related papers (2026-02-06T10:04:54Z) - Batched Stochastic Matching Bandits [43.651070266360954]
We introduce a novel bandit framework for matching based on the Multinomial Logit (MNL) choice model. In our setting, $N$ agents on one side are assigned to $K$ arms on the other side, where each arm selects an agent from its assigned pool according to an unknown preference. The objective is to minimize regret by maximizing the cumulative revenue from successful matches across all agents.
arXiv Detail & Related papers (2025-09-04T13:16:32Z) - Combinatorial Logistic Bandits [30.829239785016934]
We introduce a novel framework called combinatorial logistic bandits (CLogB). In each round, a subset of base arms (called the super arm) is selected, with the outcome of each base arm being binary. Experiments on real-world datasets demonstrate the superior performance of our algorithms compared to benchmark algorithms.
arXiv Detail & Related papers (2024-10-22T14:52:46Z) - Optimal level set estimation for non-parametric tournament and crowdsourcing problems [49.75262185577198]
Motivated by crowdsourcing, we consider a problem where we partially observe the correctness of the answers of $n$ experts on $d$ questions.
In this paper, we assume that the matrix $M$ containing the probability that expert $i$ answers question $j$ correctly is bi-isotonic up to a permutation of its rows and columns.
We construct an efficient-time algorithm that turns out to be minimax optimal for this classification problem.
arXiv Detail & Related papers (2024-08-27T18:28:31Z) - Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Contextual Combinatorial Bandits with Probabilistically Triggered Arms [55.9237004478033]
We study contextual combinatorial bandits with probabilistically triggered arms (C$^2$MAB-T) under a variety of smoothness conditions.
Under the triggering probability modulated (TPM) condition, we devise the C$^2$-UCB-T algorithm and derive a regret bound of $\tilde{O}(d\sqrt{T})$.
arXiv Detail & Related papers (2023-03-30T02:51:00Z) - Complexity Analysis of a Countable-armed Bandit Problem [9.163501953373068]
We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$.
We propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}(\log n)$ when $K=2$.
While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter.
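The $\mathcal{O}(\log n)$ instance-dependent rate above is the classical benchmark achieved by index policies such as UCB1. As a generic point of reference (not the paper's own algorithm), a minimal UCB1 sketch looks like this; the `pull` callback and function names are illustrative assumptions:

```python
import math

def ucb1(pull, K, horizon):
    """Generic UCB1 for K arms with rewards in [0, 1].

    pull(a) returns the observed reward of arm a.
    Returns the pull counts per arm after `horizon` rounds.
    """
    counts = [0] * K
    means = [0.0] * K
    for a in range(K):  # initialize: pull each arm once
        means[a] = pull(a)
        counts[a] = 1
    for t in range(K, horizon):
        # Index = empirical mean + confidence radius; pick the largest.
        a = max(range(K),
                key=lambda i: means[i] + math.sqrt(2 * math.log(t + 1) / counts[i]))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
    return counts
```

With deterministic rewards 0.9 and 0.1 on two arms, the better arm quickly dominates the pull counts while the confidence radius keeps occasional exploration of the worse arm, yielding the logarithmic regret growth.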
arXiv Detail & Related papers (2023-01-18T00:53:46Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Contextual Combinatorial Bandits with Changing Action Sets via Gaussian Processes [8.919345630832366]
We consider a contextual combinatorial bandit problem with a changing action set and time-varying base arm availability. We propose an algorithm called Optimistic Combinatorial Learning and Optimization with Kernel Upper Confidence Bounds (O'CLOK-UCB). To dramatically speed up the algorithm, we also propose a variant of O'CLOK-UCB that uses sparse GPs.
arXiv Detail & Related papers (2021-10-05T18:02:10Z) - Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy.
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log(|\mathcal{F}|T)})$.
arXiv Detail & Related papers (2021-02-15T19:10:52Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.