Queue Length Regret Bounds for Contextual Queueing Bandits
- URL: http://arxiv.org/abs/2601.19300v1
- Date: Tue, 27 Jan 2026 07:40:23 GMT
- Title: Queue Length Regret Bounds for Contextual Queueing Bandits
- Authors: Seoungbin Bae, Garyeong Kang, Dabeen Lee
- Abstract summary: We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. We show that our algorithm, CQB-$\varepsilon$, achieves a regret upper bound of $\widetilde{\mathcal{O}}(T^{-1/4})$. We also consider the setting of adversarially chosen contexts, for which our second algorithm, CQB-Opt, achieves a regret upper bound of $\mathcal{O}(\log^2 T)$.
- Score: 0.8984888893275712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Individual jobs carry heterogeneous contextual features, based on which the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter. To evaluate the performance of a policy, we consider queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge in the analysis is that the lists of remaining job features in the queue may differ under our policy versus the optimal policy for a given time step, since they may process jobs in different orders. To address this, we propose the idea of policy-switching queues equipped with a sophisticated coupling argument. This leads to a novel queue length regret decomposition framework, allowing us to understand the short-term effect of choosing a suboptimal job-server pair and its long-term effect on queue state differences. We show that our algorithm, CQB-$\varepsilon$, achieves a regret upper bound of $\widetilde{\mathcal{O}}(T^{-1/4})$. We also consider the setting of adversarially chosen contexts, for which our second algorithm, CQB-Opt, achieves a regret upper bound of $\mathcal{O}(\log^2 T)$. Lastly, we provide experimental results that validate our theoretical findings.
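To make the setup concrete, here is a minimal simulation sketch of the kind of environment described above: jobs carry feature vectors, each server has an unknown parameter, and a matched job departs with probability given by a logistic model of its features. The arrival process, the online logistic updates, and the $\varepsilon$-greedy rule below are illustrative simplifications, not the authors' CQB-$\varepsilon$ algorithm.

```python
# Minimal sketch of a contextual queueing bandit environment (assumptions:
# Gaussian features, Bernoulli arrivals, epsilon-greedy scheduling).
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 4, 3, 2000                 # feature dim, number of servers, horizon
arrival_rate = 0.6                   # prob. a new job arrives in a slot (assumption)
theta_true = rng.normal(size=(K, d)) # unknown server-specific parameters

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta_hat = np.zeros((K, d))         # online logistic-regression estimates
queue = []                           # feature vectors of jobs waiting for service
eps, lr = 0.1, 0.05

for t in range(T):
    if rng.random() < arrival_rate:  # job arrival with a fresh context
        queue.append(rng.normal(size=d))
    if not queue:
        continue
    # epsilon-greedy choice of a (job, server) pair
    if rng.random() < eps:
        i, j = rng.integers(len(queue)), rng.integers(K)
    else:
        scores = sigmoid(np.array(queue) @ theta_hat.T)      # |queue| x K
        i, j = np.unravel_index(np.argmax(scores), scores.shape)
    x = queue[i]
    served = rng.random() < sigmoid(x @ theta_true[j])       # logistic departure
    # stochastic-gradient update of the chosen server's logistic parameter
    theta_hat[j] += lr * (float(served) - sigmoid(x @ theta_hat[j])) * x
    if served:
        queue.pop(i)

print("final queue length:", len(queue))
```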
Related papers
- Queueing-Aware Optimization of Reasoning Tokens for Accuracy-Latency Trade-offs in LLM Servers [4.3400407844814985]
We consider a single large language model (LLM) server that serves a heterogeneous stream of queries belonging to $N$ distinct task types. For each task type, the server allocates a fixed number of internal thinking tokens, which determines the computational effort devoted to that query. We formulate a constrained optimization problem that maximizes a weighted average accuracy objective penalized by the mean system time.
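As a rough illustration of this trade-off, the sketch below scores candidate per-type token budgets by a weighted accuracy term minus a delay penalty; the saturating accuracy curve and the M/M/1-style delay formula are stand-in assumptions of mine, not the paper's model.

```python
# Toy accuracy-latency objective over thinking-token budgets (the accuracy
# curve and M/M/1 delay formula are illustrative assumptions, not the paper's).
from itertools import product
import numpy as np

N = 2                              # task types
weights = np.array([0.7, 0.3])     # relative importance / traffic share (assumption)
lam = 0.8                          # total query arrival rate (assumption)
penalty = 0.5                      # weight on mean system time in the objective

def accuracy(budgets):             # saturating accuracy vs. token budget
    return 1.0 - np.exp(-budgets / 200.0)

def mean_system_time(budgets):     # M/M/1 approximation: W = 1 / (mu - lambda)
    service_time = 1e-3 * (50 + weights @ budgets)  # grows with average tokens
    mu = 1.0 / service_time
    return np.inf if lam >= mu else 1.0 / (mu - lam)

def objective(budgets):
    b = np.array(budgets, dtype=float)
    return weights @ accuracy(b) - penalty * mean_system_time(b)

best = max(product([100, 200, 400, 800], repeat=N), key=objective)
print("best token budgets per task type:", best)
```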
arXiv Detail & Related papers (2026-01-15T10:47:11Z) - Near-Optimal Regret-Queue Length Tradeoff in Online Learning for Two-Sided Markets [11.51661878156199]
We study a two-sided market in which price-sensitive, heterogeneous customers and servers arrive and join their respective queues. A compatible customer-server pair can then be matched by the platform, at which point they leave the system. Our objective is to design pricing and matching algorithms that maximize the platform's profit while maintaining reasonable queue lengths.
arXiv Detail & Related papers (2025-10-15T21:09:41Z) - Rethinking Thinking Tokens: LLMs as Improvement Operators [80.12087211785949]
Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: can current models leverage their metacognition to provide other combinations on this Pareto frontier? We identify an interesting inference family, Parallel-Distill-Refine (PDR), which (i) generates diverse drafts in parallel, (ii) distills them into a bounded, textual workspace, and (iii) refines the answer conditioned on this workspace.
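As a structural sketch of the PDR family, the snippet below walks through the three steps with a hypothetical `llm()` placeholder standing in for a real model call; the prompts and single-round control flow are simplifications of mine, not the paper's exact procedure.

```python
# Structural sketch of Parallel-Distill-Refine; llm() is a hypothetical
# placeholder for a model call, not a real API.
def llm(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"   # stand-in response

def parallel_distill_refine(question: str, n_drafts: int = 4) -> str:
    # (i) generate diverse drafts (in parallel in the paper; sequential here)
    drafts = [llm(f"Draft {i}: solve {question}") for i in range(n_drafts)]
    # (ii) distill the drafts into a bounded textual workspace
    workspace = llm("Summarize the key ideas of these drafts:\n" + "\n".join(drafts))
    # (iii) refine, conditioned only on the bounded workspace
    return llm(f"Using these notes:\n{workspace}\nAnswer: {question}")

print(parallel_distill_refine("What is 17 * 24?"))
```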
arXiv Detail & Related papers (2025-10-01T17:08:59Z) - Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling. It comprises four distinct small-scale agents with clearly defined roles and effective collaboration. It shows superior performance at a smaller parameter scale without sacrificing performance on general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z) - Learning payoffs while routing in skill-based queues [0.4077787659104315]
We construct a machine learning algorithm that adaptively learns the payoff parameters while maximizing the total payoff. We show that the algorithm is optimal up to logarithmic terms by deriving a regret lower bound.
arXiv Detail & Related papers (2024-12-13T14:33:50Z) - Queueing Matching Bandits with Preference Feedback [10.988222071035198]
We consider multi-class asymmetric queueing systems consisting of $N$ queues on one side and $K$ servers on the other side. The service rate of each job-server assignment is unknown and modeled by a feature-based multinomial logit (MNL) function. We propose algorithms based on UCB and Thompson Sampling, which achieve system stability with an average queue length bounded by $O(\min\{N,K\}/\epsilon)$ for a large time horizon.
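To illustrate the two ingredients mentioned here, the sketch below computes a feature-based MNL service probability (softmax over job-server utilities with an outside no-service option) and a Hoeffding-style UCB index; both formulas are simplified stand-ins of mine rather than a reproduction of the paper's algorithm.

```python
# Feature-based MNL service probabilities plus a Hoeffding-style UCB index
# (both are simplified stand-ins, not the paper's estimator or bonus).
import numpy as np

rng = np.random.default_rng(1)
N, K, d = 4, 3, 5                        # queues, servers, feature dimension
features = rng.normal(size=(N, K, d))    # feature of each job-server pair
theta = rng.normal(size=d)               # unknown MNL parameter

def mnl_service_probs(server_j):
    """MNL probability that server j serves each of the N queues."""
    utilities = features[:, server_j, :] @ theta
    expu = np.exp(utilities)
    return expu / (1.0 + expu.sum())     # "1 +" models the no-service option

def ucb_index(successes, pulls, t):
    """Optimistic estimate of a pair's service rate (Hoeffding-style bonus)."""
    pulls = np.maximum(pulls, 1)
    return successes / pulls + np.sqrt(2.0 * np.log(t + 1) / pulls)

print(mnl_service_probs(0))
print(ucb_index(np.array([3.0, 1.0]), np.array([5.0, 2.0]), t=100))
```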
arXiv Detail & Related papers (2024-10-14T02:29:06Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^{2}}\right)$.
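For intuition about the reward structure, the sketch below simulates an AR($k$) reward recursion with arm-specific coefficients; the coefficients, the alternating arm choices, and the noise level are illustrative choices of mine, not taken from the paper.

```python
# Simulation of an AR(k) reward process with arm-specific coefficients
# (the coefficients and noise level are illustrative choices).
import numpy as np

rng = np.random.default_rng(2)
k, T = 2, 10_000
# gamma[a] = (gamma_0, gamma_1, ..., gamma_k); lag coefficients sum below 1
# for each arm so that the process stays stable.
gamma = np.array([[1.0, 0.5, 0.2],
                  [1.5, 0.3, 0.1]])

history = [0.0] * k                      # last k observed rewards
rewards = []
for t in range(T):
    a = t % 2                            # alternate arms just to generate data
    lagged = np.array(history[-k:][::-1])          # (r_{t-1}, ..., r_{t-k})
    r = gamma[a, 0] + gamma[a, 1:] @ lagged + 0.1 * rng.normal()
    history.append(r)
    rewards.append(r)

print("empirical long-run average reward:", np.mean(rewards[-1000:]))
```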
arXiv Detail & Related papers (2022-12-12T21:37:36Z) - A Unifying Framework for Online Optimization with Long-Term Constraints [62.35194099438855]
We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints.
The goal is to maximize their total reward, while at the same time achieving small cumulative violation across the $T$ rounds.
We present the first best-of-both-worlds type algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown stochastic model and in the case in which they are selected at each round by an adversary.
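For intuition about the problem class, here is a generic primal-dual sketch: each round the learner trades off reward against a Lagrange multiplier on the constraint, then performs dual ascent on the observed violation. This is a textbook-style illustration under full-information assumptions of mine, not the best-of-both-worlds algorithm of the paper.

```python
# Generic primal-dual sketch for online decisions with a long-term constraint
# (full-information, single constraint; illustrative only).
import numpy as np

rng = np.random.default_rng(3)
T, eta = 1000, 0.05
lam = 0.0                              # Lagrange multiplier on the constraint
grid = np.linspace(0.0, 1.0, 101)      # simple one-dimensional decision set
total_reward, total_violation = 0.0, 0.0

for t in range(T):
    a, b = rng.uniform(0.5, 1.5), rng.uniform(0.2, 0.8)
    reward = lambda x: a * x           # round-t reward function
    cost = lambda x: x - b             # long-term constraint g_t(x) <= 0
    # primal step: best response to the current Lagrangian
    x = grid[np.argmax([reward(z) - lam * cost(z) for z in grid])]
    total_reward += reward(x)
    total_violation += max(cost(x), 0.0)
    lam = max(0.0, lam + eta * cost(x))  # dual ascent on the observed violation

print(f"total reward {total_reward:.1f}, cumulative violation {total_violation:.1f}")
```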
arXiv Detail & Related papers (2022-09-15T16:59:19Z) - Learning to Schedule in Parallel-Server Queues with Stochastic Bilinear Rewards [7.519872646378837]
We consider the problem of scheduling in multi-class, parallel-server systems with uncertain rewards from job-server assignments. Our objective is to minimize regret by maximizing the cumulative reward of job-server assignments over a time horizon. Our algorithm achieves a sublinear regret bound and a sublinear mean holding cost.
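The bilinear structure itself is easy to illustrate: the expected reward of assigning job $i$ to server $j$ is $u_i^\top A v_j$ for an unknown matrix $A$, which is linear in $\mathrm{vec}(A)$ and can therefore be estimated by ridge regression on vectorized outer products. The sketch below shows only this estimation step on synthetic data of my own choosing, not the paper's scheduling algorithm.

```python
# Estimating an unknown bilinear reward matrix A from noisy observations
# r = u^T A v + noise via ridge regression on vec(A) (synthetic data;
# estimation step only, not a scheduling algorithm).
import numpy as np

rng = np.random.default_rng(4)
d1, d2, rounds = 3, 3, 500
A_true = rng.normal(size=(d1, d2))
V = np.eye(d1 * d2)                    # ridge-regularized design matrix
b = np.zeros(d1 * d2)

for t in range(rounds):
    u = rng.normal(size=d1)            # job feature
    v = rng.normal(size=d2)            # server feature
    phi = np.outer(u, v).ravel()       # reward is linear in vec(A) via this map
    r = u @ A_true @ v + 0.1 * rng.normal()
    V += np.outer(phi, phi)
    b += r * phi

A_hat = np.linalg.solve(V, b).reshape(d1, d2)
print("estimation error:", np.linalg.norm(A_hat - A_true))
```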
arXiv Detail & Related papers (2021-12-13T00:37:20Z) - Adversarial Option-Aware Hierarchical Imitation Learning [89.92994158193237]
We propose Option-GAIL, a novel method for learning skills over long horizons.
The key idea of Option-GAIL is to model the task hierarchy with options and train the policy via generative adversarial optimization.
Experiments show that Option-GAIL outperforms other counterparts consistently across a variety of tasks.
arXiv Detail & Related papers (2021-06-10T06:42:05Z) - Dynamic Multi-Robot Task Allocation under Uncertainty and Temporal Constraints [52.58352707495122]
We present a multi-robot allocation algorithm that decouples the key computational challenges of sequential decision-making under uncertainty and multi-agent coordination.
We validate our results over a wide range of simulations on two distinct domains: multi-arm conveyor belt pick-and-place and multi-drone delivery dispatch in a city.
arXiv Detail & Related papers (2020-05-27T01:10:41Z) - Learning Algorithms for Minimizing Queue Length Regret [5.8010446129208155]
Packets randomly arrive to a transmitter's queue and wait to be successfully sent to the receiver.
The transmitter's objective is to quickly identify the best channel to minimize the number of packets in the queue over $T$ time slots.
We show that there exists a set of queue-length based policies that can obtain order optimal $O(1)$ queue length regret.
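As a toy rendering of this setting (not the paper's order-optimal queue-length based policy), the sketch below simulates Bernoulli packet arrivals to a single queue and an $\varepsilon$-greedy choice among $K$ channels with unknown success probabilities.

```python
# Toy single-queue, K-channel simulation with an epsilon-greedy transmitter
# (a baseline for illustration, not the paper's queue-length based policy).
import numpy as np

rng = np.random.default_rng(5)
K, T, arrival_p = 3, 5000, 0.4
mu = np.array([0.3, 0.5, 0.7])         # unknown channel success probabilities
succ, tries = np.zeros(K), np.zeros(K)
q = 0                                  # number of packets waiting in the queue

for t in range(T):
    q += int(rng.random() < arrival_p)              # Bernoulli packet arrival
    if q == 0:
        continue
    if rng.random() < 0.05:                         # occasional exploration
        k = rng.integers(K)
    else:
        k = int(np.argmax(succ / np.maximum(tries, 1)))
    ok = rng.random() < mu[k]
    succ[k] += ok
    tries[k] += 1
    q -= int(ok)                                    # packet departs on success

print("final queue length:", q)
```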
arXiv Detail & Related papers (2020-05-11T15:50:56Z)