Related papers: Token Management in Multi-Tenant AI Inference Platforms

URL: http://arxiv.org/abs/2603.00356v1
Date: Fri, 27 Feb 2026 22:44:09 GMT
Title: Token Management in Multi-Tenant AI Inference Platforms
Authors: William J. Cunningham
Abstract summary: Multi-tenant AI inference platforms must balance resource utilization against service-level guarantees under variable demand. We introduce token pools, a control-plane abstraction that represents capacity as explicit entitlements expressed in inference-native units.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-tenant AI inference platforms must balance resource utilization against service-level guarantees under variable demand. Conventional approaches fail to achieve this balance: dedicated endpoints strand capacity on idle models, while rate limits ignore the heterogeneous cost of inference requests. We introduce token pools, a control-plane abstraction that represents inference capacity as explicit entitlements expressed in inference-native units (token throughput, KV cache, concurrency). Unlike rate limits, which govern request admission without regard to execution cost, token pools authorize both admission and autoscaling from the same capacity model, ensuring consistency between what is promised and what is provisioned. The abstraction captures burst modes across multiple dimensions invisible to conventional throttling. Dynamic per-entitlement limits on each burst dimension enable fine-grained control over resource consumption while permitting work-conserving backfill by low-priority traffic. The design supports priority-aware allocation, service tiers with differentiated guarantees, and debt-based fairness mechanisms, all without modifying the underlying inference runtime or cluster scheduler. In experiments on a Kubernetes cluster with vLLM backends, token pools maintain a bounded P99 latency for guaranteed workloads during overload by selectively throttling spot traffic, while a baseline without admission control experiences unbounded latency degradation across all workloads. A second experiment demonstrates debt-based fair-share convergence among elastic workloads with heterogeneous SLO requirements during capacity scarcity.
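
The admission behavior described in the abstract (guaranteed workloads protected during overload, spot traffic backfilling idle capacity and being throttled first) can be illustrated with a minimal sketch. This is not the paper's implementation: the names `Entitlement`, `TokenPool`, `guaranteed_tps`, and `admit` are hypothetical, only the token-throughput dimension is modeled (KV cache, concurrency, autoscaling, and debt-based fairness are omitted), and a tenant with no reservation is assumed to be spot.

```python
from dataclasses import dataclass


@dataclass
class Entitlement:
    """Hypothetical per-tenant capacity grant, in tokens per second."""
    name: str
    guaranteed_tps: int  # assumption: 0 marks a spot tenant with no reservation
    used_tps: int = 0

    @property
    def is_spot(self) -> bool:
        return self.guaranteed_tps == 0


class TokenPool:
    """Toy pool that admits requests against one shared capacity figure."""

    def __init__(self, capacity_tps: int) -> None:
        self.capacity_tps = capacity_tps
        self.entitlements = {}

    def add(self, ent: Entitlement) -> None:
        self.entitlements[ent.name] = ent

    def free_tps(self) -> int:
        used = sum(e.used_tps for e in self.entitlements.values())
        return self.capacity_tps - used

    def admit(self, name: str, cost_tps: int) -> bool:
        ent = self.entitlements[name]
        free = self.free_tps()

        if ent.is_spot:
            # Spot traffic may only backfill capacity that is idle right now.
            if cost_tps > free:
                return False
            ent.used_tps += cost_tps
            return True

        # Guaranteed traffic is bounded by its own entitlement...
        if ent.used_tps + cost_tps > ent.guaranteed_tps:
            return False

        # ...and reclaims capacity from spot tenants when the pool is full.
        shortfall = max(0, cost_tps - free)
        spot = [e for e in self.entitlements.values() if e.is_spot]
        if shortfall > sum(e.used_tps for e in spot):
            return False
        for e in spot:
            take = min(e.used_tps, shortfall)
            e.used_tps -= take
            shortfall -= take

        ent.used_tps += cost_tps
        return True
```

In this sketch, a spot tenant can fill most of an idle pool, and a later request from a guaranteed tenant within its entitlement is still admitted by shrinking the spot tenant's share. A plain rate limit on request count would not express that distinction, since it does not track per-request cost or who holds a reservation.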

Related papers

FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving [13.856291757420012]
Long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) service level violations. We propose FlowPrefill, a TTFT-goodput-optimized serving system that balances execution granularity against scheduling overheads. We show that FlowPrefill improves maximum goodput by up to 5.6× compared to state-of-the-art systems.
arXiv Detail & Related papers (2026-02-18T16:57:45Z)
High-Fidelity Network Management for Federated AI-as-a-Service: Cross-Domain Orchestration [0.12234742322758417]
This paper introduces an assurance-oriented AI management plane based on Tail-Risk Envelopes (TREs). TREs are signed, composable per-domain descriptors that combine deterministic guardrails with rate-latency-impairment models. We show that tenant-level reservations prevent bursty traffic from inflating tail latency under TRE contracts.
arXiv Detail & Related papers (2026-02-17T00:40:04Z)
HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network [50.33808558714122]
Large language model (LLM) inference at the edge can facilitate prompt service responsiveness while protecting user privacy. We propose HALO, a novel framework that can boost distributed LLM inference in lossy edge networks. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions.
arXiv Detail & Related papers (2026-01-16T07:37:23Z)
RepetitionCurse: Measuring and Understanding Router Imbalance in Mixture-of-Experts LLMs under DoS Stress [16.010076395422264]
We show that out-of-distribution prompts can manipulate the routing strategy, which creates computational bottlenecks on certain devices while forcing others to idle. We propose RepetitionCurse, a low-cost black-box strategy to exploit this vulnerability.
arXiv Detail & Related papers (2025-12-30T05:24:26Z)
Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. TCA-Attention achieves a 2.8× speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z)
Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning [52.03884701766989]
Offline reinforcement learning (RL) algorithms typically impose constraints on action selection. We propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. We develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), to perform Q learning with target actions satisfying this constraint.
arXiv Detail & Related papers (2025-11-04T13:42:05Z)
FairBatching: Fairness-Aware Batch Formation for LLM Inference [2.0917668141703207]
This work identifies the root cause of this unfairness: the non-monotonic nature of Time-Between-Tokens (TBT). We propose FairBatching, a novel system that enforces fair resource allocation between prefill and decode tasks.
arXiv Detail & Related papers (2025-10-16T07:43:56Z)
Single Agent Robust Deep Reinforcement Learning for Bus Fleet Control [9.910562011343009]
Bus bunching is a challenge for urban transit due to traffic and passenger demand. We propose a novel single-agent reinforcement learning framework for bus holding control. We show that our modified soft actor-critic achieves more stable and superior performance than benchmarks.
arXiv Detail & Related papers (2025-08-28T13:47:40Z)
Diffusion Predictive Control with Constraints [51.91057765703533]
Diffusion predictive control with constraints (DPCC) is an algorithm for diffusion-based control with explicit state and action constraints. We show through simulations of a robot manipulator that DPCC outperforms existing methods in satisfying novel test-time constraints.
arXiv Detail & Related papers (2024-12-12T15:10:22Z)
Compositional Diffusion-Based Continuous Constraint Solvers [98.1702285470628]
This paper introduces an approach for learning to solve continuous constraint satisfaction problems (CCSP) in robotic reasoning and planning. By contrast, our model, the compositional diffusion continuous constraint solver (Diffusion-CCSP), derives global solutions to CCSPs.
arXiv Detail & Related papers (2023-09-02T15:20:36Z)
Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in wireless networks. We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.