FairBatching: Fairness-Aware Batch Formation for LLM Inference
- URL: http://arxiv.org/abs/2510.14392v1
- Date: Thu, 16 Oct 2025 07:43:56 GMT
- Title: FairBatching: Fairness-Aware Batch Formation for LLM Inference
- Authors: Hongtao Lyu, Boyue Liu, Mingyu Wu, Haibo Chen
- Abstract summary: This work identifies the root cause of this unfairness: the non-monotonic nature of Time-Between-Tokens (TBT) as a scheduling metric. We propose FairBatching, a novel system that enforces fair resource allocation between prefill and decode tasks.
- Score: 2.0917668141703207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language model (LLM) inference systems face a fundamental tension between minimizing Time-to-First-Token (TTFT) latency for new requests and maintaining a high, steady token generation rate (low Time-Per-Output-Token, or TPOT) for ongoing requests. Existing stall-free batching schedulers, such as Sarathi, while effective at preventing decode stalls, introduce significant computational unfairness. They prioritize decode tasks excessively, simultaneously leading to underutilized decode slack and unnecessary prefill queuing delays, which collectively degrade the system's overall quality of service (QoS). This work identifies the root cause of this unfairness: the non-monotonic nature of Time-Between-Tokens (TBT) as a scheduling metric and the rigid decode-prioritizing policy that fails to adapt to dynamic workload bursts. We therefore propose FairBatching, a novel LLM inference scheduler that enforces fair resource allocation between prefill and decode tasks. It features an adaptive batch capacity determination mechanism, which dynamically adjusts the computational budget to improve GPU utilization without triggering SLO violations. Its fair and dynamic batch formation algorithm breaks away from the decode-prioritizing paradigm, allowing computation resources to be reclaimed from bursting decode tasks to serve prefill surges, achieving global fairness. Furthermore, FairBatching provides a novel load estimation method, enabling more effective coordination with upper-level schedulers. Implemented and evaluated on realistic traces, FairBatching significantly reduces TTFT tail latency by up to 2.29x while robustly maintaining TPOT SLOs, achieving an overall 20.0% improvement in single-node capacity and a 54.3% improvement in cluster-level capacity.
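To make the abstract's batch formation idea concrete, here is a minimal sketch of a fairness-aware batch formation step: decode tasks are admitted only when their SLO slack is nearly exhausted, and the remaining token budget is reclaimed for queued prefill chunks rather than unconditionally prioritizing decodes. This is an illustrative reconstruction, not the paper's implementation; all names (`Request`, `form_batch`, the slack threshold heuristic) are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    rid: int
    pending_prefill_tokens: int = 0   # prompt tokens still to be prefilled
    tbt_slack_ms: float = 0.0         # time left before the TPOT SLO is violated


def form_batch(decode_reqs: List[Request],
               prefill_reqs: List[Request],
               token_budget: int,
               cost_per_token_ms: float) -> Dict[str, list]:
    """Illustrative fairness-aware batch formation step.

    Decode tasks join the batch only if skipping them this round would
    risk an SLO violation; leftover budget goes to prefill chunks.
    """
    batch = {"decode": [], "prefill": []}
    budget = token_budget

    # 1. Admit decode tasks whose slack is within roughly one batch duration.
    for r in sorted(decode_reqs, key=lambda r: r.tbt_slack_ms):
        if budget <= 0:
            break
        if r.tbt_slack_ms <= cost_per_token_ms * token_budget:
            batch["decode"].append(r.rid)
            budget -= 1  # one output token per decode task per step
    # 2. Reclaim the remaining budget for prefill chunks (oldest first).
    for r in prefill_reqs:
        if budget <= 0:
            break
        chunk = min(r.pending_prefill_tokens, budget)
        batch["prefill"].append((r.rid, chunk))
        budget -= chunk
    return batch
```

Under this heuristic, a decode request with ample slack is deliberately deferred, freeing its slot for a prefill surge; a Sarathi-style decode-prioritizing scheduler would have admitted it unconditionally.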
Related papers
- FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving [13.856291757420012]
Long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) service-level violations. We propose FlowPrefill, a TTFT-goodput-optimized serving system that balances execution granularity against scheduling overheads. We show that FlowPrefill improves maximum goodput by up to 5.6× compared to state-of-the-art systems.
arXiv Detail & Related papers (2026-02-18T16:57:45Z) - PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models [3.088398451509366]
We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering-phase SLO attainment.
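The "token pacing" idea in the PASCAL summary can be sketched as deliberately throttling answer-phase token emission to a minimum interval, so the freed GPU slack can serve queued reasoning-phase requests while the user-visible rate still meets the SLO. This is a generic sketch of the pacing concept under that assumption, not PASCAL's actual mechanism; `paced_emit` and its parameters are illustrative.

```python
import time
from typing import Callable, Iterable


def paced_emit(tokens: Iterable[str],
               min_interval_s: float,
               emit: Callable[[str], None]) -> None:
    """Emit answer tokens no faster than min_interval_s apart.

    Pacing the answering phase caps its compute demand, leaving headroom
    for reasoning-phase requests without violating the per-token SLO.
    """
    last = 0.0
    for tok in tokens:
        if last:  # no delay before the very first token
            wait = min_interval_s - (time.monotonic() - last)
            if wait > 0:
                time.sleep(wait)
        emit(tok)
        last = time.monotonic()
```

In a real serving loop the "sleep" would instead yield the batch slot back to the scheduler rather than block a thread.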
arXiv Detail & Related papers (2026-02-12T03:40:44Z) - TS-DP: Reinforcement Speculative Decoding For Temporal Adaptive Diffusion Policy Acceleration [64.32072516882947]
Diffusion Policy excels in embodied control but suffers from high inference latency and computational cost. We propose the Temporal-aware Reinforcement-based Speculative Diffusion Policy (TS-DP). TS-DP achieves up to 4.17 times faster inference with over 94% accepted drafts, reaching an inference frequency of 25 Hz.
arXiv Detail & Related papers (2025-12-13T07:53:14Z) - Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling [55.026048429595384]
Test-time scaling improves the inference performance of Large Language Models (LLMs) but also incurs substantial computational costs. We propose SeerSC, a dynamic self-consistency framework that simultaneously improves token efficiency and latency.
arXiv Detail & Related papers (2025-11-12T13:57:43Z) - Algorithms for Dynamic Scheduling in Manufacturing, Towards Digital Factories: Improving Deadline Feasibility and Responsiveness via Temporal Networks [0.0]
Traditional deterministic schedules break down when reality deviates from nominal plans. This thesis combines offline constraint programming with online temporal-network execution to create schedules that remain feasible under worst-case uncertainty.
arXiv Detail & Related papers (2025-10-16T17:28:25Z) - Slim Scheduler: A Runtime-Aware RL and Scheduler System for Efficient CNN Inference [0.0]
Slim Scheduler integrates a Proximal Policy Optimization (PPO) reinforcement learning policy with algorithmic, greedy schedulers to coordinate distributed inference for slimmable models. This hierarchical design reduces search-space complexity, mitigates overfitting to specific hardware, and balances efficiency and throughput.
arXiv Detail & Related papers (2025-10-10T05:44:05Z) - From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill [8.04085002818041]
Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-tokens (TBT). Modern serving systems adopt stall-free scheduling techniques such as chunked prefill. We propose layered prefill, a new scheduling paradigm that treats transformer layer groups as the primary scheduling unit.
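The contrast between the two scheduling units in the layered-prefill summary can be sketched as follows: chunked prefill splits a prompt along the token axis, while layered prefill splits the forward pass along the layer axis, interleaving decode steps either way. Function names and the interleaving pattern are illustrative assumptions, not taken from the paper's code.

```python
from typing import Iterator, Tuple


def chunked_prefill_schedule(prompt_len: int, chunk: int) -> Iterator[Tuple]:
    """Yield token-range prefill chunks, each followed by a decode step."""
    for start in range(0, prompt_len, chunk):
        yield ("prefill_tokens", start, min(start + chunk, prompt_len))
        yield ("decode_step",)


def layered_prefill_schedule(num_layers: int, group: int) -> Iterator[Tuple]:
    """Yield layer-group prefill units, each followed by a decode step.

    The scheduling unit is a group of transformer layers (applied to the
    whole prompt), not a slice of the prompt's tokens.
    """
    for start in range(0, num_layers, group):
        yield ("prefill_layers", start, min(start + group, num_layers))
        yield ("decode_step",)
```

The practical difference is what state each unit touches: a token chunk reruns all layers for part of the prompt, while a layer group runs part of the network for the whole prompt, changing how KV-cache and weight traffic overlap with decodes.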
arXiv Detail & Related papers (2025-10-09T10:41:35Z) - Intra-request branch orchestration for efficient LLM reasoning [52.68946975865865]
Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms to improve accuracy on complex tasks. Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors. We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions.
arXiv Detail & Related papers (2025-09-29T15:52:08Z) - End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost [53.25965863436039]
Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs. We propose ZeroQAT, a zeroth-order-optimization-based QAT framework that supports both weight and activation quantization. Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory.
arXiv Detail & Related papers (2025-08-21T01:18:27Z) - CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency. We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances the problem-solving ability of large language models, but incurs substantial inference cost due to long autoregressive trajectories. We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - Digital Twin-Assisted Federated Learning with Blockchain in Multi-tier Computing Systems [67.14406100332671]
In Industry 4.0 systems, resource-constrained edge devices engage in frequent data interactions.
This paper proposes a digital twin (DT)-assisted federated learning (FL) scheme.
The efficacy of our proposed cooperative interference-based FL process has been verified through numerical analysis.
arXiv Detail & Related papers (2024-11-04T17:48:02Z) - Palantir: Towards Efficient Super Resolution for Ultra-high-definition Live Streaming [29.567573296006515]
Palantir is the first neural-enhanced UHD live streaming system with fine-grained patch-level scheduling.
Palantir incurs a negligible scheduling latency accounting for less than 5.7% of the end-to-end latency.
arXiv Detail & Related papers (2024-08-12T13:48:06Z) - Machine Learning for Fairness-Aware Load Shedding: A Real-Time Solution via Identifying Binding Constraints [1.3345486884341395]
We present an efficient machine learning algorithm to enable millisecond-level computation for the optimization-based load shedding problem. Numerical studies on both a 3-bus toy example and a realistic RTS-GMLC system have demonstrated the validity and efficiency of the proposed algorithm.
arXiv Detail & Related papers (2024-07-25T08:47:11Z) - Client Orchestration and Cost-Efficient Joint Optimization for NOMA-Enabled Hierarchical Federated Learning [55.49099125128281]
We propose a non-orthogonal multiple access (NOMA) enabled HFL system under semi-synchronous cloud model aggregation.
We show that the proposed scheme outperforms the considered benchmarks regarding HFL performance improvement and total cost reduction.
arXiv Detail & Related papers (2023-11-03T13:34:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.