Related papers: FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving

FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving

URL: http://arxiv.org/abs/2602.16603v1
Date: Wed, 18 Feb 2026 16:57:45 GMT
Title: FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving
Authors: Chia-chi Hsieh, Zan Zong, Xinyang Chen, Jianjiang Li, Jidong Zhai, Lijie Wen,
Abstract summary: Long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) service level violations.<n>We propose FlowPrefill, a TTFT-goodput-optimized serving system that balances execution granularity against scheduling overheads.<n>We show that FlowPrefill improves maximum goodput by up to 5.6$times$ compared to state-of-the-art systems.
Score: 13.856291757420012
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The growing demand for large language models (LLMs) requires serving systems to handle many concurrent requests with diverse service level objectives (SLOs). This exacerbates head-of-line (HoL) blocking during the compute-intensive prefill phase, where long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) SLO violations. While chunked prefill enables interruptibility, it introduces an inherent trade-off between responsiveness and throughput: reducing chunk size improves response latency but degrades computational efficiency, whereas increasing chunk size maximizes throughput but exacerbates blocking. This necessitates an adaptive preemption mechanism. However, dynamically balancing execution granularity against scheduling overheads remains a key challenge. In this paper, we propose FlowPrefill, a TTFT-goodput-optimized serving system that resolves this conflict by decoupling preemption granularity from scheduling frequency. To achieve adaptive prefill scheduling, FlowPrefill introduces two key innovations: 1) Operator-Level Preemption, which leverages operator boundaries to enable fine-grained execution interruption without the efficiency loss associated with fixed small chunking; and 2) Event-Driven Scheduling, which triggers scheduling decisions only upon request arrival or completion events, thereby supporting efficient preemption responsiveness while minimizing control-plane overhead. Evaluation on real-world production traces shows that FlowPrefill improves maximum goodput by up to 5.6$\times$ compared to state-of-the-art systems while satisfying heterogeneous SLOs.

Related papers

HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network [50.33808558714122]
Large language models' (LLMs) inference at the edge can facilitate prompt service responsiveness while protecting user privacy.<n>We propose HALO, a novel framework that can boost the distributed LLM inference in lossy edge network.<n> Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions.
arXiv Detail & Related papers (2026-01-16T07:37:23Z)
AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving [6.505016440664893]
AugServe is an efficient inference framework designed to reduce queueing latency and enhance effective throughput for augmented large language models (LLMs)<n> Experimental results show that AugServe achieves 4.7-33.1x and 3.3-13.2x higher effective throughput than vLLM and InferCept, while reducing fluctuating time-to-first-token (TTFT) by up to 96.3% and 95.0%, respectively.
arXiv Detail & Related papers (2025-12-03T17:49:38Z)
FairBatching: Fairness-Aware Batch Formation for LLM Inference [2.0917668141703207]
This work identifies the root cause of this unfairness: the non-monotonic nature of Time--Tokens (TBT)<n>We propose Fair the Prioritizing, a novel system that enforces fair resource allocation between fill and decode tasks.
arXiv Detail & Related papers (2025-10-16T07:43:56Z)
Intra-request branch orchestration for efficient LLM reasoning [52.68946975865865]
Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms to improve accuracy on complex tasks.<n>Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors.<n>We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions.
arXiv Detail & Related papers (2025-09-29T15:52:08Z)
HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling [19.154782641360253]
Modern large language model (LLM) serving systems face challenges from highly variable requests with diverse lengths, priorities, and stage-specific service-level objectives (SLOs)<n>We present HyperFlexis, a unified LLM serving system that integrates algorithmic and system-level innovations to jointly optimize scheduling and scaling under multiple SLOs.
arXiv Detail & Related papers (2025-08-21T18:40:20Z)
CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency.<n>We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z)
Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference [54.53508601749513]
We propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework.<n>To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features.<n>Results show that TOFC achieves up to 52% reduction in data transmission overhead and 63% reduction in system latency.
arXiv Detail & Related papers (2025-03-17T08:37:22Z)
Priority-Aware Preemptive Scheduling for Mixed-Priority Workloads in MoE Inference [4.7730970530715835]
Large Language Models have revolutionized natural language processing, yet serving them efficiently in data centers remains challenging.<n>We introduce QLLM, a novel inference system designed for Mixture of Experts (MoE) models.<n>QLLM enables expert-level preemption, deferring BE job execution while minimizing LS time-to-first-token (TTFT)
arXiv Detail & Related papers (2025-03-12T11:56:01Z)
Digital Twin-Assisted Federated Learning with Blockchain in Multi-tier Computing Systems [67.14406100332671]
In Industry 4.0 systems, resource-constrained edge devices engage in frequent data interactions. This paper proposes a digital twin (DT) and federated digital twin (FL) scheme. The efficacy of our proposed cooperative interference-based FL process has been verified through numerical analysis.
arXiv Detail & Related papers (2024-11-04T17:48:02Z)
ALISE: Accelerating Large Language Model Serving with Speculative Scheduling [7.367068885621016]
Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI) In this paper, we propose a new efficient LLM inference serving framework, named ALISE. We show that ALISE improves the throughput of inference serving by up to 1.8x and 2.1x under the same latency constraint on the Alpaca and ShareGPT datasets, respectively.
arXiv Detail & Related papers (2024-10-31T00:58:11Z)
Don't Stop Me Now: Embedding Based Scheduling for LLMs [22.099820814682513]
Size-based scheduling algorithms like Shortest Remaining Process Time (SRPT) aim to reduce average request completion time. We propose a prediction-based SRPT variant with limited preemption designed to account for memory overhead in LLM systems.
arXiv Detail & Related papers (2024-10-01T19:51:07Z)
Client Orchestration and Cost-Efficient Joint Optimization for NOMA-Enabled Hierarchical Federated Learning [55.49099125128281]
We propose a non-orthogonal multiple access (NOMA) enabled HFL system under semi-synchronous cloud model aggregation. We show that the proposed scheme outperforms the considered benchmarks regarding HFL performance improvement and total cost reduction.
arXiv Detail & Related papers (2023-11-03T13:34:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.