From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill
- URL: http://arxiv.org/abs/2510.08055v1
- Date: Thu, 09 Oct 2025 10:41:35 GMT
- Title: From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill
- Authors: Gunjun Lee, Jiwon Kim, Jaiyoung Park, Younjoo Lee, Jung Ho Ahn
- Abstract summary: Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-token (TBT). Modern serving systems adopt stall-free scheduling techniques such as chunked prefill. We propose layered prefill, a new scheduling paradigm that treats transformer layer groups as the primary scheduling unit.
- Score: 8.04085002818041
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-token (TBT) while maximizing throughput under fixed compute, memory, and interconnect budgets. Modern serving systems adopt stall-free scheduling techniques such as chunked prefill, which splits long prompt processing along the token dimension and interleaves prefill with ongoing decode iterations. While effective at stabilizing TBT, chunked prefill incurs substantial overhead in Mixture-of-Experts (MoE) models: redundant expert weight loads increase memory traffic by up to 39% and inflate energy consumption. We propose layered prefill, a new scheduling paradigm that treats transformer layer groups as the primary scheduling unit. By vertically partitioning the model into contiguous layer groups and interleaving prefill and decode across the groups, layered prefill sustains stall-free decoding while eliminating chunk-induced MoE weight reloads. It reduces off-chip bandwidth demand, lowering TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by up to 22%. Evaluations show that layered prefill consistently improves the TTFT-TBT Pareto frontier over chunked prefill, reducing expert-load traffic and energy cost while maintaining stall-free decoding. Overall, shifting the scheduling axis from tokens to layers unlocks a new operating regime for high-efficiency, energy-aware LLM serving in co-located environments.
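As a concrete reading of the paradigm, the sketch below interleaves full decode steps between the layer groups of a single prefill, so the whole prompt crosses each layer exactly once (each MoE expert's weights are fetched once per layer for the prompt) while decode stalls stay bounded by one group's latency. The group count, helper functions (`run_layers`, `decode_step`), and timing granularity are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of layered prefill: prefill advances one layer group at a
# time, with a full decode iteration interleaved between groups.
NUM_LAYERS, NUM_GROUPS = 32, 4
GROUP_SIZE = NUM_LAYERS // NUM_GROUPS
GROUPS = [list(range(g, g + GROUP_SIZE)) for g in range(0, NUM_LAYERS, GROUP_SIZE)]

def run_layers(layers, tokens):
    """Placeholder for executing `layers` over all `tokens` on the device."""
    return tokens  # hidden states would be transformed here

def decode_step(decode_batch):
    """Placeholder for one full decode iteration (all layers, one token per request)."""
    return [seq_len + 1 for seq_len in decode_batch]

def layered_prefill(prompt_tokens, decode_batch):
    hidden = prompt_tokens
    for layers in GROUPS:
        # Prefill: the entire prompt passes through this layer group once,
        # so each expert weight in the group is loaded a single time.
        hidden = run_layers(layers, hidden)
        # Stall-free decode: ongoing requests emit a token between groups.
        decode_batch = decode_step(decode_batch)
    return hidden, decode_batch

hidden, batch = layered_prefill(list(range(4096)), [128, 256])
```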
Related papers
- FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving [13.856291757420012]
Long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) service level violations. We propose FlowPrefill, a TTFT-goodput-optimized serving system that balances execution granularity against scheduling overheads. We show that FlowPrefill improves maximum goodput by up to 5.6x compared to state-of-the-art systems.
arXiv Detail & Related papers (2026-02-18T16:57:45Z)
- Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference [17.27010833526918]
Staggered Batch Scheduling (SBS) buffers requests to form optimal execution batches. A Load-Aware Global Allocation strategy balances computational load across DP units for both prefill and decode phases. Our system reduces TTFT by 30%-40% and improves throughput by 15%-20% compared to state-of-the-art immediate scheduling baselines.
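For context, a minimal sketch of the batch-buffering idea (not SBS's actual algorithm, and without its load-aware DP allocation): hold incoming requests briefly and dispatch when either a target batch size is reached or the oldest request has waited too long. `target_size` and `max_wait_s` are hypothetical tuning knobs.

```python
import time

def maybe_dispatch(buffer, target_size, max_wait_s):
    """Return a batch to execute, or None to keep buffering.
    Each buffer entry is (arrival_time, request)."""
    if not buffer:
        return None
    oldest_wait = time.monotonic() - buffer[0][0]
    if len(buffer) >= target_size or oldest_wait >= max_wait_s:
        batch = [req for _, req in buffer]
        buffer.clear()
        return batch
    return None

buf = [(time.monotonic(), "req-0")]
batch = maybe_dispatch(buf, target_size=8, max_wait_s=0.005)
```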
arXiv Detail & Related papers (2025-12-18T03:45:05Z) - Why Should the Server Do It All?: A Scalable, Versatile, and Model-Agnostic Framework for Server-Light DNN Inference over Massively Distributed Clients via Training-Free Intermediate Feature Compression [6.932768187544348]
We introduce SLICER, a retraining-free, architecture-agnostic framework that compresses IFs to reduce both communication and server load in split computing.<n>Across standard vision and LLM workloads, SLICER reduces uplink volume by up to 10x and server GPU time by up to 4.4x.
arXiv Detail & Related papers (2025-11-03T08:44:13Z) - FairBatching: Fairness-Aware Batch Formation for LLM Inference [2.0917668141703207]
This work identifies the root cause of this unfairness: the non-monotonic nature of Time--Tokens (TBT)<n>We propose Fair the Prioritizing, a novel system that enforces fair resource allocation between fill and decode tasks.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
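For intuition, below is a generic symmetric post-training ternarization baseline (Ternary Weight Networks style): each weight matrix is approximated as alpha * t with t in {-1, 0, +1}. PT$^2$-LLM's Asymmetric Ternary Quantizer and its two-stage refinement are not reproduced here; the 0.7 threshold heuristic is a common convention, not the paper's choice.

```python
import torch

def ternarize(w: torch.Tensor):
    """Approximate w with alpha * t, where t is in {-1, 0, +1}."""
    delta = 0.7 * w.abs().mean()          # sparsity threshold (heuristic)
    t = torch.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    nz = t != 0
    alpha = w[nz].abs().mean() if nz.any() else w.new_tensor(0.0)
    return alpha, t

w = torch.randn(1024, 1024)
alpha, t = ternarize(w)
rel_err = (w - alpha * t).norm() / w.norm()
print(f"scale={alpha:.3f}, zero fraction={(t == 0).float().mean():.2f}, rel. error={rel_err:.3f}")
```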
arXiv Detail & Related papers (2025-08-15T07:49:22Z)
- CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency. We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-01T18:12:21Z)
- Optimal Scheduling Algorithms for LLM Inference: Theory and Practice [6.043830060363904]
This paper develops a theoretical framework that models routing and scheduling in Large Language Model inference systems. We identify two key design principles, optimal tiling and dynamic resource allocation, that are essential for achieving high throughput. We show that the Resource-Aware Dynamic (RAD) scheduler achieves throughput optimality under mild conditions.
arXiv Detail & Related papers (2024-12-25T10:11:31Z)
- Efficiently Serving Large Multimodal Models Using EPD Disaggregation [24.05805398635414]
We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our approach decouples these steps, unlocking new opportunities and optimizations. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15x lower peak memory utilization), batch sizes (up to 22x larger), 10x more images per request, and 2.2x larger KV caches.
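A toy pipeline can make the stage separation concrete: each stage runs on its own worker and hands requests downstream via a queue, standing in for dedicated resources. The stage bodies and queue-based handoff here are illustrative placeholders, not the paper's system.

```python
import queue
import threading

def stage(inbox, outbox, work):
    """Pull requests, apply this stage's work, pass results downstream."""
    while True:
        req = inbox.get()
        if req is None:                      # shutdown sentinel
            if outbox is not None:
                outbox.put(None)
            return
        out = work(req)
        if outbox is not None:
            outbox.put(out)

q_enc, q_pre, q_dec = queue.Queue(), queue.Queue(), queue.Queue()
workers = [
    threading.Thread(target=stage, args=(q_enc, q_pre, lambda r: r + ["image_features"])),  # encode
    threading.Thread(target=stage, args=(q_pre, q_dec, lambda r: r + ["kv_cache"])),        # prefill
    threading.Thread(target=stage, args=(q_dec, None, print)),                              # decode
]
for t in workers:
    t.start()
q_enc.put(["request"])
q_enc.put(None)                              # drain and shut down the pipeline
for t in workers:
    t.join()
```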
arXiv Detail & Related papers (2024-10-01T06:23:17Z)
- LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management [23.431794605498084]
We propose LayerKV, a simple yet effective plug-in method that reduces TTFT without requiring additional hardware or compromising output performance.
LayerKV introduces layer-wise KV block allocation, management, and offloading for fine-grained control over system memory.
Comprehensive evaluations on representative models, ranging from 7B to 70B parameters, across various GPU configurations, demonstrate that LayerKV improves TTFT latency by up to 69x and reduces SLO violation rates by 28.7%.
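A hypothetical sketch of layer-wise KV block management in this spirit: KV blocks are tracked per layer, and blocks are offloaded to host memory when a device budget is exceeded. The block granularity and eviction heuristic below are assumptions, not LayerKV's actual policy.

```python
class LayerwiseKVCache:
    def __init__(self, num_layers: int, gpu_budget_blocks: int):
        self.gpu = {l: [] for l in range(num_layers)}   # on-device KV blocks
        self.cpu = {l: [] for l in range(num_layers)}   # offloaded KV blocks
        self.budget = gpu_budget_blocks

    def append(self, layer: int, block) -> None:
        self.gpu[layer].append(block)
        while sum(map(len, self.gpu.values())) > self.budget:
            self._offload_one()

    def _offload_one(self) -> None:
        # Evict the oldest block from the layer holding the most device
        # blocks (a simple heuristic chosen for illustration only).
        layer = max(self.gpu, key=lambda l: len(self.gpu[l]))
        self.cpu[layer].append(self.gpu[layer].pop(0))
```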
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMA, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate the overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, an input-adaptive feed-forward skipping strategy.
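One way to picture an input-adaptive skipping rule is shown below: skip a block's FFN when the attention update barely moved the hidden state. The cosine-similarity trigger and threshold are assumptions for illustration; FFN-SkipLLM's actual criterion may differ.

```python
import torch
import torch.nn.functional as F

def block_with_ffn_skip(hidden, attn, ffn, tau=0.999):
    post_attn = hidden + attn(hidden)        # residual attention update
    sim = F.cosine_similarity(hidden.flatten(), post_attn.flatten(), dim=0)
    if sim > tau:
        return post_attn                     # hidden state barely changed: skip FFN
    return post_attn + ffn(post_attn)        # otherwise run the FFN as usual

# Toy usage with stand-in modules:
attn = torch.nn.Linear(16, 16)
ffn = torch.nn.Linear(16, 16)
out = block_with_ffn_skip(torch.randn(16), attn, ffn)
```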
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$, an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by 1.47x on a single TPUv5e device.
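The two-phase idea can be sketched as follows: score all columns with a cheap low-rank proxy, keep an oversampled candidate set, then compute exact scores only on those columns. The truncated-SVD proxy and the `overshoot` factor below are assumptions; HiRE's compression scheme and DA-TOP-$k$ operator differ.

```python
import torch

def approx_then_exact_topk(x, W, A, B, k, overshoot=4):
    approx_scores = (x @ A) @ B              # cheap proxy scores for all n columns
    cand = torch.topk(approx_scores, overshoot * k).indices
    exact = x @ W[:, cand]                   # full computation only on the candidates
    best = torch.topk(exact, k).indices
    return cand[best]

d, n, r, k = 512, 8192, 32, 16
W = torch.randn(d, n)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A, B = U[:, :r] * S[:r], Vh[:r]              # rank-r proxy of W
idx = approx_then_exact_topk(torch.randn(d), W, A, B, k)
```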
arXiv Detail & Related papers (2024-02-02T21:25:46Z)
- From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers [52.199303258423306]
We propose a novel density loss that encourages higher activation sparsity in pre-trained models.
Our proposed method, DEFT, can consistently reduce activation density by up to 44.94% on RoBERTa-Large and by 53.19% (encoder density) and 90.60% (decoder density) on Flan-T5-XXL.
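For reference, the generic sparsity-inducing form such density losses build on is an L1 penalty on post-activation FFN outputs, which pushes more units toward zero. DEFT's exact loss is not reproduced here; `beta` and the activation-collection step are illustrative.

```python
import torch

def density_penalty(ffn_activations, beta=1e-4):
    """ffn_activations: list of tensors captured after the FFN nonlinearity."""
    return beta * sum(a.abs().mean() for a in ffn_activations)

# During finetuning (hypothetical names):
#   loss = task_loss + density_penalty(captured_activations)
```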
arXiv Detail & Related papers (2023-11-27T12:59:52Z)
- TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models [52.454274602380124]
Diffusion models heavily depend on the time-step $t$ to achieve satisfactory multi-round denoising.
We propose a Temporal Feature Maintenance Quantization (TFMQ) framework building upon a Temporal Information Block.
Powered by the pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features.
arXiv Detail & Related papers (2023-11-03T13:34:44Z)
- Client Orchestration and Cost-Efficient Joint Optimization for NOMA-Enabled Hierarchical Federated Learning [55.49099125128281]
We propose a non-orthogonal multiple access (NOMA)-enabled HFL system under semi-synchronous cloud model aggregation.
We show that the proposed scheme outperforms the considered benchmarks in both HFL performance improvement and total cost reduction.
arXiv Detail & Related papers (2023-11-03T13:34:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.