Adaptive Rescheduling in Prefill-Decode Disaggregated LLM Inference
- URL: http://arxiv.org/abs/2510.13668v1
- Date: Wed, 15 Oct 2025 15:29:08 GMT
- Title: Adaptive Rescheduling in Prefill-Decode Disaggregated LLM Inference
- Authors: Zhibin Wang, Zetao Hong, Xue Li, Zibo Wang, Shipeng Li, Qingkai Meng, Qing Wang, Chengying Huan, Rong Gu, Sheng Zhong, Chen Tian,
- Abstract summary: Large Language Model (LLM) inference has emerged as a fundamental paradigm.<n>In this paper, we propose ARES, an adaptive decoding rescheduling system powered by length prediction to anticipate future workloads.
- Score: 29.19884207604898
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Model (LLM) inference has emerged as a fundamental paradigm. In real-world scenarios, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as PD disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in SLO violations and OOM failures under evolving decode workloads. In this paper, we propose ARES, an adaptive decoding rescheduling system powered by length prediction to anticipate future workloads. Our core contributions include: (1) A lightweight and continuous LLM-native prediction method that leverages LLM hidden state to model remaining generation length with high precision (reducing MAE by 49.42%) and low overhead (cutting predictor parameters by 93.28%); (2) A rescheduling solution in decode phase with : A dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 TPOT by 74.77% and achieving up to 2.24 times higher goodput.
Related papers
- Predicting LLM Output Length via Entropy-Guided Representations [13.351384070796747]
We introduce a lightweight framework that reuses the main model's internal hidden states for efficient length prediction.<n>Our framework features two core components: 1) Entropy-Guided Token Pooling (EGTP), which uses on-the-fly activations and token entropy for highly accurate static prediction.
arXiv Detail & Related papers (2026-02-12T10:49:04Z) - Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios [76.85739138203014]
We present SpecFormer, a novel architecture that accelerates unidirectional and attention mechanisms.<n>We demonstrate that SpecFormer achieves lower training demands and reduced computational costs.
arXiv Detail & Related papers (2025-11-25T14:20:08Z) - Harli: SLO-Aware Co-location of LLM Inference and PEFT-based Finetuning on Model-as-a-Service Platforms [33.64527903547734]
Harli improves the finetune throughput by 46.2% on average (up to over state-of-the-art serving systems)<n>Harli improves the finetune throughput by 46.2% on average (up to over state-of-the-art serving systems)
arXiv Detail & Related papers (2025-11-13T05:58:52Z) - Language Ranker: A Lightweight Ranking framework for LLM Decoding [70.01564145836129]
This paper conceptualizes the decoding process as analogous to the ranking stage in recommendation pipelines.<n>Motivated by this insight, we propose Language Ranker, a novel framework that introduces a lightweight module to rerank candidate responses.<n> Experiments show that Language Ranker achieves performance comparable to large-scale reward models, while requiring only 0.5M additional parameters.
arXiv Detail & Related papers (2025-10-23T17:56:46Z) - Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models [74.15250326312179]
Diffusion Large Language Models offer efficient parallel generation and capable global modeling.<n>The dominant application ofDLLMs is hindered by the need for a statically predefined generation length.<n>We introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion.
arXiv Detail & Related papers (2025-08-01T17:56:07Z) - Quantizing Large Language Models for Code Generation: A Differentiated Replication [51.85505914274633]
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, to automatically implement requirements described in natural language.<n>LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint.<n>New frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70%.
arXiv Detail & Related papers (2025-03-10T09:26:08Z) - Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective [7.408649506385476]
The escalating scale and cost of Large Language Models (LLMs) training necessitate accurate pre-training prediction of downstream task performance.<n>Current prediction methods lack accuracy and reliability.<n>We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction.
arXiv Detail & Related papers (2025-02-24T15:44:57Z) - LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs.<n>We introduce LLM2, a novel framework that combines an LLM with a process-based verifier.<n>LLMs2 is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z) - Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.<n>PMPD achieves 1.4$-$12.2$times$ speedup in matrix-vector multiplications over fp16 models.<n>Our approach delivers a throughput gain of 3.8$-$8.0$times$ over fp16 models and up to 1.54$times$ over uniform quantization approaches.
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73$times$ to 1.96$times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z) - Characterizing the Accuracy -- Efficiency Trade-off of Low-rank Decomposition in Language Models [1.401463252785724]
Low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale.
We formalize the low-rank decomposition design space and show that the decomposition design space is enormous.
Our results show that we can achieve a 9% model size reduction with minimal accuracy drops.
arXiv Detail & Related papers (2024-05-10T17:40:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.