AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving
- URL: http://arxiv.org/abs/2512.04013v1
- Date: Wed, 03 Dec 2025 17:49:38 GMT
- Title: AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving
- Authors: Ying Wang, Zhen Jin, Jiexiong Xu, Wenhai Lin, Yiquan Chen, Wenzhi Chen
- Abstract summary: AugServe is an efficient inference framework designed to reduce queueing latency and enhance effective throughput for augmented large language models (LLMs). Experimental results show that AugServe achieves 4.7-33.1x and 3.3-13.2x higher effective throughput than vLLM and InferCept, while reducing time-to-first-token (TTFT) by up to 96.3% and 95.0%, respectively.
- Score: 6.505016440664893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As augmented large language models (LLMs) with external tools become increasingly popular in web applications, improving augmented LLM inference serving efficiency and optimizing service-level objectives (SLOs) are critical for enhancing user experience. To achieve this, inference systems must maximize request handling within latency constraints, referred to as increasing effective throughput. However, existing systems face two major challenges: (i) reliance on first-come-first-served (FCFS) scheduling causes severe head-of-line blocking, leading to queuing delays that exceed the SLOs for many requests; and (ii) a static batch token limit that fails to adapt to fluctuating loads and hardware conditions. Both factors degrade effective throughput and service quality. This paper presents AugServe, an efficient inference framework designed to reduce queueing latency and enhance effective throughput for augmented LLM inference services. The core idea of AugServe is a two-stage adaptive request scheduling strategy. Specifically, AugServe combines the inference features of augmented LLM requests to optimize the order of scheduling decisions (stage I). These decisions are continuously refined with runtime information (stage II), adapting to both request characteristics and system capabilities. In addition, AugServe dynamically adjusts the token batching mechanism based on hardware status and real-time load, further enhancing throughput performance. Experimental results show that AugServe achieves 4.7-33.1x and 3.3-13.2x higher effective throughput than vLLM and InferCept, while reducing time-to-first-token (TTFT) by up to 96.3% and 95.0%, respectively.
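The two-stage scheduling idea in the abstract can be illustrated with a minimal sketch. Everything below is a hypothetical illustration, not AugServe's actual implementation: `stage_one_priority` stands in for the stage-I ordering over request features, `schedule_batch` for batch formation under a token budget, and `adapt_token_limit` for the stage-II runtime refinement of that budget; all names, heuristics, and constants are assumptions.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    prompt_tokens: int
    est_decode_tokens: int

def stage_one_priority(req):
    # Stage I (illustrative): rank by estimated total work, approximating
    # shortest-job-first to curb the head-of-line blocking FCFS suffers from.
    return req.prompt_tokens + req.est_decode_tokens

def schedule_batch(queue, batch_token_limit):
    # queue is a heap of (priority, req_id, Request) tuples.
    # Pop requests in priority order until the token budget is exhausted.
    batch, used = [], 0
    while queue and used + queue[0][2].prompt_tokens <= batch_token_limit:
        _, _, req = heapq.heappop(queue)
        batch.append(req)
        used += req.prompt_tokens
    return batch, used

def adapt_token_limit(limit, iter_latency_ms, slo_ms, lo=512, hi=8192):
    # Stage II (illustrative): grow the batch token budget while the last
    # iteration met the latency SLO, shrink it otherwise, within bounds.
    new = int(limit * 1.1) if iter_latency_ms < slo_ms else int(limit * 0.7)
    return max(lo, min(hi, new))
```

A caller would heapify pending requests by `stage_one_priority`, form a batch with `schedule_batch`, then feed the observed iteration latency back through `adapt_token_limit` before the next iteration.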
Related papers
- FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving [13.856291757420012]
Long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) service level violations. We propose FlowPrefill, a TTFT-goodput-optimized serving system that balances execution granularity against scheduling overheads. We show that FlowPrefill improves maximum goodput by up to 5.6x compared to state-of-the-art systems.
arXiv Detail & Related papers (2026-02-18T16:57:45Z) - AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering [52.67783579040657]
AceGRPO is a machine learning system that prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines.
arXiv Detail & Related papers (2026-02-08T10:55:03Z) - EWSJF: An Adaptive Scheduler with Hybrid Partitioning for Mixed-Workload LLM Inference [1.7969777786551429]
EWSJF (Effective Workload-based Shortest Job First) learns workload structure in real time to jointly improve fairness and throughput. EWSJF improves end-to-end throughput by over 30% and reduces average Time-To-First-Token for short requests by up to 4x compared to FCFS.
arXiv Detail & Related papers (2026-01-29T14:14:16Z) - Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models [97.55009021098554]
This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training. We introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs.
arXiv Detail & Related papers (2025-11-24T08:46:36Z) - Intra-request branch orchestration for efficient LLM reasoning [52.68946975865865]
Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms to improve accuracy on complex tasks. Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors. We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions.
arXiv Detail & Related papers (2025-09-29T15:52:08Z) - Dynamic Speculative Agent Planning [57.630218933994534]
Large language-model-based agents face critical deployment challenges due to prohibitive latency and inference costs. We introduce Dynamic Speculative Planning (DSP), an online reinforcement learning framework that provides lossless acceleration with substantially reduced costs. Experiments on two standard agent benchmarks demonstrate that DSP achieves comparable efficiency to the fastest acceleration method while reducing total cost by 30% and unnecessary cost by up to 60%.
arXiv Detail & Related papers (2025-09-02T03:34:36Z) - HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling [19.154782641360253]
Modern large language model (LLM) serving systems face challenges from highly variable requests with diverse lengths, priorities, and stage-specific service-level objectives (SLOs). We present HyperFlexis, a unified LLM serving system that integrates algorithmic and system-level innovations to jointly optimize scheduling and scaling under multiple SLOs.
arXiv Detail & Related papers (2025-08-21T18:40:20Z) - Adaptive Request Scheduling for CodeLLM Serving with SLA Guarantees [6.110847503516972]
Code Large Language Models (CodeLLMs) are increasingly integrated into modern software development. Yet, meeting SLAs remains a significant challenge in resource-constrained, self-hosted serving environments. We propose SABER, a dynamic strategy that predicts per-request SLA feasibility and makes scheduling decisions in real time. Our results demonstrate that SLA-aware, adaptive scheduling is key to robust, high-performance CodeLLM serving.
arXiv Detail & Related papers (2025-06-24T14:44:33Z) - Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving [22.66354939370058]
Apt-Serve is a framework designed to enhance effective throughput in large language model (LLM) inference serving systems. A new hybrid cache scheme combines KV cache with a memory-efficient hidden cache for reusable input hidden state vectors, allowing large batch sizes and improving request concurrency. We show that Apt-Serve achieves up to 8.8x improvement in effective throughput compared to state-of-the-art inference serving systems.
arXiv Detail & Related papers (2025-04-10T06:51:23Z) - AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding [12.106234303559571]
We present AdaServe, the first serving system designed to support efficient multi-SLO serving through SLO-customized speculative decoding. AdaServe formulates multi-SLO serving as a constrained optimization problem and introduces a hardware-aware algorithm. It features a speculate-select-verify pipeline that enables fine-grained control over decoding speed while maximizing system throughput.
arXiv Detail & Related papers (2025-01-21T14:15:01Z) - ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving [61.35068981176018]
ConServe is a large language model (LLM) serving system that achieves high throughput and strong online latency guarantees. We show that ConServe delivers an average of 2.2x higher throughput and reduces online serving tail latency by 2.9x on average compared to state-of-the-art systems.
arXiv Detail & Related papers (2024-10-02T04:12:13Z) - Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally incurs the update of significant parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.