Adaptive Request Scheduling for CodeLLM Serving with SLA Guarantees
- URL: http://arxiv.org/abs/2506.19677v2
- Date: Wed, 25 Jun 2025 16:13:14 GMT
- Title: Adaptive Request Scheduling for CodeLLM Serving with SLA Guarantees
- Authors: Shi Chang, Boyuan Chen, Kishanthan Thangarajah, Hanan Lutfiyya, Ahmed E. Hassan,
- Abstract summary: Code Large Language Models (CodeLLMs) are increasingly integrated into modern software development. Yet, efficiently serving them in resource-constrained, self-hosted environments remains a significant challenge. We propose SABER, a dynamic batching strategy that predicts per-request SLA feasibility and adjusts decisions in real time. Our results demonstrate that SLA-aware, adaptive scheduling is key to robust, high-performance CodeLLM serving.
- Score: 6.110847503516972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code Large Language Models (CodeLLMs) are increasingly integrated into modern software development workflows, yet efficiently serving them in resource-constrained, self-hosted environments remains a significant challenge. Existing LLM serving systems employ Continuous Batching for throughput improvement. However, they rely on static batch size configurations that cannot adapt to fluctuating request rates or heterogeneous workloads, leading to frequent SLA (Service Level Agreement) violations and unstable performance. In this study, we propose SABER, a dynamic batching strategy that predicts per-request SLA feasibility and adjusts decisions in real time. SABER improves goodput by up to 26% over the best static configurations and reduces latency variability by up to 45%, all without manual tuning or service restarts. Our results demonstrate that SLA-aware, adaptive scheduling is key to robust, high-performance CodeLLM serving.
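A rough sketch of what SLA-feasibility-gated admission on top of continuous batching could look like is given below. The latency model, class names, and thresholds are illustrative assumptions, not details taken from the paper.

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    arrival: float         # arrival timestamp (seconds since epoch)
    max_new_tokens: int    # requested generation budget
    sla_latency_s: float   # per-request end-to-end latency SLA


@dataclass
class SlaAwareBatcher:
    """Hedged sketch: admit a queued request into the running batch only if a
    simple cost model predicts it can still finish within its SLA."""
    per_token_s: float = 0.02        # assumed decode cost per generated token
    batch_overhead_s: float = 0.002  # assumed extra per-token cost per batch slot
    running: list = field(default_factory=list)
    queue: deque = field(default_factory=deque)

    def predicted_latency(self, req: Request, batch_size: int) -> float:
        # Toy linear latency model (assumption): waiting time so far plus
        # decode time that grows with output length and batch size.
        step_cost = self.per_token_s + self.batch_overhead_s * batch_size
        return (time.time() - req.arrival) + req.max_new_tokens * step_cost

    def admit_feasible(self) -> None:
        # Dynamic batching decision: grow the batch while the head-of-queue
        # request is still predicted to meet its SLA; otherwise stop admitting.
        while self.queue:
            head = self.queue[0]
            if self.predicted_latency(head, len(self.running) + 1) <= head.sla_latency_s:
                self.running.append(self.queue.popleft())
            else:
                break
```

In a real serving stack the feasibility predictor would be profiled or learned per model and hardware rather than a fixed linear cost model.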
Related papers
- SLA-MORL: SLA-Aware Multi-Objective Reinforcement Learning for HPC Resource Optimization [0.9026828778470117]
We present SLA-MORL, an adaptive multi-objective reinforcement learning framework that intelligently allocates resources based on user-defined preferences. We show that SLA-MORL achieves 67.2% reduction in training time for deadline-critical jobs, 68.8% reduction in costs for budget-constrained workloads, and 73.4% improvement in overall SLA compliance compared to static baselines.
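As a purely illustrative aside (not SLA-MORL's actual formulation), user-defined preferences over the competing objectives could be expressed as a weighted scalarized reward:

```python
def scalarized_reward(training_time_s: float,
                      cost_usd: float,
                      sla_violations: int,
                      prefs: dict) -> float:
    """Hypothetical preference-weighted reward for multi-objective RL resource
    allocation: lower time, lower cost, and fewer SLA violations all increase
    the reward. Objectives are assumed to be normalized to comparable scales.
    `prefs` holds user-defined weights, e.g. {"time": 0.5, "cost": 0.2, "sla": 0.3}."""
    return -(prefs["time"] * training_time_s
             + prefs["cost"] * cost_usd
             + prefs["sla"] * sla_violations)
```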
arXiv Detail & Related papers (2025-08-05T14:37:24Z)
- Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs [48.653022530291494]
Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks. This work presents the first systematic study of this latency-quality trade-off in real-time decision-making tasks. We propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands.
arXiv Detail & Related papers (2025-05-26T04:03:48Z)
- Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing [15.386746669464964]
MorphServe is a workload-aware LLM serving framework based on morphological adaptation. It reduces average SLO violations by 92.45% and improves the P95 TTFT latency by 2.2x-3.9x compared to full-precision serving.
arXiv Detail & Related papers (2025-05-24T06:12:31Z)
- DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development. We present Dynamic Action Re-Sampling (DARS), a novel inference-time compute scaling approach for coding agents. We evaluate our approach on the SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z)
- SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding [18.45994543035372]
Speculative decoding has emerged as a compelling technique to accelerate Large Language Model inference. Existing speculative decoding solutions often fail to adapt to varying workloads and system environments. We introduce SpecServe, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads.
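One way to picture load-adaptive speculation, sketched below as an assumption rather than SpecServe's actual policy, is to shrink the draft length as request load rises and acceptance rates fall:

```python
def choose_draft_length(queue_depth: int,
                        recent_accept_rate: float,
                        max_draft: int = 8) -> int:
    """Illustrative policy (not from the paper): speculate aggressively only
    when the system is lightly loaded and recent drafts were mostly accepted."""
    if queue_depth > 32:                          # heavy load: plain decoding
        return 0
    draft = int(max_draft * recent_accept_rate)   # scale by acceptance rate
    if queue_depth > 8:                           # moderate load: cap speculation
        draft = min(draft, 2)
    return max(draft, 0)
```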
arXiv Detail & Related papers (2025-03-07T02:27:51Z)
- LADs: Leveraging LLMs for AI-Driven DevOps [3.240228178267042]
LADs is a principled approach to configuration optimization, built on an in-depth analysis of which optimizations work under which conditions. By leveraging Retrieval-Augmented Generation, Few-Shot Learning, Chain-of-Thought, and Feedback-Based Prompt Chaining, LADs generates accurate configurations and learns from deployment failures to iteratively refine system settings. Our findings reveal key insights into the trade-offs between performance, cost, and scalability, helping practitioners determine the right strategies for different deployment scenarios.
arXiv Detail & Related papers (2025-02-28T08:12:08Z)
- Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data [51.62162460809116]
We introduce Dynamic Noise Preference Optimization (DNPO) to ensure consistent improvements across iterations. In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6%. DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations.
arXiv Detail & Related papers (2025-02-08T01:20:09Z)
- Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization [61.02719787737867]
Large language models (LLMs) are increasingly deployed and democratized on edge devices. One promising solution is uncertainty-based SLM routing, which offloads high-stakes queries to stronger LLMs when the SLM produces low-confidence responses. We conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs across 1500+ settings.
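A minimal sketch of the routing idea, assuming hypothetical `slm`/`llm` handles and an arbitrary confidence threshold (the paper studies how such thresholds generalize):

```python
def route_query(query: str, slm, llm, confidence_threshold: float = 0.8) -> str:
    """Hedged sketch of uncertainty-based routing: answer on the on-device SLM
    unless its confidence is low, then offload to the stronger LLM.
    `slm` and `llm` are hypothetical objects; their methods are assumptions."""
    answer, confidence = slm.generate_with_confidence(query)
    if confidence >= confidence_threshold:
        return answer              # keep the cheap, on-device response
    return llm.generate(query)     # offload the low-confidence query
```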
arXiv Detail & Related papers (2025-02-06T18:59:11Z)
- AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding [12.106234303559571]
We present AdaServe, the first serving system designed to support efficient multi-SLO serving through SLO-customized speculative decoding. AdaServe formulates multi-SLO serving as a constrained optimization problem and introduces a hardware-aware algorithm. It features a speculate-select-verify pipeline that enables fine-grained control over decoding speed while maximizing system throughput.
arXiv Detail & Related papers (2025-01-21T14:15:01Z)
- SPEQ: Offline Stabilization Phases for Efficient Q-Learning in High Update-To-Data Ratio Reinforcement Learning [51.10866035483686]
High update-to-data (UTD) ratio algorithms in reinforcement learning (RL) improve sample efficiency but incur high computational costs, limiting real-world scalability. We propose Offline Stabilization Phases for Efficient Q-Learning (SPEQ), an RL algorithm that combines low-UTD online training with periodic offline stabilization phases. During these phases, Q-functions are fine-tuned with high UTD ratios on a fixed replay buffer, reducing redundant updates on suboptimal data.
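The alternation described here can be pictured with the following skeleton, where the agent, environment, and update routines are hypothetical placeholders rather than SPEQ's implementation:

```python
def train_speq_like(agent, env, replay_buffer,
                    online_steps: int = 10_000,
                    offline_updates: int = 50_000,
                    num_phases: int = 10) -> None:
    """Hedged sketch: alternate low-UTD online training with offline
    stabilization phases that fine-tune Q-functions at a high UTD ratio
    on a frozen replay buffer. All objects and methods are assumptions."""
    for _ in range(num_phases):
        # Online phase: collect data, update sparingly (low UTD ratio).
        obs = env.reset()
        for _ in range(online_steps):
            action = agent.act(obs)
            next_obs, reward, done, _ = env.step(action)
            replay_buffer.add(obs, action, reward, next_obs, done)
            agent.update(replay_buffer.sample())   # roughly one update per step
            obs = env.reset() if done else next_obs
        # Offline stabilization phase: many Q-function updates on fixed data.
        for _ in range(offline_updates):
            agent.update_q_only(replay_buffer.sample())
```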
arXiv Detail & Related papers (2025-01-15T09:04:19Z)
- Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally involves updating a significant number of parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z)
- Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems [0.0]
InfAdapter proactively selects a set of ML model variants and their resource allocations to meet the latency SLO.
It decreases SLO violations and cost by up to 65% and 33%, respectively, compared to a popular industry autoscaler.
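As an illustration only (not InfAdapter's algorithm), choosing among profiled variants under a latency SLO might reduce to picking the most accurate variant whose tail latency fits, breaking ties by cost:

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    name: str              # hypothetical variant identifier
    p99_latency_ms: float  # profiled tail latency at its resource allocation
    cost_per_hour: float
    accuracy: float


def pick_variant(variants: list, slo_ms: float) -> Optional[Variant]:
    """Hedged sketch: among variants predicted to meet the latency SLO,
    prefer the most accurate one and break ties by lower cost."""
    feasible = [v for v in variants if v.p99_latency_ms <= slo_ms]
    if not feasible:
        return None  # no variant meets the SLO; caller must scale or degrade
    return max(feasible, key=lambda v: (v.accuracy, -v.cost_per_hour))
```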
arXiv Detail & Related papers (2023-04-21T11:19:49Z)