Optimal Scheduling Algorithms for LLM Inference: Theory and Practice
- URL: http://arxiv.org/abs/2508.01002v1
- Date: Fri, 01 Aug 2025 18:12:21 GMT
- Title: Optimal Scheduling Algorithms for LLM Inference: Theory and Practice
- Authors: Agrim Bari, Parikshit Hegde, Gustavo de Veciana
- Abstract summary: This paper develops a theoretical framework that models routing and scheduling in Large Language Model inference systems. We identify two key design principles, optimal tiling and dynamic resource allocation, that are essential for achieving high throughput. We show that the Resource-Aware Dynamic (RAD) scheduler achieves throughput optimality under mild conditions.
- Score: 6.043830060363904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the growing use of Large Language Model (LLM)-based tools like ChatGPT, Perplexity, and Gemini across industries, there is a rising need for efficient LLM inference systems. These systems handle requests with a unique two-phase computation structure: a prefill phase that processes the full input prompt and a decode phase that autoregressively generates tokens one at a time. This structure calls for new strategies for routing and scheduling requests. In this paper, we take a comprehensive approach to this challenge by developing a theoretical framework that models routing and scheduling in LLM inference systems. We identify two key design principles, optimal tiling and dynamic resource allocation, that are essential for achieving high throughput. Guided by these principles, we propose the Resource-Aware Dynamic (RAD) scheduler and prove that it achieves throughput optimality under mild conditions. To address practical Service Level Objectives (SLOs) such as serving requests with different Time Between Token (TBT) constraints, we design the SLO-Aware LLM Inference (SLAI) scheduler. SLAI uses real-time measurements to prioritize decode requests that are close to missing their TBT deadlines and reorders prefill requests based on known prompt lengths to further reduce the Time To First Token (TTFT) delays. We evaluate SLAI on the Openchat ShareGPT4 dataset using the Mistral-7B model on an NVIDIA RTX ADA 6000 GPU. Compared to Sarathi-Serve, SLAI reduces the median TTFT by 53% and increases the maximum serving capacity by 26% such that median TTFT is below 0.5 seconds, while meeting tail TBT latency constraints.
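The two priority rules the abstract attributes to SLAI are simple to state in code. Below is a minimal sketch: decode requests are admitted in order of slack to their TBT deadline, and any leftover token budget is filled with the shortest known prompts to reduce TTFT. The class names, the token-budget batching interface, and the one-token-per-decode cost model are illustrative assumptions, not the paper's implementation.

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class DecodeRequest:
    slack: float = field(default=0.0)                    # time left until the TBT deadline; smallest first
    req_id: int = field(default=0, compare=False)
    deadline: float = field(default=0.0, compare=False)  # absolute time the next token is due

@dataclass(order=True)
class PrefillRequest:
    prompt_len: int = field(default=0)                   # known prompt length; shortest first cuts TTFT
    req_id: int = field(default=0, compare=False)

def build_batch(decodes, prefills, token_budget):
    """Greedy SLAI-flavoured batch construction (illustrative sketch).

    Decode requests closest to missing their TBT deadline are admitted
    first; leftover token budget is filled with the shortest pending
    prefills. Assumes each decode step costs one token of budget and a
    prefill costs its full prompt length.
    """
    now = time.monotonic()
    for r in decodes:
        r.slack = r.deadline - now               # refresh deadline slack
    heapq.heapify(decodes)                       # min-heap on slack
    heapq.heapify(prefills)                      # min-heap on prompt length

    batch, budget = [], token_budget
    while decodes and budget >= 1:               # deadline-critical decodes first
        batch.append(heapq.heappop(decodes))
        budget -= 1
    while prefills and budget >= prefills[0].prompt_len:
        p = heapq.heappop(prefills)              # shortest prompts next
        batch.append(p)
        budget -= p.prompt_len
    return batch
```

Admitting near-deadline decodes first is what protects the tail TBT constraint, while serving short prompts first drives the median TTFT reduction the abstract reports.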
Related papers
- How to Train Your LLM Web Agent: A Statistical Diagnosis [102.04125085041473]
We present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT) and on-policy reinforcement learning. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++.
arXiv Detail & Related papers (2025-07-05T17:12:33Z)
- Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
- ELIS: Efficient LLM Iterative Scheduling System with Response Length Predictor [5.097511974401423]
ELIS is a serving system for Large Language Models (LLMs) featuring an Iterative Shortest Remaining Time First (ISRTF) scheduler. The ISRTF scheduler serves the inference tasks with the shortest predicted remaining time first.
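As a rough illustration of the idea (the predictor interface and names below are assumptions, not ELIS's actual API), shortest-remaining-time-first over predicted output lengths reduces to a re-sort each scheduling iteration:

```python
def isrtf_step(active_requests, predict_remaining_tokens, batch_size):
    """One ISRTF-style scheduling step (illustrative sketch).

    Re-ranks active requests every iteration by *predicted* remaining
    output tokens, so early mispredictions get corrected as tokens are
    actually generated.
    """
    ranked = sorted(active_requests, key=predict_remaining_tokens)
    return ranked[:batch_size]   # serve the shortest predicted-remaining jobs
```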
arXiv Detail & Related papers (2025-05-14T04:50:00Z)
- Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources. This paper formulates LLM inference optimization as a multi-stage online scheduling problem. We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv Detail & Related papers (2025-04-15T16:00:21Z)
- SCoTT: Strategic Chain-of-Thought Tasking for Wireless-Aware Robot Navigation in Digital Twins [78.53885607559958]
We propose SCoTT, a wireless-aware path planning framework. We show that SCoTT achieves path gains within 2% of DP-WA* while consistently generating shorter trajectories. We also show the practical viability of our approach by deploying SCoTT as a ROS node within Gazebo simulations.
arXiv Detail & Related papers (2024-11-27T10:45:49Z)
- ALISE: Accelerating Large Language Model Serving with Speculative Scheduling [7.367068885621016]
Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI).
In this paper, we propose a new efficient LLM inference serving framework, named ALISE.
We show that ALISE improves the throughput of inference serving by up to 1.8x and 2.1x under the same latency constraint on the Alpaca and ShareGPT datasets, respectively.
arXiv Detail & Related papers (2024-10-31T00:58:11Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Don't Stop Me Now: Embedding Based Scheduling for LLMs [22.099820814682513]
Size-based scheduling algorithms like Shortest Remaining Processing Time (SRPT) aim to reduce average request completion time.
We propose a prediction-based SRPT variant with limited preemption designed to account for memory overhead in LLM systems.
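A minimal sketch of what "limited preemption" can mean in this setting follows; the margin and cap parameters are illustrative assumptions, not the paper's tuned rule.

```python
def should_preempt(running, candidate, predict_remaining,
                   margin=2.0, max_preemptions=1):
    """Prediction-based SRPT with limited preemption (illustrative).

    Preempting a running LLM request forces its KV cache to be held or
    recomputed, so a switch is only worthwhile when the predicted win
    is large and the running request has not been preempted too often.
    """
    if running.preemptions >= max_preemptions:
        return False                      # cap per-request memory churn
    return margin * predict_remaining(candidate) < predict_remaining(running)
```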
arXiv Detail & Related papers (2024-10-01T19:51:07Z)
- Efficient LLM Scheduling by Learning to Rank [19.33941579312897]
We show that it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank.
We develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches.
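A sketch of the ranking idea follows (the `rank_model.score` interface is an assumption): only the relative order of scores must track output length, which is an easier learning target than exact length prediction.

```python
def rank_schedule(pending, rank_model):
    """Approximate shortest-job-first via a learned ranker (sketch).

    `rank_model.score(prompt)` is assumed to return a scalar whose
    ordering tracks expected output length (lower = shorter), so sorting
    by it approximates the SJF order without exact length predictions.
    """
    return sorted(pending, key=lambda req: rank_model.score(req.prompt))
```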
arXiv Detail & Related papers (2024-08-28T13:35:54Z)
- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models.
It is the first to achieve high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families.
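For context, the simplest 1-bit baseline looks as follows; this is a generic sign-and-scale binarization, not BiLLM's salient-weight and residual scheme, which is what recovers the accuracy the entry cites.

```python
import numpy as np

def binarize_rowwise(W):
    """Generic 1-bit weight quantization baseline (not BiLLM's scheme).

    Each row is approximated as alpha * sign(W); alpha = mean(|row|)
    minimizes the per-row L2 reconstruction error for a fixed sign
    pattern. BiLLM additionally spends its ~0.08 extra bits per weight
    on salient weights via a residual binarization pass.
    """
    alpha = np.abs(W).mean(axis=1, keepdims=True)   # per-row scale
    B = np.sign(W)
    B[B == 0] = 1.0                                 # keep weights strictly 1-bit
    return alpha, B                                 # reconstruct as alpha * B
```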
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
- Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling [60.48359567964899]
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay.
We use a policy gradient based reinforcement learning algorithm that produces a scheduler that performs better than the available atomic policies.
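One way to read "improper" learning here: the learned scheduler need not equal any single atomic policy but may combine them. Below is a toy REINFORCE sketch over a softmax mixture of atomic schedulers; the hyperparameters and the episode interface are assumptions, and this simplifies the paper's gradient-based method.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def train_mixture(atomic_policies, run_episode, steps=1000, lr=0.05):
    """Toy policy-gradient mixture over given atomic schedulers.

    `run_episode(policy)` is assumed to run the queueing network under
    `policy` and return a reward such as negative mean packet delay.
    """
    theta = np.zeros(len(atomic_policies))           # mixture logits
    for _ in range(steps):
        probs = softmax(theta)
        k = np.random.choice(len(atomic_policies), p=probs)
        reward = run_episode(atomic_policies[k])
        grad = -probs
        grad[k] += 1.0                               # d/dtheta of log probs[k]
        theta += lr * reward * grad                  # REINFORCE update
    return softmax(theta)                            # learned mixture weights
```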
arXiv Detail & Related papers (2021-05-01T10:18:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.