Semantic Scheduling for LLM Inference
- URL: http://arxiv.org/abs/2506.12204v1
- Date: Fri, 13 Jun 2025 20:15:58 GMT
- Title: Semantic Scheduling for LLM Inference
- Authors: Wenyue Hua, Dujian Ding, Yile Gu, Yujie Ren, Kai Mei, Minghua Ma, William Yang Wang,
- Abstract summary: We introduce the concept of semantic scheduling in scheduling of requests from large language models (LLM)<n>We present a novel scheduling algorithm with optimal time complexity, designed to minimize the overall waiting time in LLM-based prompt scheduling.
- Score: 48.19648297172146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional operating system scheduling algorithms are largely content-ignorant, making decisions based on factors such as latency or fairness without considering the actual intents or semantics of processes. Consequently, these algorithms often do not prioritize tasks that require urgent attention or carry higher importance, such as in emergency management scenarios. However, recent advances in language models enable semantic analysis of processes, allowing for more intelligent and context-aware scheduling decisions. In this paper, we introduce the concept of semantic scheduling in scheduling of requests from large language models (LLM), where the semantics of the process guide the scheduling priorities. We present a novel scheduling algorithm with optimal time complexity, designed to minimize the overall waiting time in LLM-based prompt scheduling. To illustrate its effectiveness, we present a medical emergency management application, underscoring the potential benefits of semantic scheduling for critical, time-sensitive tasks. The code and data are available at https://github.com/Wenyueh/latency_optimization_with_priority_constraints.
Related papers
- Haste Makes Waste: Evaluating Planning Abilities of LLMs for Efficient and Feasible Multitasking with Time Constraints Between Actions [56.88110850242265]
We present Recipe2Plan, a novel benchmark framework based on real-world cooking scenarios.<n>Unlike conventional benchmarks, Recipe2Plan challenges agents to optimize cooking time through parallel task execution.
arXiv Detail & Related papers (2025-03-04T03:27:02Z) - Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights [49.42133807824413]
We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks.<n>Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training.<n>OpenAI's o1 model shows promising performance through its novel use of multi-step reasoning and verification.
arXiv Detail & Related papers (2025-02-18T04:11:29Z) - Efficiently Scaling LLM Reasoning with Certaindex [25.549811985276488]
Test-time reasoning algorithms can wastefully generate many tokens without improving accuracy.<n>We introduce Certaindex, an algorithm-agnostic metric measuring when further computation is unlikely to alter the final result.<n>Certaindex is lightweight, can accelerate reasoning program inference via early exit, and enables dynamic token allocation.
arXiv Detail & Related papers (2024-12-30T14:57:53Z) - Dependency-Aware CAV Task Scheduling via Diffusion-Based Reinforcement Learning [12.504232513881828]
We propose a novel dependency-aware task scheduling strategy for dynamic unmanned aerial vehicle-assisted connected autonomous vehicles (CAVs)<n>We formulate a joint scheduling priority and subtask assignment optimization problem with the objective of minimizing the average task completion time.<n>We propose a diffusion-based reinforcement learning algorithm, named Synthetic DDQN based Subtasks Scheduling, which can make adaptive task scheduling decision in real time.
arXiv Detail & Related papers (2024-11-27T11:07:31Z) - Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning [94.76546523689113]
We introduce CodePlan, a framework that generates and follows textcode-form plans -- pseudocode that outlines high-level, structured reasoning processes.
CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks.
It achieves a 25.1% relative improvement compared with directly generating responses.
arXiv Detail & Related papers (2024-09-19T04:13:58Z) - The Road Less Scheduled [45.01813613035411]
Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T.
We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely.
Our Schedule-Free approach introduces no additional hyper- parameters over standard schedules with momentum.
arXiv Detail & Related papers (2024-05-24T16:20:46Z) - On efficient computation in active inference [1.1470070927586016]
We present a novel planning algorithm for finite temporal horizons with drastically lower computational complexity.
We also simplify the process of setting an appropriate target distribution for new and existing active inference planning schemes.
arXiv Detail & Related papers (2023-07-02T07:38:56Z) - Semantic-aware Transmission Scheduling: a Monotonicity-driven Deep
Reinforcement Learning Approach [39.681075180578986]
For cyber-physical systems in the 6G era, semantic communications are required to guarantee application-level performance.
In this paper, we first investigate the fundamental properties of the optimal semantic-aware scheduling policy.
We then develop advanced deep reinforcement learning (DRL) algorithms by leveraging the theoretical guidelines.
arXiv Detail & Related papers (2023-05-23T05:45:22Z) - Common Language for Goal-Oriented Semantic Communications: A Curriculum
Learning Framework [66.81698651016444]
A comprehensive semantic communications framework is proposed for enabling goal-oriented task execution.
A novel top-down framework that combines curriculum learning (CL) and reinforcement learning (RL) is proposed to solve this problem.
Simulation results show that the proposed CL method outperforms traditional RL in terms of convergence time, task execution time, and transmission cost during training.
arXiv Detail & Related papers (2021-11-15T19:13:55Z) - Better than the Best: Gradient-based Improper Reinforcement Learning for
Network Scheduling [60.48359567964899]
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay.
We use a policy gradient based reinforcement learning algorithm that produces a scheduler that performs better than the available atomic policies.
arXiv Detail & Related papers (2021-05-01T10:18:34Z) - Scheduling Plans of Tasks [0.0]
We present a algorithm for solving the problem of scheduling plans of tasks.
The proposed algorithm searches for a feasible schedule that maximize the number of plans scheduled.
arXiv Detail & Related papers (2021-02-06T10:14:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.