PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
- URL: http://arxiv.org/abs/2602.11530v1
- Date: Thu, 12 Feb 2026 03:40:44 GMT
- Title: PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
- Authors: Eunyeong Cho, Jehyeon Bang, Ranggi Hwang, Minsoo Rhu
- Abstract summary: We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering phase SLO attainment.
- Score: 3.088398451509366
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation under GPU memory constraints. We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution and enables dynamic migration at phase boundaries to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering phase SLO attainment, demonstrating the importance of phase-aware scheduling for reasoning-based LLM deployment.
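The abstract's core idea, scheduling reasoning-phase requests ahead of answering-phase ones while keeping FIFO order within each phase, can be illustrated with a minimal priority-queue sketch. This is a hypothetical illustration, not PASCAL's actual implementation: the class and constant names are invented, and the real system additionally applies controlled preemption, token pacing, and cross-instance migration at phase boundaries.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Lower value = higher priority: reasoning-phase requests go first to cut TTFT.
REASONING, ANSWERING = 0, 1

@dataclass(order=True)
class Request:
    priority: int                      # phase, used for ordering
    seq: int                           # FIFO tie-break within a phase
    rid: str = field(compare=False)    # request id, excluded from ordering
    phase: int = field(compare=False)

class PhaseAwareScheduler:
    """Toy phase-aware scheduler: a heap keyed on (phase, arrival order)."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def submit(self, rid, phase):
        heapq.heappush(self._heap, Request(phase, next(self._seq), rid, phase))

    def next_batch(self, batch_size):
        # Drain up to batch_size requests; reasoning-phase requests always
        # precede answering-phase ones, FIFO within each phase.
        batch = []
        while self._heap and len(batch) < batch_size:
            batch.append(heapq.heappop(self._heap).rid)
        return batch
```

Submitting an answering-phase request before two reasoning-phase ones still yields the reasoning requests first, which is the behavior that shortens tail TTFT in the paper's setting.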
Related papers
- A State-Transition Framework for Efficient LLM Reasoning [58.18141262230392]
Long Chain-of-Thought (CoT) reasoning significantly improves the performance of Large Language Models (LLMs) on complex reasoning tasks. Existing studies usually enhance the reasoning efficiency of LLMs by compressing CoT sequences. We propose an efficient reasoning framework that models the reasoning process of LLMs as a state-transition process.
arXiv Detail & Related papers (2026-02-01T12:40:40Z) - DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training [22.898073682504023]
In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction. We formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG). We present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies.
arXiv Detail & Related papers (2026-01-29T15:10:13Z) - Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference [29.81657023400426]
Token-oriented methods, such as pruning and skipping, have shown promise in reducing inference latency. We propose SPTS (Self-Predictive Token Skipping), a training-free framework for efficient long-context inference.
arXiv Detail & Related papers (2026-01-19T15:34:29Z) - FairBatching: Fairness-Aware Batch Formation for LLM Inference [2.0917668141703207]
This work identifies the root cause of this unfairness: the non-monotonic nature of Time-Between-Tokens (TBT). We propose FairBatching, a novel system that enforces fair resource allocation between prefill and decode tasks.
arXiv Detail & Related papers (2025-10-16T07:43:56Z) - Beyond Surface Reasoning: Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models [54.81955614221652]
Parallel decoding, which enables simultaneous token updates, conflicts with the causal order often required for rigorous reasoning. Behavioral analyses in both simple and complex reasoning tasks show that DLLMs exhibit genuine parallelism only for directly decidable outputs. We propose several practical mitigations (parallel-oriented prompting, diffusion early stopping, and parallel scaling) to reduce PSC-induced ineffectiveness and inefficiency.
arXiv Detail & Related papers (2025-10-10T16:58:14Z) - Intra-request branch orchestration for efficient LLM reasoning [52.68946975865865]
Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms to improve accuracy on complex tasks. Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors. We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions.
arXiv Detail & Related papers (2025-09-29T15:52:08Z) - Prompt-Aware Scheduling for Low-Latency LLM Serving [4.410280212028576]
We introduce PARS, a prompt-aware LLM task scheduler. It approximates shortest-job-first (SJF) scheduling through pairwise ranking with a margin ranking loss. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead.
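The SJF approximation described above rests on a pairwise objective: the predictor only needs to score which of two prompts will yield the shorter response, not the exact length. A minimal sketch of that loss follows; the function name and margin value are assumptions for illustration, and the actual system trains a learned ranker over prompt features.

```python
def margin_ranking_loss(score_short, score_long, margin=1.0):
    """Hinge-style pairwise ranking loss for approximating SJF.

    The predictor should score the shorter job lower (i.e., earlier)
    than the longer job by at least `margin`; otherwise the pair is
    penalized. Equivalent to torch.nn.MarginRankingLoss with target
    y = 1 applied to (score_long, score_short) pairs.
    """
    return max(0.0, margin - (score_long - score_short))
```

A correctly ordered pair incurs zero loss once the score gap exceeds the margin, while an inverted pair is penalized in proportion to how badly it is mis-ranked.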
arXiv Detail & Related papers (2025-09-25T07:26:38Z) - CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency. We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances the problem-solving ability of large language models. CoT incurs substantial inference cost due to long autoregressive trajectories. We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations. This structured approach allows Large Language Models (LLMs) to generate and execute SQL queries, enhancing generalization and mitigating biases.
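The pattern this entry describes, having the model emit SQL that is executed against the table rather than reasoning over it in free text, can be sketched with the standard library's sqlite3 module. The table schema and the "generated" query below are invented for illustration; TempTabQA-C's actual data and schema differ.

```python
import sqlite3

# Toy temporal-tabular question: who held office in 2003?
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tenure (name TEXT, start_year INT, end_year INT)")
conn.executemany(
    "INSERT INTO tenure VALUES (?, ?, ?)",
    [("Alice", 2001, 2005), ("Bob", 2005, 2010)],
)

# In the LLM-symbolic setup this string would come from the model;
# here it is hard-coded for the sketch.
generated_sql = (
    "SELECT name FROM tenure WHERE start_year <= 2003 AND end_year >= 2003"
)
answer = [row[0] for row in conn.execute(generated_sql)]
```

Executing the query symbolically makes the temporal comparison exact instead of depending on the model's in-context arithmetic, which is the robustness benefit the abstract points to.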
arXiv Detail & Related papers (2025-06-06T05:14:04Z) - ALISE: Accelerating Large Language Model Serving with Speculative Scheduling [7.367068885621016]
Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI).
In this paper, we propose a new efficient LLM inference serving framework, named ALISE.
We show that ALISE improves the throughput of inference serving by up to 1.8x and 2.1x under the same latency constraint on the Alpaca and ShareGPT datasets, respectively.
arXiv Detail & Related papers (2024-10-31T00:58:11Z) - Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control.
We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv Detail & Related papers (2024-10-07T23:38:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.