Related papers: Semantic-Aware Scheduling for GPU Clusters with Large Language Models

URL: http://arxiv.org/abs/2510.03334v1
Date: Thu, 02 Oct 2025 02:01:02 GMT
Title: Semantic-Aware Scheduling for GPU Clusters with Large Language Models
Authors: Zerui Wang, Qinghao Hu, Ana Klimovic, Tianwei Zhang, Yonggang Wen, Peng Sun, Dahua Lin
Abstract summary: We propose SchedMate, a framework that bridges the semantic gap between schedulers and the jobs they manage. SchedMate extracts deep insights from overlooked, unstructured data sources: source code, runtime logs, and historical jobs. We show SchedMate reduces average job completion times by up to 1.91x, substantially enhancing the scheduling performance.
Score: 60.14838697778884
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep learning (DL) schedulers are pivotal in optimizing resource allocation in GPU clusters, but operate with a critical limitation: they are largely blind to the semantic context of the jobs they manage. This forces them to rely on limited metadata, leading to high profiling overhead, unreliable duration estimation, inadequate failure handling, and poor observability. To this end, we propose SchedMate, a framework that bridges this semantic gap by systematically extracting deep insights from overlooked, unstructured data sources: source code, runtime logs, and historical jobs. SchedMate enhances existing schedulers non-intrusively through three LLM-based components. Our implementation integrates seamlessly with existing deep learning schedulers. Evaluations on a 128-GPU physical cluster and extensive simulations on production traces show SchedMate reduces average job completion times by up to 1.91x, substantially enhancing the scheduling performance, demonstrating the critical role of semantic-awareness in modern DL scheduling.

Related papers

Morphis: SLO-Aware Resource Scheduling for Microservices with Time-Varying Call Graphs [26.269214281433364]
We propose Morphis, a dependency-aware framework that unifies pattern-aware trace analysis with global optimization. Our evaluations on the TrainTicket benchmark demonstrate that Morphis reduces CPU consumption by 35-38% compared to state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-01T06:04:19Z)
ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler [4.191309912359899]
We develop a fine-grained, non-intrusive profiling framework for modern inference engines. Our system attaches probes to runtime functions across multiple layers, without modifying or recompiling the source. It transforms collected traces into rich visualizations of operators, graphs, timelines, and hardware counter trends.
arXiv Detail & Related papers (2026-01-28T16:39:38Z)
Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters [0.8445876768837571]
We present RLTune, an application-agnostic reinforcement learning (RL)-based scheduling framework that dynamically prioritizes and allocates deep learning jobs on heterogeneous GPU clusters. RLTune improves GPU utilization by up to 20%, reduces queueing delay by up to 81%, and shortens JCT by as much as 70%. Unlike prior approaches, RLTune generalizes across diverse workloads without requiring per-job profiling.
arXiv Detail & Related papers (2025-12-11T04:19:44Z)
Beyond In-Context Learning: Aligning Long-form Generation of Large Language Models via Task-Inherent Attribute Guidelines [71.14354526117958]
In-context learning (ICL) is an important yet not fully understood ability of pre-trained large language models (LLMs). We present LongGuide, which efficiently generates two parallel streams of guidelines capturing task language and format properties. LongGuide automatically selects the best combination of guidelines, improving both strong open- and closed-source LLMs by over 5% in both zero- and few-shot settings.
arXiv Detail & Related papers (2025-06-02T02:35:24Z)
NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding the significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters [26.874684454125152]
We propose a task-level scheduler, Hadar, based on an optimization framework that can boost resource utilization. Hadar accelerates the total time duration by 1.20x when compared with its state-of-the-art counterpart, Gavel. HadarE exhibits considerable speed-ups in DL model training, reducing the total time duration by 50% (or 80%) on an Amazon AWS (or our lab) cluster.
arXiv Detail & Related papers (2025-03-13T22:13:20Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs. We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
Graph-enhanced Large Language Models in Asynchronous Plan Reasoning [18.402877904882107]
We find that large language models (LLMs) behave poorly when not supplied with illustrations about the task-solving process in our benchmark AsyncHow. We propose a novel technique called Plan Like a Graph (PLaG) that combines graphs with natural language prompts and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-02-05T08:26:33Z)
GPU Cluster Scheduling for Network-Sensitive Deep Learning [19.344426053952464]
We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling.
arXiv Detail & Related papers (2024-01-29T19:06:08Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast untapped potential of consumer-level GPUs. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.