Related papers: Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters

Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters

URL: http://arxiv.org/abs/2512.10271v1
Date: Thu, 11 Dec 2025 04:19:44 GMT
Title: Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters
Authors: Shruti Dongare, Redwan Ibne Seraj Khan, Hadeel Albahar, Nannan Zhao, Diego Melendez Maita, Ali R. Butt,
Abstract summary: We present RLTune, an application-agnostic reinforcement learning (RL)-based scheduling framework that dynamically prioritizes and allocates deep learning jobs on heterogeneous GPU clusters.<n>RLTune improves GPU utilization by up to 20%, reduces queueing delay by up to 81%, and shortens JCT by as much as 70 percent.<n>Unlike prior approaches, RLTune generalizes across diverse workloads without requiring per-job profiling.
Score: 0.8445876768837571
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern cloud platforms increasingly host large-scale deep learning (DL) workloads, demanding high-throughput, low-latency GPU scheduling. However, the growing heterogeneity of GPU clusters and limited visibility into application characteristics pose major challenges for existing schedulers, which often rely on offline profiling or application-specific assumptions. We present RLTune, an application-agnostic reinforcement learning (RL)-based scheduling framework that dynamically prioritizes and allocates DL jobs on heterogeneous GPU clusters. RLTune integrates RL-driven prioritization with MILP-based job-to-node mapping to optimize system-wide objectives such as job completion time (JCT), queueing delay, and resource utilization. Trained on large-scale production traces from Microsoft Philly, Helios, and Alibaba, RLTune improves GPU utilization by up to 20%, reduces queueing delay by up to 81%, and shortens JCT by as much as 70 percent. Unlike prior approaches, RLTune generalizes across diverse workloads without requiring per-job profiling, making it practical for cloud providers to deploy at scale for more efficient, fair, and sustainable DL workload management.

Related papers

Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving [2.6336040306318274]
Large Language Model (LLM) adapters enable low-cost model specialization.<n>LLM adapters introduce complex caching and scheduling challenges in distributed serving systems where hundreds of adapters must be hosted concurrently.<n>This paper presents a data-driven pipeline that computes an adapter placement that serves the workload with the minimum number of GPU.
arXiv Detail & Related papers (2026-02-27T14:22:51Z)
RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure [49.88201789074532]
Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning.<n>We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure.
arXiv Detail & Related papers (2025-12-27T11:14:23Z)
Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the ''Three Taxes'' (Bulk Synchronous, Inter- Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework.<n>We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution.<n>We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs [48.94639777633359]
We present RLBoost, a systematic solution for cost-efficient RL training that harvests preemptible GPU resources.<n> RLBoost increases training throughput by 1.51x-1.97x while improving cost efficiency by 28%-49% compared to using only on-demand GPU resources.
arXiv Detail & Related papers (2025-10-22T04:19:37Z)
Semantic-Aware Scheduling for GPU Clusters with Large Language Models [60.14838697778884]
We propose SchedMate, a framework that bridges the semantic gap between schedulers and jobs they manage.<n>SchedMate extracts deep insights from overlooked, unstructured data sources: source code, runtime logs, and historical jobs.<n>We show SchedMate reduces average job completion times by up to 1.91x, substantially enhancing the scheduling performance.
arXiv Detail & Related papers (2025-10-02T02:01:02Z)
MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment [14.392166280035122]
Large language models (LLMs) deployed on edge servers are increasingly used in latency-sensitive applications such as personalized assistants, recommendation, and content moderation.<n>Existing retraining strategies either delay model updates, over-commit resources to retraining, or overlook iteration-level retraining granularity.<n>We propose MACE, a hybrid LLM system that colocates concurrent inference (prefill, decode) and fine-tuning, with intelligent memory management to maximize task performance while promising inference throughput.
arXiv Detail & Related papers (2025-09-28T18:45:28Z)
StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs)<n>StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks.<n> Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z)
Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters [26.874684454125152]
We propose a task-level scheduler, Hadar, based on an optimization framework that can boost resource utilization.<n>Hadar accelerates the total time duration by 1.20x when compared with its state-of-the-art counterpart, Gavel.<n>HadarE exhibits considerable speed-ups in DL model training, reducing the total time duration by 50% (or 80%) on an Amazon's AWS (or our lab) cluster.
arXiv Detail & Related papers (2025-03-13T22:13:20Z)
Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters [24.845122459974466]
This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm.<n>By modeling each job as a graph corresponding to heterogeneous Deep Neural Network (DNN) models, A-SRPT strategically assigns jobs to the available GPU.<n>A-SRPT maps the complex scheduling problem into a single-machine instance, which is addressed optimally by a preemptive "shortest-remaining-processing-time-first" strategy.
arXiv Detail & Related papers (2025-01-09T20:19:01Z)
GPU Cluster Scheduling for Network-Sensitive Deep Learning [15.926240223625165]
We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads.<n>Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling.
arXiv Detail & Related papers (2024-01-29T19:06:08Z)
Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of edge computing paradigm and parallel split learning (PSL) We propose an innovative PSL framework, namely, efficient parallel split learning (EPSL) to accelerate model training. We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z)
Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors. Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.