Related papers: Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters

URL: http://arxiv.org/abs/2503.10918v1
Date: Thu, 13 Mar 2025 22:13:20 GMT
Title: Resource Heterogeneity-Aware and Utilization-Enhanced Scheduling for Deep Learning Clusters
Authors: Abeda Sultana, Nabin Pakka, Fei Xu, Xu Yuan, Li Chen, Nian-Feng Tzeng
Abstract summary: Hadar is a task-level scheduler based on an optimization framework that can boost resource utilization. HadarE exhibits considerable speed-ups in DL model training, reducing the total time duration by 50% (or 80%) on an Amazon AWS (or our lab) cluster.
Score: 26.874684454125152
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scheduling deep learning (DL) models to train on powerful clusters with accelerators like GPUs and TPUs presently falls short, either lacking fine-grained heterogeneity awareness or leaving resources substantially under-utilized. To fill this gap, we propose a novel design of a task-level heterogeneity-aware scheduler, Hadar, based on an optimization framework that can boost resource utilization. Hadar leverages the performance traits of DL jobs on a heterogeneous DL cluster, characterizes the task-level performance heterogeneity in the optimization problem, and makes scheduling decisions across both spatial and temporal dimensions. It uses a primal-dual framework with a dual subroutine to solve the optimization problem and guide the scheduling design. Our trace-driven simulation with representative DL model training workloads demonstrates that Hadar reduces the total time duration by a factor of 1.20 compared with its state-of-the-art heterogeneity-aware counterpart, Gavel. Further, our Hadar scheduler is enhanced to HadarE by forking each job into multiple copies, letting a job train concurrently on heterogeneous GPUs residing on separate available nodes (i.e., machines or servers) to enhance resource utilization. HadarE is evaluated extensively on physical DL clusters for comparison with Hadar and Gavel. With substantial enhancement in cluster resource utilization (by 1.45×), HadarE exhibits considerable speed-ups in DL model training, reducing the total time duration by 50% (or 80%) on an Amazon AWS (or our lab) cluster, while producing trained DL models with consistently better inference quality than those trained by Hadar.

Related papers

StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z)
AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z)
Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters [24.845122459974466]
This paper proposes an adaptive shortest-remaining-processing-time-first (A-SRPT) scheduling algorithm. By modeling each job as a graph corresponding to heterogeneous Deep Neural Network (DNN) models, A-SRPT strategically assigns jobs to the available GPU. A-SRPT maps the complex scheduling problem into a single-machine instance, which is addressed optimally by a preemptive "shortest-remaining-processing-time-first" strategy.
arXiv Detail & Related papers (2025-01-09T20:19:01Z)
FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency. We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs). We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs. We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching [1.047192732651018]
Current techniques for distributed model training mostly assume that clusters are comprised of servers with a constant resource availability. We develop a dynamic technique for distributed data-parallel training that adjusts the mini-batch sizes on each worker based on availability and throughput.
arXiv Detail & Related papers (2023-05-20T15:33:06Z)
COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training [42.514897110537596]
Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization (to amortize their steep cost) is a challenging task. We introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training.
arXiv Detail & Related papers (2022-11-30T00:32:37Z)
Doing More by Doing Less: How Structured Partial Backpropagation Improves Deep Learning Clusters [9.17259958324486]
Training deep learning models is resource-intensive, consuming significant compute, memory, and network resources. We propose Structured Partial Backpropagation (SPB), a technique that controls the amount of backpropagation at individual workers in distributed training. We find that JigSaw can improve large-scale cluster efficiency by as much as 28%.
arXiv Detail & Related papers (2021-11-20T20:34:26Z)
Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers. A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers. Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers. We propose a novel dynamic GC scheme, which assigns redundant data to workers to acquire the flexibility to choose from among a set of possible codes depending on the past straggling behavior.
arXiv Detail & Related papers (2021-03-01T18:51:29Z)
Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models [8.055533378391814]
We show that single-accelerator training jobs can dominate the cluster-wide resource consumption when launched repetitively. We propose Horizontally Fused Training Array (HFTA) to help DL researchers and practitioners effectively and easily improve the hardware utilization of their novel DL training workloads. HFTA demonstrates strong effectiveness in squeezing out hardware utilization and achieves up to 15.1× higher training throughput vs. the standard practice of running each job on a separate accelerator.
arXiv Detail & Related papers (2021-02-03T23:56:55Z)
Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors. Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)
Effective Elastic Scaling of Deep Learning Workloads [3.345876096131764]
We examine the elastic scaling of Deep Learning (DL) jobs over large-scale training platforms. We propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization.
arXiv Detail & Related papers (2020-06-24T17:01:09Z)
Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models. This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.