Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep
Learning
- URL: http://arxiv.org/abs/2008.12260v2
- Date: Wed, 26 May 2021 06:08:21 GMT
- Title: Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep
Learning
- Authors: Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie
Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, Eric P. Xing
- Abstract summary: Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors.
Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
- Score: 61.29990368322931
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pollux improves scheduling performance in deep learning (DL) clusters by
adaptively co-optimizing inter-dependent factors both at the per-job level and
at the cluster-wide level. Most existing schedulers expect users to specify the
number of resources for each job, often leading to inefficient resource use.
Some recent schedulers choose job resources for users, but do so without
awareness of how DL training can be re-optimized to better utilize the provided
resources.
Pollux simultaneously considers both aspects. By monitoring the status of
each job during training, Pollux models how their goodput (a novel metric we
introduce that combines system throughput with statistical efficiency) would
change by adding or removing resources. Leveraging this information, Pollux
dynamically (re-)assigns resources to improve cluster-wide goodput, while
respecting fairness and continually optimizing each DL job to better utilize
those resources.
In experiments with real DL jobs and with trace-driven simulations, Pollux
reduces average job completion times by 37-50% relative to state-of-the-art DL
schedulers, even when they are provided with ideal resource and training
configurations for every job. Pollux promotes fairness among DL jobs competing
for resources based on a more meaningful measure of useful job progress, and
reveals a new opportunity for reducing DL cost in cloud environments. Pollux is
implemented and publicly available as part of an open-source project at
https://github.com/petuum/adaptdl.
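To make the goodput metric concrete, the sketch below (Python) illustrates the idea of choosing a per-job configuration that maximizes throughput multiplied by statistical efficiency, rather than raw throughput alone. The throughput and efficiency curves here are simplified placeholders invented for illustration, not the paper's actual models; Pollux fits its models from metrics it monitors during training.
```python
# Minimal illustrative sketch of the goodput idea (not Pollux's actual models).
# goodput = system throughput x statistical efficiency; a goodput-aware
# scheduler picks the (GPU count, batch size) that maximizes this product.

def throughput(num_gpus: int, batch_size: int, base_rate: float = 100.0) -> float:
    """Examples/sec; an assumed scaling curve with per-GPU communication overhead."""
    return base_rate * num_gpus * batch_size / (batch_size + 32 * num_gpus)

def statistical_efficiency(batch_size: int, noise_scale: float = 512.0) -> float:
    """Assumed fraction of per-example training progress retained at larger
    batch sizes (a simple gradient-noise-scale style approximation)."""
    return (noise_scale + 1.0) / (noise_scale + batch_size)

def goodput(num_gpus: int, batch_size: int) -> float:
    """The quantity Pollux optimizes: throughput weighted by statistical efficiency."""
    return throughput(num_gpus, batch_size) * statistical_efficiency(batch_size)

if __name__ == "__main__":
    candidates = [(g, b) for g in (1, 2, 4, 8) for b in (64, 128, 256, 512, 1024)]
    best_gpus, best_bs = max(candidates, key=lambda c: goodput(*c))
    print(f"best config: {best_gpus} GPUs, batch size {best_bs}, "
          f"goodput {goodput(best_gpus, best_bs):.1f}")
```
Note that the configuration with the highest raw throughput (largest batch size on the most GPUs) is not necessarily the best here, because statistical efficiency drops as the batch size grows; that trade-off is what distinguishes goodput from throughput.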
Related papers
- When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective [57.05315507519704]
We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing.
Our measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies up to 91%.
arXiv Detail & Related papers (2024-09-03T12:03:45Z)
- Prune at the Clients, Not the Server: Accelerated Sparse Training in Federated Learning [56.21666819468249]
Resource constraints of clients and communication costs pose major problems for training large models in Federated Learning.
We introduce Sparse-ProxSkip, which combines training and acceleration in a sparse setting.
Extensive experiments demonstrate the strong performance of Sparse-ProxSkip.
arXiv Detail & Related papers (2024-05-31T05:21:12Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching [1.047192732651018]
Current techniques for distributed model training mostly assume that clusters are composed of servers with constant resource availability.
We develop a dynamic technique for distributed data-parallel training that adjusts the mini-batch sizes on each worker based on availability and throughput.
arXiv Detail & Related papers (2023-05-20T15:33:06Z)
- Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately.
Our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model relative to the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
- HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments [37.55572042288321]
The training process of deep neural networks (DNNs) generally handles large-scale input data with many sparse features.
Paddle-HeterPS is composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method.
We show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller).
arXiv Detail & Related papers (2021-11-20T17:09:15Z)
- Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters [10.38396444951436]
Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers.
We propose Synergy, a resource-sensitive scheduler for shared GPU clusters.
Our experiments show that workload-aware CPU and memory allocations can improve average JCT up to 3.4x when compared to traditional GPU-proportional scheduling.
arXiv Detail & Related papers (2021-10-12T15:25:54Z)
- A Predictive Autoscaler for Elastic Batch Jobs [8.354712625979776]
Large batch jobs such as Deep Learning, HPC, and Spark require far more computational resources and incur higher cost than conventional online services.
We propose a predictive autoscaler to provide an elastic interface for customers and overprovision instances.
arXiv Detail & Related papers (2020-10-10T17:35:55Z)
- Effective Elastic Scaling of Deep Learning Workloads [3.345876096131764]
We examine the elastic scaling of Deep Learning (DL) jobs over large-scale training platforms.
We propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization.
arXiv Detail & Related papers (2020-06-24T17:01:09Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)