Effective Elastic Scaling of Deep Learning Workloads
- URL: http://arxiv.org/abs/2006.13878v1
- Date: Wed, 24 Jun 2020 17:01:09 GMT
- Title: Effective Elastic Scaling of Deep Learning Workloads
- Authors: Vaibhav Saxena, K. R. Jayaram, Saurav Basu, Yogish Sabharwal and
Ashish Verma
- Abstract summary: We examine the elastic scaling of Deep Learning (DL) jobs over large-scale training platforms.
We propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization.
- Score: 3.345876096131764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increased use of deep learning (DL) in academia, government and industry
has, in turn, led to the popularity of on-premise and cloud-hosted deep
learning platforms, whose goals are to enable organizations to utilize expensive
resources effectively, and to share said resources among multiple teams in a
fair and effective manner.
In this paper, we examine the elastic scaling of Deep Learning (DL) jobs over
large-scale training platforms and propose a novel resource allocation strategy
for DL training jobs, resulting in improved job run time performance as well as
increased cluster utilization. We begin by analyzing DL workloads and exploit
the fact that DL jobs can be run with a range of batch sizes without affecting
their final accuracy. We formulate an optimization problem that explores a
dynamic batch size allocation to individual DL jobs based on their scaling
efficiency, when running on multiple nodes. We design a fast dynamic
programming based optimizer to solve this problem in real-time to determine
jobs that can be scaled up/down, and use this optimizer in an autoscaler to
dynamically change the allocated resources and batch sizes of individual DL
jobs.
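To make the allocation step concrete, the following is a minimal sketch of the kind of knapsack-style dynamic program such an optimizer could use: given, for each job, an estimated utility (e.g., throughput at its best feasible batch size) for every candidate GPU count, it chooses per-job GPU counts that maximize total utility under the cluster budget. The `dp_allocate` function, the `utility` table, and the example numbers are illustrative assumptions, not the paper's actual formulation, which additionally co-selects batch sizes based on measured scaling efficiency.

```python
# Hedged sketch of a knapsack-style dynamic program for GPU allocation.
# The utility table, function name, and example jobs are illustrative
# assumptions, not the paper's actual optimizer or data.

from typing import List, Tuple


def dp_allocate(utility: List[List[float]], total_gpus: int) -> Tuple[float, List[int]]:
    """Pick a GPU count for each job to maximize total utility.

    utility[j][g] is the estimated benefit (e.g., throughput at the best
    feasible batch size) of giving job j exactly g GPUs; utility[j][0] == 0.
    Runs in O(num_jobs * total_gpus * max_gpus_per_job), cheap enough to
    re-solve on every scheduling tick.
    """
    num_jobs = len(utility)
    # dp[g]: best total utility using at most g GPUs over the jobs seen so far.
    dp = [0.0] * (total_gpus + 1)
    # choice[j][g]: GPUs given to job j when g GPUs remain at its stage.
    choice = [[0] * (total_gpus + 1) for _ in range(num_jobs)]

    for j in range(num_jobs):
        new_dp = [0.0] * (total_gpus + 1)
        max_k = len(utility[j]) - 1  # largest GPU count this job can use
        for g in range(total_gpus + 1):
            best_val, best_k = dp[g], 0  # option: give job j no GPUs
            for k in range(1, min(g, max_k) + 1):
                val = dp[g - k] + utility[j][k]
                if val > best_val:
                    best_val, best_k = val, k
            new_dp[g], choice[j][g] = best_val, best_k
        dp = new_dp

    # Walk the choice table backwards to recover the per-job allocation.
    alloc, g = [0] * num_jobs, total_gpus
    for j in range(num_jobs - 1, -1, -1):
        alloc[j] = choice[j][g]
        g -= alloc[j]
    return dp[total_gpus], alloc


if __name__ == "__main__":
    # Two hypothetical jobs sharing a 4-GPU cluster: job 0 scales almost
    # linearly, job 1 saturates after 2 GPUs.
    utility = [
        [0.0, 1.0, 1.9, 2.7, 3.4],  # job 0
        [0.0, 1.0, 1.4, 1.5, 1.5],  # job 1
    ]
    total, alloc = dp_allocate(utility, total_gpus=4)
    print(total, alloc)  # 3.7 [3, 1]
```

An autoscaler loop could re-solve such a program at each scheduling interval and grow or shrink each job (adjusting its batch size to match the new GPU count) toward the returned allocation.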
We demonstrate empirically that our elastic scaling algorithm can complete up
to $\approx 2 \times$ as many jobs as a strong baseline algorithm that also
scales the number of GPUs but does not change the batch size. We also
demonstrate that the average completion time with our algorithm is up to
$\approx 10 \times$ faster than that of the baseline.
Related papers
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately.
Our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model to the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
- Singularity: Planet-Scale, Preemptible, Elastic Scheduling of AI Workloads [12.117736592836506]
We present Singularity, Microsoft's globally distributed scheduling service for deep learning training and inference workloads.
At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads.
We show that the resulting efficiency and reliability gains with Singularity are achieved with negligible impact on steady-state performance.
arXiv Detail & Related papers (2022-02-16T04:02:10Z)
- Doing More by Doing Less: How Structured Partial Backpropagation Improves Deep Learning Clusters [9.17259958324486]
Training deep learning models is resource-intensive, consuming significant compute, memory, and network resources.
We propose Structured Partial Backpropagation (SPB), a technique that controls the amount of backpropagation at individual workers in distributed training.
We find that JigSaw can improve large-scale cluster efficiency by as much as 28%.
arXiv Detail & Related papers (2021-11-20T20:34:26Z)
- Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters [10.395955671683245]
We propose ONES, an ONline Scheduler for elastic batch size orchestration.
ONES automatically manages the elasticity of each job based on the training batch size.
We show that ONES can outperform prior deep learning schedulers with a significantly shorter average job completion time.
arXiv Detail & Related papers (2021-08-08T14:20:05Z)
- BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes [0.8201100713224002]
FCFS-based scheduling policies result in many transient idle nodes.
We show how to realize a novel use for these otherwise wasted resources, namely, deep neural network (DNN) training.
arXiv Detail & Related papers (2021-06-22T22:53:19Z)
- Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors.
Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)