Effective Elastic Scaling of Deep Learning Workloads
- URL: http://arxiv.org/abs/2006.13878v1
- Date: Wed, 24 Jun 2020 17:01:09 GMT
- Title: Effective Elastic Scaling of Deep Learning Workloads
- Authors: Vaibhav Saxena, K. R. Jayaram, Saurav Basu, Yogish Sabharwal and
Ashish Verma
- Abstract summary: We examine the elastic scaling of Deep Learning (DL) jobs over large-scale training platforms.
We propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization.
- Score: 3.345876096131764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increased use of deep learning (DL) in academia, government and industry
has, in turn, led to the popularity of on-premise and cloud-hosted deep
learning platforms, whose goals are to enable organizations to utilize expensive
resources effectively, and to share said resources among multiple teams in a
fair and effective manner.
In this paper, we examine the elastic scaling of Deep Learning (DL) jobs over
large-scale training platforms and propose a novel resource allocation strategy
for DL training jobs, resulting in improved job run time performance as well as
increased cluster utilization. We begin by analyzing DL workloads and exploit
the fact that DL jobs can be run with a range of batch sizes without affecting
their final accuracy. We formulate an optimization problem that explores a
dynamic batch size allocation to individual DL jobs based on their scaling
efficiency, when running on multiple nodes. We design a fast dynamic
programming based optimizer to solve this problem in real-time to determine
jobs that can be scaled up/down, and use this optimizer in an autoscaler to
dynamically change the allocated resources and batch sizes of individual DL
jobs.
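To make the allocation step concrete, the following is a minimal sketch of the kind of knapsack-style dynamic program such an optimizer could use: given, for each job, an estimated utility (e.g., throughput at its best feasible batch size) for every candidate GPU count, it chooses per-job GPU counts that maximize total utility under the cluster budget. The `dp_allocate` function, the `utility` table, and the example numbers are illustrative assumptions, not the paper's actual formulation, which additionally co-selects batch sizes based on measured scaling efficiency.

```python
# Hedged sketch of a knapsack-style dynamic program for GPU allocation.
# The utility table, function name, and example jobs are illustrative
# assumptions, not the paper's actual optimizer or data.

from typing import List, Tuple


def dp_allocate(utility: List[List[float]], total_gpus: int) -> Tuple[float, List[int]]:
    """Pick a GPU count for each job to maximize total utility.

    utility[j][g] is the estimated benefit (e.g., throughput at the best
    feasible batch size) of giving job j exactly g GPUs; utility[j][0] == 0.
    Runs in O(num_jobs * total_gpus * max_gpus_per_job), cheap enough to
    re-solve on every scheduling tick.
    """
    num_jobs = len(utility)
    # dp[g]: best total utility using at most g GPUs over the jobs seen so far.
    dp = [0.0] * (total_gpus + 1)
    # choice[j][g]: GPUs given to job j when g GPUs remain at its stage.
    choice = [[0] * (total_gpus + 1) for _ in range(num_jobs)]

    for j in range(num_jobs):
        new_dp = [0.0] * (total_gpus + 1)
        max_k = len(utility[j]) - 1  # largest GPU count this job can use
        for g in range(total_gpus + 1):
            best_val, best_k = dp[g], 0  # option: give job j no GPUs
            for k in range(1, min(g, max_k) + 1):
                val = dp[g - k] + utility[j][k]
                if val > best_val:
                    best_val, best_k = val, k
            new_dp[g], choice[j][g] = best_val, best_k
        dp = new_dp

    # Walk the choice table backwards to recover the per-job allocation.
    alloc, g = [0] * num_jobs, total_gpus
    for j in range(num_jobs - 1, -1, -1):
        alloc[j] = choice[j][g]
        g -= alloc[j]
    return dp[total_gpus], alloc


if __name__ == "__main__":
    # Two hypothetical jobs sharing a 4-GPU cluster: job 0 scales almost
    # linearly, job 1 saturates after 2 GPUs.
    utility = [
        [0.0, 1.0, 1.9, 2.7, 3.4],  # job 0
        [0.0, 1.0, 1.4, 1.5, 1.5],  # job 1
    ]
    total, alloc = dp_allocate(utility, total_gpus=4)
    print(total, alloc)  # 3.7 [3, 1]
```

An autoscaler loop could re-solve such a program at each scheduling interval and grow or shrink each job (adjusting its batch size to match the new GPU count) toward the returned allocation.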
We demonstrate empirically that our elastic scaling algorithm can complete up
to $\approx 2 \times$ as many jobs as a strong baseline algorithm that also
scales the number of GPUs but does not change the batch size. We also
demonstrate that the average completion time with our algorithm is up to
$\approx 10 \times$ faster than that of the baseline.
Related papers
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately.
Our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model to the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
- Singularity: Planet-Scale, Preemptible, Elastic Scheduling of AI Workloads [12.117736592836506]
We present Singularity, Microsoft's globally distributed scheduling service for deep learning training and inference workloads.
At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads.
We show that the resulting efficiency and reliability gains with Singularity are achieved with negligible impact on steady-state performance.
arXiv Detail & Related papers (2022-02-16T04:02:10Z)
- Doing More by Doing Less: How Structured Partial Backpropagation Improves Deep Learning Clusters [9.17259958324486]
Training deep learning models is resource-intensive, consuming significant compute, memory, and network resources.
We propose Structured Partial Backpropagation (SPB), a technique that controls the amount of backpropagation at individual workers in distributed training.
We find that JigSaw can improve large-scale cluster efficiency by as much as 28%.
arXiv Detail & Related papers (2021-11-20T20:34:26Z)
- Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters [10.395955671683245]
We propose ONES, an ONline Scheduler for elastic batch size orchestration.
ONES automatically manages the elasticity of each job based on the training batch size.
We show that ONES can outperform prior deep learning schedulers with a significantly shorter average job completion time.
arXiv Detail & Related papers (2021-08-08T14:20:05Z)
- BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes [0.8201100713224002]
FCFS-based scheduling policies result in many transient idle nodes.
We show how to realize a novel use for these otherwise wasted resources, namely, deep neural network (DNN) training.
arXiv Detail & Related papers (2021-06-22T22:53:19Z)
- Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors.
Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)