Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters
- URL: http://arxiv.org/abs/2110.06073v1
- Date: Tue, 12 Oct 2021 15:25:54 GMT
- Title: Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters
- Authors: Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni, Vijay
Chidambaram
- Abstract summary: Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers.
We propose Synergy, a resource-sensitive scheduler for shared GPU clusters.
Our experiments show that workload-aware CPU and memory allocations can improve average job completion time (JCT) by up to 3.4x compared to traditional GPU-proportional scheduling.
- Score: 10.38396444951436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training Deep Neural Networks (DNNs) is a widely popular workload in both
enterprises and cloud data centers. Existing schedulers for DNN training
consider GPU as the dominant resource, and allocate other resources such as CPU
and memory proportional to the number of GPUs requested by the job.
Unfortunately, these schedulers do not account for a job's sensitivity to the
amount of CPU, memory, and storage it is allocated. In this work,
we propose Synergy, a resource-sensitive scheduler for shared GPU clusters.
Synergy infers the sensitivity of DNNs to different resources using optimistic
profiling; some jobs may benefit from more than the GPU-proportional
allocation, while others may be unaffected by less than the GPU-proportional
allocation. Synergy performs such multi-resource workload-aware assignments
across a set of jobs scheduled on shared multi-tenant clusters using a new
near-optimal online algorithm. Our experiments show that workload-aware CPU and
memory allocations can improve average job completion time (JCT) by up to 3.4x
compared to traditional GPU-proportional scheduling.
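The following is a minimal, hypothetical sketch (in Python) of the contrast the abstract draws between GPU-proportional allocation and workload-aware allocation driven by profiled CPU/memory sensitivity. It is not the authors' code or Synergy's actual algorithm; the job demands, server sizes, and all names are illustrative assumptions.

from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus: int           # GPUs requested by the job
    cpus_needed: float  # CPUs at which throughput saturates (from a profiling step)
    mem_needed: float   # GB of host memory at which throughput saturates

@dataclass
class Server:
    gpus: int = 8
    cpus: int = 64
    mem_gb: float = 500.0

def gpu_proportional(job: Job, server: Server) -> tuple[float, float]:
    """Baseline policy: CPU and memory simply follow the job's GPU share."""
    share = job.gpus / server.gpus
    return share * server.cpus, share * server.mem_gb

def workload_aware(jobs: list[Job], server: Server):
    """Greedy sketch of a workload-aware assignment: give every job its
    profiled demand if the aggregate fits on the server, so jobs that need
    less than their GPU share free up CPU/memory for jobs that need more."""
    fits = (sum(j.gpus for j in jobs) <= server.gpus
            and sum(j.cpus_needed for j in jobs) <= server.cpus
            and sum(j.mem_needed for j in jobs) <= server.mem_gb)
    if not fits:
        return None  # a real scheduler would re-pack jobs or scale demands down
    return {j.name: (j.cpus_needed, j.mem_needed) for j in jobs}

if __name__ == "__main__":
    server = Server()
    jobs = [
        # CPU-hungry input pipeline: benefits from more than its GPU share.
        Job("image-classifier", gpus=4, cpus_needed=48, mem_needed=350.0),
        # Insensitive to extra CPU: unaffected by less than its GPU share.
        Job("language-model", gpus=4, cpus_needed=8, mem_needed=100.0),
    ]
    for j in jobs:
        print(j.name, "GPU-proportional:", gpu_proportional(j, server))
    print("workload-aware:", workload_aware(jobs, server))

In this toy setup the CPU-heavy job receives more than its GPU-proportional 32 CPUs, while the CPU-insensitive job cedes CPUs it cannot use; that redistribution is the intuition behind the reported JCT gains.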
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- SGPRS: Seamless GPU Partitioning Real-Time Scheduler for Periodic Deep Learning Workloads [0.9898607871253774]
We propose SGPRS, the first real-time GPU scheduler that accounts for zero-configuration partition switching.
The proposed scheduler not only meets more deadlines for parallel tasks but also sustains overall performance beyond the pivot point.
arXiv Detail & Related papers (2024-04-13T18:29:26Z)
- Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows [0.792324422300924]
We consider ML query processing in distributed systems where GPU-enabled workers coordinate to execute complex queries.
In such systems, coscheduling of GPU memory management and task placement represents a promising opportunity.
We propose Compass, a novel framework that unifies these functions to reduce job latency while using resources efficiently.
arXiv Detail & Related papers (2024-02-27T16:21:28Z)
- Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload in both edge devices and data centers.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for sparse multi-DNN scheduling.
Our proposed approach outperforms state-of-the-art methods, with up to a 10% decrease in latency constraint violation rate and nearly a 4x reduction in average normalized turnaround time.
arXiv Detail & Related papers (2023-10-17T09:25:17Z)
- Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU [7.972518585452826]
Concurrently running multiple deep neural networks (DNNs) is an emerging requirement on edge GPUs.
Miriam is a contention-aware task coordination framework for multi-DNN inference on edge GPU.
arXiv Detail & Related papers (2023-07-10T04:30:44Z)
- Communication-Efficient Graph Neural Networks with Probabilistic Neighborhood Expansion Analysis and Caching [59.8522166385372]
Training and inference with graph neural networks (GNNs) on massive graphs has been actively studied since the inception of GNNs.
This paper is concerned with minibatch training and inference with GNNs that employ node-wise sampling in distributed settings.
We present SALIENT++, which extends the prior state-of-the-art SALIENT system to work with partitioned feature data.
arXiv Detail & Related papers (2023-05-04T21:04:01Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- Dynamic Split Computing for Efficient Deep Edge Intelligence [78.4233915447056]
We introduce dynamic split computing, where the optimal split location is dynamically selected based on the state of the communication channel.
We show that dynamic split computing achieves faster inference in edge computing environments where the data rate and server load vary over time (a toy sketch of such split selection appears after this list).
arXiv Detail & Related papers (2022-05-23T12:35:18Z)
- Efficient Strong Scaling Through Burst Parallel Training [13.656104138147967]
Using large GPU clusters to train deep neural network (DNN) models is becoming an essential requirement.
We present DeepPool, a system that addresses this efficiency challenge through two key ideas.
arXiv Detail & Related papers (2021-12-19T05:18:39Z)
- BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes [0.8201100713224002]
First-come, first-served (FCFS) scheduling policies result in many transient idle nodes.
We show how to realize a novel use for these otherwise wasted resources, namely, deep neural network (DNN) training.
arXiv Detail & Related papers (2021-06-22T22:53:19Z)
- Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning [61.29990368322931]
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors such as per-job resource allocations and batch sizes (a toy goodput sketch appears after this list).
Pollux reduces average job completion times by 37-50% relative to state-of-the-art DL schedulers.
arXiv Detail & Related papers (2020-08-27T16:56:48Z)
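The Dynamic Split Computing entry above selects a device-to-server split point based on the current channel state. Below is a toy Python sketch of that selection idea under invented latency and activation-size profiles; it is not the paper's actual formulation, and all numbers and names are placeholder assumptions.

from __future__ import annotations

# Cumulative on-device compute time (ms) if the device runs layers up to split i.
DEVICE_MS = [0.0, 2.0, 5.0, 9.0, 14.0, 20.0]
# Remaining server compute time (ms) for the layers after split i.
SERVER_MS = [6.0, 5.0, 4.0, 3.0, 2.0, 0.0]
# Size (KB) of the intermediate activation transmitted at split i.
ACTIVATION_KB = [600.0, 300.0, 150.0, 80.0, 40.0, 10.0]

def best_split(data_rate_kbps: float, server_slowdown: float = 1.0) -> int:
    """Return the split index minimizing device + transfer + server latency
    for the current data rate and server load."""
    def latency(i: int) -> float:
        transfer_ms = ACTIVATION_KB[i] * 8.0 / data_rate_kbps * 1000.0
        return DEVICE_MS[i] + transfer_ms + SERVER_MS[i] * server_slowdown
    return min(range(len(DEVICE_MS)), key=latency)

# A fast, lightly loaded link favors an earlier split (send more work to the
# server); a slow link or a loaded server pushes the split later on the device.
print(best_split(data_rate_kbps=500_000))  # fast channel
print(best_split(data_rate_kbps=1_000))    # slow channel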
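The Pollux entry above refers to co-optimizing inter-dependent factors; the quantity Pollux optimizes is goodput, roughly system throughput weighted by statistical efficiency. The Python sketch below illustrates only that idea: the throughput and efficiency models, constants, and names are invented placeholders, not the paper's fitted models.

from __future__ import annotations

def throughput(num_gpus: int, batch_size: int, per_gpu_rate: float = 400.0) -> float:
    """Examples/sec: more GPUs help, but small per-GPU batches under-utilize them."""
    per_gpu_batch = batch_size / num_gpus
    return num_gpus * per_gpu_rate * per_gpu_batch / (per_gpu_batch + 32.0)

def statistical_efficiency(batch_size: int, noise_scale: float = 512.0) -> float:
    """Training progress per example: very large batches hit diminishing returns."""
    return (noise_scale + 128.0) / (noise_scale + batch_size)

def goodput(num_gpus: int, batch_size: int) -> float:
    """Goodput ~= throughput x statistical efficiency (toy form of the objective)."""
    return throughput(num_gpus, batch_size) * statistical_efficiency(batch_size)

def best_config(gpu_options: list[int], batch_options: list[int]) -> tuple[int, int]:
    """Pick the (GPUs, batch size) pair with the highest modeled goodput."""
    return max(((g, b) for g in gpu_options for b in batch_options),
               key=lambda cfg: goodput(*cfg))

print(best_config(gpu_options=[1, 2, 4, 8], batch_options=[128, 256, 512, 1024]))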