Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy,
Challenges and Vision
- URL: http://arxiv.org/abs/2205.11913v2
- Date: Wed, 25 May 2022 06:24:54 GMT
- Title: Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy,
Challenges and Vision
- Authors: Wei Gao, Qinghao Hu, Zhisheng Ye, Peng Sun, Xiaolin Wang, Yingwei Luo,
Tianwei Zhang, Yonggang Wen
- Abstract summary: This paper surveys existing research efforts for both training and inference workloads.
We primarily present how existing schedulers support these workloads, organized by their scheduling objectives and resource consumption features.
- Score: 23.09494338914838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning (DL) has flourished in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure.
Hence, dedicated GPU accelerators have been aggregated into GPU datacenters. An
efficient scheduler design for such a GPU datacenter is crucial for reducing
operational cost and improving resource utilization. However, traditional
approaches designed for big-data or high-performance-computing workloads cannot
enable DL workloads to fully utilize GPU resources. Recently, a substantial
number of schedulers have been proposed, tailored to DL workloads in GPU
datacenters. This paper surveys existing research efforts for both training and
inference workloads. We primarily present how existing schedulers support the
respective workloads, organized by their scheduling objectives and resource
consumption features. Finally, we discuss several promising future research
directions. A more detailed summary, with links to the surveyed papers and
code, is available on our project website:
https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers
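As a concrete illustration of the kind of scheduling objective the survey discusses, the following is a minimal, hypothetical sketch (not taken from any surveyed scheduler) of a cluster scheduler that admits pending training jobs in order of their estimated remaining service time; all class names, fields, and numbers are invented for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    # Jobs compare (and are heap-ordered) by estimated remaining GPU time in
    # seconds, a shortest-service-first objective aimed at low average JCT.
    est_remaining_s: float
    name: str = field(compare=False, default="job")
    gpus_needed: int = field(compare=False, default=1)


class ShortestServiceFirstScheduler:
    """Toy scheduler: admit the shortest pending job whenever enough GPUs are free."""

    def __init__(self, total_gpus: int):
        self.free_gpus = total_gpus
        self.pending: list[Job] = []  # min-heap keyed by est_remaining_s
        self.running: list[Job] = []

    def submit(self, job: Job) -> None:
        heapq.heappush(self.pending, job)

    def step(self) -> None:
        # Admit jobs in shortest-first order while enough GPUs remain free.
        while self.pending and self.pending[0].gpus_needed <= self.free_gpus:
            job = heapq.heappop(self.pending)
            self.free_gpus -= job.gpus_needed
            self.running.append(job)
            print(f"started {job.name} on {job.gpus_needed} GPU(s)")

    def finish(self, job: Job) -> None:
        # Called when a running job completes; its GPUs become free again.
        self.running.remove(job)
        self.free_gpus += job.gpus_needed


if __name__ == "__main__":
    sched = ShortestServiceFirstScheduler(total_gpus=8)
    sched.submit(Job(est_remaining_s=3600, name="bert-finetune", gpus_needed=4))
    sched.submit(Job(est_remaining_s=600, name="resnet-debug", gpus_needed=1))
    sched.step()  # "resnet-debug" (the shorter job) is admitted first
```

The schedulers surveyed in the paper additionally handle placement locality, elasticity, preemption, and fairness; this sketch only illustrates a single objective (low average job completion time).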
Related papers
- Deep Learning for Trajectory Data Management and Mining: A Survey and Beyond [58.63558696061679]
Trajectory computing is crucial in various practical applications such as location services, urban traffic, and public safety.
We present a review of the development and recent advances of deep learning for trajectory computing (DL4Traj).
Notably, we encapsulate recent advancements in Large Language Models (LLMs) that hold the potential to augment trajectory computing.
arXiv Detail & Related papers (2024-03-21T05:57:27Z) - Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and variability across heterogeneous peers and devices.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - SoTaNa: The Open-Source Software Development Assistant [81.86136560157266]
SoTaNa is an open-source software development assistant.
It generates high-quality instruction-based data for the domain of software engineering.
It employs a parameter-efficient fine-tuning approach to enhance the open-source foundation model, LLaMA.
arXiv Detail & Related papers (2023-08-25T14:56:21Z) - A Frequency-aware Software Cache for Large Recommendation System
Embeddings [11.873521953539361]
Deep learning recommendation models (DLRMs) have been widely applied in Internet companies.
We propose a GPU-based software cache approach to dynamically manage the embedding table across CPU and GPU memory (a toy sketch of this idea appears after the list below).
The proposed software cache efficiently supports training entire DLRMs on the GPU with synchronized updates.
arXiv Detail & Related papers (2022-08-08T12:08:05Z) - ElegantRL-Podracer: Scalable and Elastic Library for Cloud-Native Deep
Reinforcement Learning [141.58588761593955]
We present a library ElegantRL-podracer for cloud-native deep reinforcement learning.
It efficiently supports millions of cores to carry out massively parallel training at multiple levels.
At the low level, each pod simulates agent-environment interactions in parallel by fully utilizing the nearly 7,000 GPU cores of a single GPU.
arXiv Detail & Related papers (2021-12-11T06:31:21Z) - Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters [10.38396444951436]
Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers.
We propose Synergy, a resource-sensitive scheduler for shared GPU clusters.
Our experiments show that workload-aware CPU and memory allocation can improve average job completion time (JCT) by up to 3.4x compared to traditional GPU-proportional scheduling (the two allocation policies are contrasted in a sketch after the list below).
arXiv Detail & Related papers (2021-10-12T15:25:54Z) - Characterization and Prediction of Deep Learning Workloads in
Large-Scale GPU Datacenters [30.952491139350908]
We present a comprehensive study of the characteristics of deep learning jobs and resource management.
We introduce a general-purpose framework, which manages resources based on historical data.
As case studies, we design a Quasi-Shortest-Service-First scheduling service, which can reduce the cluster-wide average job completion time by up to 6.5x, and a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.
arXiv Detail & Related papers (2021-09-03T05:02:52Z) - Providing Meaningful Data Summarizations Using Examplar-based Clustering
in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single precision and up to 452x using half precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how the resulting summaries help steer this specific process to cut costs and reduce the manufacturing of defective parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z) - Understanding Training Efficiency of Deep Learning Recommendation Models
at Scale [8.731263641794897]
This paper explains the intricacies of using GPUs for training recommendation models, the factors affecting hardware efficiency at scale, and learnings from a new scale-up GPU server design, Zion.
arXiv Detail & Related papers (2020-11-11T01:21:43Z) - Importance of Data Loading Pipeline in Training Deep Neural Networks [2.127049691404299]
In large models, the time spent loading data takes a significant portion of model training time.
We evaluate a binary data format to accelerate data reading and NVIDIA DALI to accelerate data augmentation.
Our study shows improvements on the order of 20% to 40% when such dedicated tools are used.
arXiv Detail & Related papers (2020-04-21T14:19:48Z)
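The frequency-aware software cache paper above keeps frequently accessed ("hot") embedding rows in GPU memory and spills cold rows to host memory. Below is a rough, framework-free sketch of that idea under the assumption of a simple least-frequently-used eviction policy; all names, capacities, and the plain-Python stand-ins for device and host memory are illustrative, not the authors' implementation.

```python
from collections import Counter


class FrequencyAwareEmbeddingCache:
    """Toy cache: keep the most frequently accessed embedding rows in a small
    'GPU' store and evict the least frequently used rows back to 'CPU' memory.
    Real systems move tensors between device and host memory; plain dicts
    stand in for both here."""

    def __init__(self, gpu_capacity: int, embedding_dim: int = 4):
        self.gpu_capacity = gpu_capacity
        self.embedding_dim = embedding_dim
        self.gpu_rows: dict[int, list[float]] = {}  # hot rows, "device" memory
        self.cpu_rows: dict[int, list[float]] = {}  # cold rows, "host" memory
        self.freq = Counter()                       # access counts per row id

    def _new_row(self) -> list[float]:
        return [0.0] * self.embedding_dim

    def lookup(self, row_id: int) -> list[float]:
        self.freq[row_id] += 1
        if row_id in self.gpu_rows:
            return self.gpu_rows[row_id]            # cache hit
        row = self.cpu_rows.pop(row_id, None) or self._new_row()
        if len(self.gpu_rows) >= self.gpu_capacity:
            # Evict the least frequently used resident row to host memory.
            victim = min(self.gpu_rows, key=lambda rid: self.freq[rid])
            self.cpu_rows[victim] = self.gpu_rows.pop(victim)
        self.gpu_rows[row_id] = row
        return row


if __name__ == "__main__":
    cache = FrequencyAwareEmbeddingCache(gpu_capacity=2)
    for rid in [1, 1, 1, 2, 3, 1, 2]:  # skewed access pattern typical of DLRMs
        cache.lookup(rid)
    print("rows resident on GPU:", sorted(cache.gpu_rows))
```

Because DLRM embedding accesses are highly skewed, a small frequency-managed device-resident store can serve most lookups while the full table stays in the larger host memory.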
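The Synergy entry above contrasts GPU-proportional allocation with workload-aware allocation of CPU and memory. The sketch below illustrates only that contrast; the profiled per-GPU demands, node sizes, and function names are invented assumptions, and the real system additionally profiles jobs and packs them onto cluster nodes.

```python
from dataclasses import dataclass


@dataclass
class Node:
    gpus: int = 8
    cpus: int = 64
    mem_gb: int = 512


@dataclass
class JobRequest:
    name: str
    gpus: int
    # Profiled per-GPU demands; a workload-aware scheduler would measure these
    # (e.g., via a short profiling run). The values used below are made up.
    cpus_per_gpu_needed: float
    mem_gb_per_gpu_needed: float


def gpu_proportional_alloc(node: Node, req: JobRequest) -> tuple[float, float]:
    """Classic policy: hand out CPU and memory strictly in proportion to GPUs."""
    share = req.gpus / node.gpus
    return node.cpus * share, node.mem_gb * share


def workload_aware_alloc(node: Node, req: JobRequest) -> tuple[float, float]:
    """Give each job what its profile says it needs (capped by the node),
    leaving leftover CPU and memory for co-located, hungrier jobs."""
    cpus = min(req.cpus_per_gpu_needed * req.gpus, node.cpus)
    mem = min(req.mem_gb_per_gpu_needed * req.gpus, node.mem_gb)
    return cpus, mem


if __name__ == "__main__":
    node = Node()
    # An image model that is data-loading heavy vs. a language model that is not.
    image_job = JobRequest("image-classifier", gpus=2,
                           cpus_per_gpu_needed=12, mem_gb_per_gpu_needed=96)
    text_job = JobRequest("language-model", gpus=2,
                          cpus_per_gpu_needed=2, mem_gb_per_gpu_needed=16)
    for job in (image_job, text_job):
        print(job.name,
              "proportional:", gpu_proportional_alloc(node, job),
              "workload-aware:", workload_aware_alloc(node, job))
```

The point of the contrast is that proportional allocation gives both jobs identical CPU and memory shares, while the workload-aware policy frees up host resources from the light job so that a co-located data-hungry job is not starved.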