Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey
- URL: http://arxiv.org/abs/2406.08115v1
- Date: Wed, 12 Jun 2024 11:51:44 GMT
- Title: Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey
- Authors: Feng Liang, Zhen Zhang, Haifeng Lu, Chengming Li, Victor C. M. Leung, Yanyi Guo, Xiping Hu
- Abstract summary: This survey reviews the literature, mainly from 2019 to 2024, on efficient resource allocation and workload scheduling strategies for large-scale distributed deep learning.
We highlight critical challenges for each topic and discuss key insights into existing technologies.
This survey aims to encourage computer science, artificial intelligence, and communications researchers to understand recent advances and explore future research directions.
- Score: 48.06362354403557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With rapidly increasing distributed deep learning workloads in large-scale data centers, efficient distributed deep learning framework strategies for resource allocation and workload scheduling have become the key to high-performance deep learning. The large-scale environment with large volumes of datasets, models, and computational and communication resources raises various unique challenges for resource allocation and workload scheduling in distributed deep learning, such as scheduling complexity, resource and workload heterogeneity, and fault tolerance. To uncover these challenges and corresponding solutions, this survey reviews the literature, mainly from 2019 to 2024, on efficient resource allocation and workload scheduling strategies for large-scale distributed DL. We explore these strategies by focusing on various resource types, scheduling granularity levels, and performance goals during distributed training and inference processes. We highlight critical challenges for each topic and discuss key insights of existing technologies. To illustrate practical large-scale resource allocation and workload scheduling in real distributed deep learning scenarios, we use a case study of training large language models. This survey aims to encourage computer science, artificial intelligence, and communications researchers to understand recent advances and explore future research directions for efficient framework strategies for large-scale distributed deep learning.
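To make the workload-scheduling side of the problem concrete, below is a minimal sketch of greedy gang scheduling over a heterogeneous GPU cluster. The classes, names, and locality-first heuristic are illustrative assumptions, not a method from the survey.

```python
# Minimal sketch: greedy gang scheduling of DL jobs onto heterogeneous GPU
# nodes. Illustrative only; not the survey's method.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    gpu_type: str      # e.g. "A100" or "V100"
    free_gpus: int

@dataclass
class Job:
    name: str
    gpus_needed: int   # gang requirement: all GPUs or none

def place(job: Job, nodes: list[Node]) -> list[str] | None:
    # 1. Prefer a single node (keeps the gang on one NVLink domain).
    for node in nodes:
        if node.free_gpus >= job.gpus_needed:
            node.free_gpus -= job.gpus_needed
            return [node.name]
    # 2. Otherwise spread across nodes of one GPU type, which avoids
    #    stragglers caused by hardware heterogeneity.
    for gpu_type in {n.gpu_type for n in nodes}:
        pool = [n for n in nodes if n.gpu_type == gpu_type]
        if sum(n.free_gpus for n in pool) >= job.gpus_needed:
            chosen, remaining = [], job.gpus_needed
            for n in pool:
                take = min(n.free_gpus, remaining)
                if take > 0:
                    n.free_gpus -= take
                    chosen.append(n.name)
                    remaining -= take
            return chosen
    return None  # gang scheduling: queue the job rather than run it partially

nodes = [Node("n0", "A100", 4), Node("n1", "V100", 8)]
print(place(Job("llm-pretrain", 8), nodes))   # ['n1']: single-node fit wins
```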
Related papers
- A Comprehensive Survey on Joint Resource Allocation Strategies in Federated Edge Learning [9.806901443019008]
Federated Edge Learning (FEL) enables model training in a distributed environment while preserving user privacy by keeping each user's data physically separate.
With the development of complex application scenarios such as the Internet of Things (IoT) and Smart Earth, the conventional resource allocation schemes can no longer effectively support these growing computational and communication demands.
This paper systematically addresses the multifaceted challenges of computation and communication posed by these growing, multi-resource demands.
arXiv Detail & Related papers (2024-10-10T13:02:00Z)
- Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey [43.57122822150023]
This article surveys the literature on algorithms and technologies aimed at achieving efficient communication in large-scale distributed deep learning.
We first introduce efficient algorithms for model synchronization and communication data compression in the context of large-scale distributed training.
Next, we introduce efficient strategies related to resource allocation and task scheduling for use in distributed training and inference.
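As one example of the communication data compression techniques this survey covers, here is a minimal sketch of top-k gradient sparsification in PyTorch; the function names and the fixed compression ratio are illustrative assumptions.

```python
# Minimal sketch of top-k gradient sparsification, a common communication
# data compression technique. Illustrative only.
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the largest-magnitude fraction of entries; the pair
    (values, indices) is what actually goes over the network."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices

def topk_decompress(values, indices, shape):
    """Rebuild a dense (mostly zero) gradient on the receiving worker."""
    flat = torch.zeros(torch.Size(shape).numel())
    flat[indices] = values
    return flat.reshape(shape)

g = torch.randn(4, 256)
vals, idx = topk_compress(g, ratio=0.05)    # ~5% of the original traffic
g_hat = topk_decompress(vals, idx, g.shape)
```

In practice, systems usually pair this with error feedback: the entries dropped in one round are accumulated locally and added back into the next round's gradient so the compression error does not bias training.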
arXiv Detail & Related papers (2024-04-09T08:35:04Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have demonstrated impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- A Survey on Applications of Reinforcement Learning in Spatial Resource Allocation [5.821318691099762]
The challenge of spatial resource allocation is pervasive across various domains such as transportation, industry, and daily life.
Traditional algorithms face significant computational pressure and struggle to achieve optimal efficiency and real-time performance.
In recent years, there has been a surge in novel methods employing reinforcement learning to tackle spatial resource allocation problems.
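To ground the reinforcement-learning framing, a minimal tabular Q-learning sketch for a toy spatial allocation task follows; the environment, reward, and hyperparameters are illustrative assumptions rather than any surveyed method.

```python
# Minimal tabular Q-learning sketch: assign each incoming request to one of
# R regions. Toy environment and reward; illustrative only.
import random
from collections import defaultdict

R = 4                                      # candidate regions
Q = defaultdict(lambda: [0.0] * R)         # Q[state][action]
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(load, region):
    """Assigning a request to a lightly loaded region yields a higher
    reward; the next state is the updated load vector."""
    load = list(load)
    load[region] += 1
    reward = 1.0 - load[region] / sum(load)
    return reward, tuple(load)

for episode in range(200):
    state = (0,) * R                       # start each episode empty
    for _ in range(20):                    # 20 requests per episode
        if random.random() < eps:
            a = random.randrange(R)        # explore
        else:
            a = max(range(R), key=lambda i: Q[state][i])  # exploit
        r, nxt = step(state, a)
        Q[state][a] += alpha * (r + gamma * max(Q[nxt]) - Q[state][a])
        state = nxt
```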
arXiv Detail & Related papers (2024-03-06T12:05:56Z)
- Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models [33.50873478562128]
Large Language Models (LLMs) pose challenges through their high consumption of computational, memory, energy, and financial resources.
This survey aims to systematically address these challenges by reviewing a broad spectrum of techniques designed to enhance the resource efficiency of LLMs.
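As one representative technique from this spectrum, below is a minimal sketch of post-training 8-bit affine quantization of a weight tensor; the per-tensor scheme shown is a common baseline, and all names here are illustrative.

```python
# Minimal sketch of post-training uint8 affine quantization, one widely
# used resource-efficiency technique. Illustrative baseline only.
import numpy as np

def quantize_uint8(w: np.ndarray):
    """Asymmetric per-tensor quantization: store uint8 codes plus one
    float scale and an integer zero point instead of 32-bit weights."""
    scale = (w.max() - w.min()) / 255.0
    zero = round(float(-w.min() / scale))
    q = np.clip(np.round(w / scale) + zero, 0, 255).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize_uint8(w)
print(np.abs(w - dequantize(q, s, z)).max())  # error on the order of `scale`
```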
arXiv Detail & Related papers (2024-01-01T01:12:42Z)
- The Efficiency Spectrum of Large Language Models: An Algorithmic Survey [54.19942426544731]
The rapid growth of Large Language Models (LLMs) has been a driving force in transforming various domains.
This paper examines the multi-faceted dimensions of efficiency essential for the end-to-end algorithmic development of LLMs.
arXiv Detail & Related papers (2023-12-01T16:00:25Z)
- A Review of Deep Reinforcement Learning in Serverless Computing: Function Scheduling and Resource Auto-Scaling [2.0722667822370386]
This paper presents a comprehensive review of the application of Deep Reinforcement Learning (DRL) techniques in serverless computing.
A systematic review of recent studies applying DRL to serverless computing is presented, covering various algorithms, models, and performance results.
Our analysis reveals that DRL, with its ability to learn and adapt from an environment, shows promising results in improving the efficiency of function scheduling and resource scaling.
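To illustrate how such agents are typically framed, here is a sketch of a reward function that trades off SLO violations against replica cost; the constants and the observation/action sets mentioned in the comments are illustrative assumptions, not from the reviewed papers.

```python
# Sketch of the reward shaping a DRL auto-scaler for serverless functions
# commonly uses: jointly penalize SLO violations and provisioned replicas.
# Constants and the example values are illustrative assumptions.

def autoscaler_reward(p99_latency_ms: float, replicas: int,
                      slo_ms: float = 100.0, cost_weight: float = 0.1) -> float:
    slo_penalty = max(0.0, p99_latency_ms - slo_ms) / slo_ms
    return -(slo_penalty + cost_weight * replicas)

# The agent would observe (request rate, queue depth, replicas), pick an
# action in {scale in, hold, scale out}, and train with a standard DRL
# algorithm (e.g., DQN or PPO) against this reward.
print(autoscaler_reward(150.0, 4))   # SLO violated: -0.9
print(autoscaler_reward(80.0, 4))    # within SLO, pay only cost: -0.4
```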
arXiv Detail & Related papers (2023-10-05T09:26:04Z)
- On Efficient Training of Large-Scale Deep Learning Models: A Literature Review [90.87691246153612]
The field of deep learning has witnessed significant progress, particularly in computer vision (CV), natural language processing (NLP), and speech.
The use of large-scale models trained on vast amounts of data holds immense promise for practical applications.
Given the increasing demands on computational capacity, a comprehensive summary of acceleration techniques for training deep learning models is still much needed.
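As one concrete example of such acceleration techniques, below is a minimal PyTorch sketch combining mixed-precision training with gradient accumulation; the toy model and data are stand-ins, and a CUDA device is assumed.

```python
# Minimal sketch: mixed-precision training plus gradient accumulation,
# two common training accelerations. Toy model/data; CUDA device assumed.
import torch
from torch import nn

model = nn.Linear(32, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(8, 32).cuda(), torch.randint(0, 10, (8,)).cuda())
          for _ in range(8)]

scaler = torch.cuda.amp.GradScaler()
ACCUM = 4                                  # simulate a 4x larger batch

for step, (x, y) in enumerate(loader):
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(x), y) / ACCUM
    scaler.scale(loss).backward()          # fp16 forward, scaled gradients
    if (step + 1) % ACCUM == 0:
        scaler.step(opt)                   # unscales, then optimizer step
        scaler.update()
        opt.zero_grad(set_to_none=True)
```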
arXiv Detail & Related papers (2023-04-07T11:13:23Z)
- Active Multi-Task Representation Learning [50.13453053304159]
We present the first formal study of resource task sampling by leveraging techniques from active learning.
We propose an algorithm that iteratively estimates the relevance of each source task to the target task and samples from each source task based on the estimated relevance.
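A minimal sketch of this iterative relevance-estimate-then-sample loop follows; the cosine-similarity relevance proxy and all names are illustrative assumptions, not the paper's exact estimator.

```python
# Minimal sketch of relevance-weighted source-task sampling in the spirit
# of the described algorithm. The relevance proxy (cosine similarity of
# per-task gradients to the target task) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)

def relevance(task_grads: np.ndarray, target_grad: np.ndarray) -> np.ndarray:
    """Crude relevance proxy: cosine similarity between each source task's
    average gradient and the target task's gradient."""
    num = task_grads @ target_grad
    den = np.linalg.norm(task_grads, axis=1) * np.linalg.norm(target_grad) + 1e-12
    return np.clip(num / den, 0.0, None)

def sample_batch(rel: np.ndarray, per_round: int = 64) -> np.ndarray:
    """Sample source tasks in proportion to estimated relevance."""
    p = (rel + 1e-6) / (rel + 1e-6).sum()
    return rng.choice(len(rel), size=per_round, p=p)

grads = rng.normal(size=(5, 16))     # 5 source tasks, toy gradient vectors
target = rng.normal(size=16)
for round_ in range(3):              # iterate: estimate, sample, retrain
    rel = relevance(grads, target)
    batch = sample_batch(rel)
    # ... update the shared representation on the sampled tasks here ...
```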
arXiv Detail & Related papers (2022-02-02T08:23:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.