Related papers: Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

URL: http://arxiv.org/abs/2109.01313v2
Date: Mon, 6 Sep 2021 01:26:38 GMT
Title: Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters
Authors: Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, Tianwei Zhang
Abstract summary: We present a comprehensive study about the characteristics of Deep Learning jobs and resource management. We introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design: a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5x; and a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.
Score: 30.952491139350908
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design: a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5x; and a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.

Related papers

Scaling DRL for Decision Making: A Survey on Data, Network, and Training Budget Strategies [66.83950068218033]
Scaling Laws demonstrate that scaling model parameters and training data enhances learning performance.<n>Despite its potential to improve performance, the integration of scaling laws into deep reinforcement learning has not been fully realized.<n>This review addresses this gap by systematically analyzing scaling strategies in three dimensions: data, network, and training budget.
arXiv Detail & Related papers (2025-08-05T08:03:12Z)
Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey [48.06362354403557]
This survey reviews the literature, mainly from 2019 to 2024, on efficient resource allocation and workload scheduling strategies for large-scale distributed DL. We highlight critical challenges for each topic and discuss key insights of existing technologies. This survey aims to encourage computer science, artificial intelligence, and communications researchers to understand recent advances.
arXiv Detail & Related papers (2024-06-12T11:51:44Z)
Deep Learning for Trajectory Data Management and Mining: A Survey and Beyond [58.63558696061679]
Trajectory computing is crucial in various practical applications such as location services, urban traffic, and public safety. We present a review of development and recent advances in deep learning for trajectory computing (DL4Traj) Notably, we encapsulate recent advancements in Large Language Models (LLMs) that hold potential to augment trajectory computing.
arXiv Detail & Related papers (2024-03-21T05:57:27Z)
Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs. We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
A Review of Deep Reinforcement Learning in Serverless Computing: Function Scheduling and Resource Auto-Scaling [2.0722667822370386]
This paper presents a comprehensive review of the application of Deep Reinforcement Learning (DRL) techniques in serverless computing. A systematic review of recent studies applying DRL to serverless computing is presented, covering various algorithms, models, and performances. Our analysis reveals that DRL, with its ability to learn and adapt from an environment, shows promising results in improving the efficiency of function scheduling and resource scaling.
arXiv Detail & Related papers (2023-10-05T09:26:04Z)
On Efficient Training of Large-Scale Deep Learning Models: A Literature Review [90.87691246153612]
The field of deep learning has witnessed significant progress, particularly in computer vision (CV), natural language processing (NLP), and speech. The use of large-scale models trained on vast amounts of data holds immense promise for practical applications. With the increasing demands on computational capacity, a comprehensive summarization on acceleration techniques of training deep learning models is still much anticipated.
arXiv Detail & Related papers (2023-04-07T11:13:23Z)
Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision [23.09494338914838]
This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads from the scheduling objectives and resource consumption features.
arXiv Detail & Related papers (2022-05-24T09:18:06Z)
The MIT Supercloud Workload Classification Challenge [10.458111248130944]
In this paper, we present a workload classification challenge based on the MIT Supercloud dataset. The goal of this challenge is to foster algorithmic innovations in the analysis of compute workloads.
arXiv Detail & Related papers (2022-04-12T14:28:04Z)
Dif-MAML: Decentralized Multi-Agent Meta-Learning [54.39661018886268]
We propose a cooperative multi-agent meta-learning algorithm, referred to as MAML or Dif-MAML. We show that the proposed strategy allows a collection of agents to attain agreement at a linear rate and to converge to a stationary point of the aggregate MAML. Simulation results illustrate the theoretical findings and the superior performance relative to the traditional non-cooperative setting.
arXiv Detail & Related papers (2020-10-06T16:51:09Z)
A Privacy-Preserving Distributed Architecture for Deep-Learning-as-a-Service [68.84245063902908]
This paper introduces a novel distributed architecture for deep-learning-as-a-service. It is able to preserve the user sensitive data while providing Cloud-based machine and deep learning services.
arXiv Detail & Related papers (2020-03-30T15:12:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.