A Simulation Platform for Multi-tenant Machine Learning Services on
Thousands of GPUs
- URL: http://arxiv.org/abs/2201.03175v1
- Date: Mon, 10 Jan 2022 06:00:11 GMT
- Title: A Simulation Platform for Multi-tenant Machine Learning Services on
Thousands of GPUs
- Authors: Ruofan Liang, Bingsheng He, Shengen Yan, Peng Sun
- Abstract summary: AnalySIM is a cluster simulator that allows efficient design explorations for multi-tenant machine learning services.
It can easily test and analyze various scheduling policies across a number of performance metrics, such as GPU resource utilization.
We find that preemption and migration are able to significantly reduce average job completion time.
- Score: 38.92672037891692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-tenant machine learning services have emerged as data-intensive
workloads in data centers, with heavy usage of GPU resources. Due to their large
scale, many tuning parameters, and heavy resource usage, it is usually
impractical to evaluate and benchmark those machine learning services on real
clusters. In this demonstration, we present AnalySIM, a cluster simulator that
allows efficient design explorations for multi-tenant machine learning
services. Specifically, through trace-driven cluster workload simulation, AnalySIM
can easily test and analyze various scheduling policies across a number of
performance metrics, such as GPU resource utilization. AnalySIM simulates the
cluster's computational resources based on both physical topology and logical
partitions. The tool has been used at SenseTime to understand the impact of
different scheduling policies with a trace from a real production cluster of
over 1000 GPUs. We find that preemption and migration are able to significantly
reduce average job completion time and mitigate the resource fragmentation
problem.
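AnalySIM's implementation is not included in this abstract; as a minimal illustration of the trace-driven simulation idea it describes, the sketch below replays a toy job trace through a non-preemptive FIFO scheduler on a fixed GPU pool and reports average job completion time (JCT). All names (`Job`, `simulate_fifo`) and the discrete-time model are simplifying assumptions of ours, not AnalySIM's API:

```python
from dataclasses import dataclass

@dataclass
class Job:
    arrival: int   # time step at which the job enters the queue
    gpus: int      # GPUs requested (assumed <= cluster size)
    duration: int  # time steps the job runs once started

def simulate_fifo(trace, total_gpus):
    """Replay a job trace through a non-preemptive FIFO scheduler.

    Discrete-time simulation: at each step, finished jobs release their
    GPUs, then head-of-line jobs that fit are started (no backfilling).
    Returns the average job completion time (finish - arrival).
    """
    pending = sorted(trace, key=lambda j: j.arrival)
    queue, running, jcts = [], [], []   # running: (finish, gpus, arrival)
    free, t, i = total_gpus, 0, 0
    while i < len(pending) or queue or running:
        # admit jobs whose arrival time has come
        while i < len(pending) and pending[i].arrival <= t:
            queue.append(pending[i])
            i += 1
        # release GPUs held by jobs that have finished
        done = [r for r in running if r[0] <= t]
        running = [r for r in running if r[0] > t]
        for finish, gpus, arrival in done:
            free += gpus
            jcts.append(finish - arrival)
        # FIFO with head-of-line blocking: start jobs in order while they fit
        while queue and queue[0].gpus <= free:
            job = queue.pop(0)
            free -= job.gpus
            running.append((t + job.duration, job.gpus, job.arrival))
        t += 1
    return sum(jcts) / len(jcts)
```

Alternative policies, such as the preemption and migration the abstract finds effective, could be compared against this baseline by replacing the job-start loop; the head-of-line blocking above is also the source of the resource fragmentation such policies mitigate.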
Related papers
- GEqO: ML-Accelerated Semantic Equivalence Detection [3.5521901508676774]
Common computation is crucial for efficient cluster resource utilization and reducing job execution time.
Detecting equivalence on large-scale analytics engines requires efficient, scalable, and fully automated solutions.
We propose GEqO, a portable and lightweight machine-learning-based framework for efficiently identifying semantically equivalent computations at scale.
arXiv Detail & Related papers (2024-01-02T16:37:42Z)
- Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation [17.827002299991285]
Reinforcement learning is time-consuming for complex tasks due to the need for large amounts of training data.
Recent advances in GPU-based simulation, such as Isaac Gym, have sped up data collection thousands of times on a commodity GPU.
This paper presents a Parallel $Q$-Learning scheme that outperforms PPO in wall-clock time.
arXiv Detail & Related papers (2023-07-24T17:59:37Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- SimCS: Simulation for Domain Incremental Online Continual Segmentation [60.18777113752866]
Existing continual learning approaches mostly focus on image classification in the class-incremental setup.
We propose SimCS, a parameter-free method complementary to existing ones that uses simulated data to regularize continual learning.
arXiv Detail & Related papers (2022-11-29T14:17:33Z)
- Aryl: An Elastic Cluster Scheduler for Deep Learning [12.942546041713596]
We introduce Aryl, a new cluster scheduler to address problems for both training and inference.
Aryl introduces capacity loaning, which lends idle inference servers to training jobs.
It improves cluster usage by up to 26.9% over the cluster scheduler without capacity loaning or elastic scaling.
arXiv Detail & Related papers (2022-02-16T07:03:25Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Basic cross-platform tensor frameworks and script-language engines do not, by themselves, supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems. In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all these requirements while using only such basic components.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters [10.38396444951436]
Training Deep Neural Networks (DNNs) is a popular workload in both enterprises and cloud data centers.
We propose Synergy, a resource-sensitive scheduler for shared GPU clusters.
Our experiments show that workload-aware CPU and memory allocations can improve average JCT by up to 3.4x compared to traditional GPU-proportional scheduling.
arXiv Detail & Related papers (2021-10-12T15:25:54Z)
- Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show, that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
- Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers [6.56704851092678]
We analyze distributed training performance under diverse cluster configurations using CM-DARE.
Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers.
We also demonstrate the feasibility of predicting training speed and overhead using regression-based models.
arXiv Detail & Related papers (2020-04-07T01:49:58Z)
- On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points.
We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv Detail & Related papers (2020-02-15T23:25:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.