Characterizing and Modeling Distributed Training with Transient Cloud
GPU Servers
- URL: http://arxiv.org/abs/2004.03072v1
- Date: Tue, 7 Apr 2020 01:49:58 GMT
- Title: Characterizing and Modeling Distributed Training with Transient Cloud
GPU Servers
- Authors: Shijian Li and Robert J. Walls and Tian Guo
- Abstract summary: We analyze distributed training performance under diverse cluster configurations using CM-DARE.
Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers.
We also demonstrate the feasibility of predicting training speed and overhead using regression-based models.
- Score: 6.56704851092678
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cloud GPU servers have become the de facto way for deep learning
practitioners to train complex models on large-scale datasets. However, it is
challenging to determine the appropriate cluster configuration---e.g., server
type and number---for different training workloads while balancing the
trade-offs in training time, cost, and model accuracy. Adding to the complexity
is the potential to reduce the monetary cost by using cheaper, but revocable,
transient GPU servers.
In this work, we analyze distributed training performance under diverse
cluster configurations using CM-DARE, a cloud-based measurement and training
framework. Our empirical datasets include measurements from three GPU types,
six geographic regions, twenty convolutional neural networks, and thousands of
Google Cloud servers. We also demonstrate the feasibility of predicting
training speed and overhead using regression-based models. Finally, we discuss
potential use cases of our performance modeling such as detecting and
mitigating performance bottlenecks.
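As a rough illustration, the regression-based modeling of training speed mentioned in the abstract can be sketched as follows; the feature set, measurements, and model choice here are illustrative assumptions only, not the authors' CM-DARE data or their actual models.
```python
# Minimal sketch of regression-based training-speed prediction in the
# spirit of the paper's modeling. All feature names and numbers below are
# hypothetical placeholders, NOT the authors' CM-DARE measurements.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-cluster-configuration features:
# [number of workers, per-GPU peak TFLOPS, network bandwidth (Gbps)]
X = np.array([
    [2,  9.3, 10],
    [4,  9.3, 10],
    [8,  9.3, 10],
    [4, 15.7, 16],
    [8, 15.7, 16],
], dtype=float)

# Hypothetical measured aggregate training speeds (images/second).
y = np.array([310.0, 600.0, 1150.0, 980.0, 1870.0])

model = LinearRegression().fit(X, y)

# Predict throughput for an unseen configuration (e.g., 6 workers on the
# faster GPU type) to guide cluster-size and cost trade-off decisions.
candidate = np.array([[6, 15.7, 16]], dtype=float)
print("Predicted images/sec:", model.predict(candidate)[0])
```
In the same spirit, a second regression fit to revocation and restart measurements could estimate the overhead of transient servers, which is the kind of prediction the abstract refers to.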
Related papers
- TensorSocket: Shared Data Loading for Deep Learning Training [0.0]
Deep learning training is a repetitive and resource-intensive process.
TensorSocket enables simultaneous training processes to share the same data loader.
Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing and increases training throughput by up to 100%.
arXiv Detail & Related papers (2024-09-27T13:39:47Z) - Effective pruning of web-scale datasets based on complexity of concept
clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
arXiv Detail & Related papers (2024-01-09T14:32:24Z) - Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z) - How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study [57.97785297481162]
We evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV, NLP, and ASR models.
We show how leveraging spot pricing enables a new cost-efficient way to train models with multiple cheap instances, trumping both more centralized and powerful hardware and even on-demand cloud offerings at competitive prices.
arXiv Detail & Related papers (2023-06-05T18:17:37Z) - Scavenger: A Cloud Service for Optimizing Cost and Performance of ML
Training [1.047192732651018]
We develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud.
By combining conventional parallel scaling concepts and new insights into SGD noise, our models accurately estimate the time and cost on different cluster configurations with 5% error.
arXiv Detail & Related papers (2023-03-12T13:42:39Z) - Decentralized Training of Foundation Models in Heterogeneous
Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z) - How Well Do Sparse Imagenet Models Transfer? [75.98123173154605]
Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" datasets.
In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset.
We show that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities.
arXiv Detail & Related papers (2021-11-26T11:58:51Z) - LCS: Learning Compressible Subspaces for Adaptive Network Compression at
Inference Time [57.52251547365967]
We propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models.
We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity.
Our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
arXiv Detail & Related papers (2021-10-08T17:03:34Z) - Sampling Training Data for Continual Learning Between Robots and the
Cloud [26.116999231118793]
We introduce HarvestNet, an intelligent sampling algorithm that resides on-board a robot and reduces system bottlenecks.
It significantly improves the accuracy of machine-learning models on our novel dataset of road construction sites, field testing of self-driving cars, and streaming face recognition.
It is 1.05-2.58x more accurate than baseline algorithms and runs scalably on embedded deep learning hardware.
arXiv Detail & Related papers (2020-12-12T05:52:33Z) - Towards Scalable Distributed Training of Deep Learning on Public Cloud
Clusters [30.4449309904155]
We propose a new top-k sparsification communication library for distributed training.
We show that our system trains 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer models (a generic sketch of top-k gradient sparsification appears after this list).
arXiv Detail & Related papers (2020-10-20T17:16:29Z)
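The top-k sparsification technique named in the last entry above can be illustrated generically as follows; this is a common gradient-compression scheme sketched under simple assumptions, not the communication library proposed in that paper.
```python
# Generic sketch of top-k gradient sparsification: before communication,
# keep only the k largest-magnitude gradient entries and their indices.
# This illustrates the general technique, not the specific library above.
import math
import torch

def topk_sparsify(grad: torch.Tensor, ratio: float = 0.01):
    """Return (values, indices) for the top-k entries of a flattened gradient."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)   # pick largest magnitudes
    return flat[indices], indices            # keep the signed values

def densify(values: torch.Tensor, indices: torch.Tensor, shape):
    """Rebuild a dense gradient (zeros elsewhere) from the sparse pair."""
    dense = torch.zeros(math.prod(shape), dtype=values.dtype)
    dense[indices] = values
    return dense.reshape(shape)

# Example: compress a fake gradient to ~1% density, then reconstruct it.
g = torch.randn(256, 128)
vals, idx = topk_sparsify(g, ratio=0.01)
g_hat = densify(vals, idx, g.shape)
print("kept entries:", vals.numel(), "of", g.numel())
```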
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.