Characterizing and Modeling Distributed Training with Transient Cloud
GPU Servers
- URL: http://arxiv.org/abs/2004.03072v1
- Date: Tue, 7 Apr 2020 01:49:58 GMT
- Title: Characterizing and Modeling Distributed Training with Transient Cloud
GPU Servers
- Authors: Shijian Li and Robert J. Walls and Tian Guo
- Abstract summary: We analyze distributed training performance under diverse cluster configurations using CM-DARE.
Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers.
We also demonstrate the feasibility of predicting training speed and overhead using regression-based models.
- Score: 6.56704851092678
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cloud GPU servers have become the de facto way for deep learning
practitioners to train complex models on large-scale datasets. However, it is
challenging to determine the appropriate cluster configuration---e.g., server
type and number---for different training workloads while balancing the
trade-offs in training time, cost, and model accuracy. Adding to the complexity
is the potential to reduce the monetary cost by using cheaper, but revocable,
transient GPU servers.
In this work, we analyze distributed training performance under diverse
cluster configurations using CM-DARE, a cloud-based measurement and training
framework. Our empirical datasets include measurements from three GPU types,
six geographic regions, twenty convolutional neural networks, and thousands of
Google Cloud servers. We also demonstrate the feasibility of predicting
training speed and overhead using regression-based models. Finally, we discuss
potential use cases of our performance modeling such as detecting and
mitigating performance bottlenecks.
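As a rough illustration, the regression-based modeling of training speed mentioned in the abstract can be sketched as follows; the feature set, measurements, and model choice here are illustrative assumptions only, not the authors' CM-DARE data or their actual models.
```python
# Minimal sketch of regression-based training-speed prediction in the
# spirit of the paper's modeling. All feature names and numbers below are
# hypothetical placeholders, NOT the authors' CM-DARE measurements.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-cluster-configuration features:
# [number of workers, per-GPU peak TFLOPS, network bandwidth (Gbps)]
X = np.array([
    [2,  9.3, 10],
    [4,  9.3, 10],
    [8,  9.3, 10],
    [4, 15.7, 16],
    [8, 15.7, 16],
], dtype=float)

# Hypothetical measured aggregate training speeds (images/second).
y = np.array([310.0, 600.0, 1150.0, 980.0, 1870.0])

model = LinearRegression().fit(X, y)

# Predict throughput for an unseen configuration (e.g., 6 workers on the
# faster GPU type) to guide cluster-size and cost trade-off decisions.
candidate = np.array([[6, 15.7, 16]], dtype=float)
print("Predicted images/sec:", model.predict(candidate)[0])
```
In the same spirit, a second regression fit to revocation and restart measurements could estimate the overhead of transient servers, which is the kind of prediction the abstract refers to.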
Related papers
- TensorSocket: Shared Data Loading for Deep Learning Training [0.0]
Deep learning training is a repetitive and resource-intensive process.
TensorSocket enables simultaneous training processes to share the same data loader.
Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing and increases training throughput by up to 100%.
arXiv Detail & Related papers (2024-09-27T13:39:47Z) - Effective pruning of web-scale datasets based on complexity of concept
clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
arXiv Detail & Related papers (2024-01-09T14:32:24Z) - Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z) - How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study [57.97785297481162]
We evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV, NLP, and ASR models.
We show how leveraging spot pricing enables a new cost-efficient way to train models with multiple cheap instances, trumping both more centralized and powerful hardware and even on-demand cloud offerings at competitive prices.
arXiv Detail & Related papers (2023-06-05T18:17:37Z) - Scavenger: A Cloud Service for Optimizing Cost and Performance of ML
Training [1.047192732651018]
We develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud.
By combining conventional parallel scaling concepts and new insights into SGD noise, our models accurately estimate the time and cost on different cluster configurations with 5% error.
arXiv Detail & Related papers (2023-03-12T13:42:39Z) - Decentralized Training of Foundation Models in Heterogeneous
Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z) - How Well Do Sparse Imagenet Models Transfer? [75.98123173154605]
Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" datasets.
In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset.
We show that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities.
arXiv Detail & Related papers (2021-11-26T11:58:51Z) - LCS: Learning Compressible Subspaces for Adaptive Network Compression at
Inference Time [57.52251547365967]
We propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models.
We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity.
Our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
arXiv Detail & Related papers (2021-10-08T17:03:34Z) - Sampling Training Data for Continual Learning Between Robots and the
Cloud [26.116999231118793]
We introduce HarvestNet, an intelligent sampling algorithm that resides on-board a robot and reduces system bottlenecks.
It significantly improves the accuracy of machine-learning models on our novel dataset of road construction sites, field testing of self-driving cars, and streaming face recognition.
It is 1.05-2.58x more accurate than baseline algorithms and runs scalably on embedded deep learning hardware.
arXiv Detail & Related papers (2020-12-12T05:52:33Z) - Towards Scalable Distributed Training of Deep Learning on Public Cloud
Clusters [30.4449309904155]
We propose a new top-k sparsification communication library for distributed training.
We show that our system trains 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer models (a generic sketch of top-k gradient sparsification appears after this list).
arXiv Detail & Related papers (2020-10-20T17:16:29Z)
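The top-k sparsification technique named in the last entry above can be illustrated generically as follows; this is a common gradient-compression scheme sketched under simple assumptions, not the communication library proposed in that paper.
```python
# Generic sketch of top-k gradient sparsification: before communication,
# keep only the k largest-magnitude gradient entries and their indices.
# This illustrates the general technique, not the specific library above.
import math
import torch

def topk_sparsify(grad: torch.Tensor, ratio: float = 0.01):
    """Return (values, indices) for the top-k entries of a flattened gradient."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)   # pick largest magnitudes
    return flat[indices], indices            # keep the signed values

def densify(values: torch.Tensor, indices: torch.Tensor, shape):
    """Rebuild a dense gradient (zeros elsewhere) from the sparse pair."""
    dense = torch.zeros(math.prod(shape), dtype=values.dtype)
    dense[indices] = values
    return dense.reshape(shape)

# Example: compress a fake gradient to ~1% density, then reconstruct it.
g = torch.randn(256, 128)
vals, idx = topk_sparsify(g, ratio=0.01)
g_hat = densify(vals, idx, g.shape)
print("kept entries:", vals.numel(), "of", g.numel())
```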
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.