Scavenger: A Cloud Service for Optimizing Cost and Performance of ML
Training
- URL: http://arxiv.org/abs/2303.06659v1
- Date: Sun, 12 Mar 2023 13:42:39 GMT
- Title: Scavenger: A Cloud Service for Optimizing Cost and Performance of ML
Training
- Authors: Sahil Tyagi, Prateek Sharma
- Abstract summary: We develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud.
By combining conventional parallel scaling concepts and new insights into SGD noise, our models accurately estimate the time and cost on different cluster configurations with less than 5% error.
- Score: 1.047192732651018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While the pay-as-you-go nature of cloud virtual machines (VMs) makes it easy
to spin up large clusters for training ML models, it can also lead to
ballooning costs. The hundreds of virtual machine sizes provided by cloud
platforms also make it extremely challenging to select the "right" cloud
cluster configuration for training. Furthermore, the training time and cost of
distributed model training are highly sensitive to the cluster configuration
and present a large and complex tradeoff space.
In this paper, we develop principled and practical techniques for optimizing
the training time and cost of distributed ML model training on the cloud. Our
key insight is that both parallel and statistical efficiency must be considered
when selecting the optimum job configuration parameters such as the number of
workers and the batch size. By combining conventional parallel scaling concepts
and new insights into SGD noise, our models accurately estimate the time and
cost on different cluster configurations with < 5% error. Using the repetitive
nature of training and our models, we can search for optimum cloud
configurations in a black-box, online manner. Our approach reduces training
times by 2x and costs by more than 50%. Compared to an oracle-based
approach, our performance models are accurate to within 2% such that the search
imposes an overhead of just 10%.
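To make the modeling idea above concrete, the following is a rough sketch, not the paper's actual model or code: per-iteration time is split into compute and communication terms (parallel efficiency), the number of iterations to convergence is adjusted by a gradient-noise-style term (statistical efficiency), and dollar cost follows from per-VM pricing; a brute-force search over a small candidate grid then stands in for the black-box online search described in the abstract. All function names, constants, and prices are illustrative assumptions.

```python
# A minimal, illustrative sketch (not the paper's model or code): estimate
# per-iteration time as compute + communication, scale the number of
# iterations to convergence by a gradient-noise-style statistical-efficiency
# factor, derive dollar cost from per-VM pricing, and brute-force over
# candidate configurations. All constants and prices below are assumptions.

from dataclasses import dataclass

@dataclass
class Config:
    workers: int           # number of VMs in the cluster
    batch_size: int        # global batch size
    price_per_hour: float  # assumed on-demand price per VM ($/hour)

def iter_time(cfg, t_compute_ref=1.0, t_comm_per_worker=0.02):
    """Seconds per iteration: compute work per worker shrinks as workers are
    added (for a fixed global batch), while communication cost grows."""
    compute = t_compute_ref * (cfg.batch_size / 256) / cfg.workers
    comm = t_comm_per_worker * cfg.workers
    return compute + comm

def total_iters(cfg, base_iters=100_000, noise_scale=2048):
    """Iterations to reach a target loss: larger batches need fewer steps,
    with diminishing returns once the batch exceeds the SGD noise scale
    (a simplification in the spirit of gradient-noise-scale analyses).
    Normalized so a global batch of 256 takes base_iters steps."""
    return base_iters * (1 + noise_scale / cfg.batch_size) / (1 + noise_scale / 256)

def time_and_cost(cfg):
    seconds = iter_time(cfg) * total_iters(cfg)
    dollars = seconds / 3600 * cfg.workers * cfg.price_per_hour
    return seconds, dollars

# Exhaustive search over a small candidate grid; this stands in for the
# paper's black-box online search, which refines estimates as training runs.
candidates = [Config(w, b, 0.40) for w in (2, 4, 8, 16)
              for b in (256, 512, 1024, 2048)]
cheapest = min(candidates, key=lambda c: time_and_cost(c)[1])
print(cheapest, time_and_cost(cheapest))
```

In practice, the compute and communication constants would be measured from a few profiled iterations of the actual job (exploiting the repetitive nature of training) rather than fixed up front.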
Related papers
- Effective pruning of web-scale datasets based on complexity of concept
clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
arXiv Detail & Related papers (2024-01-09T14:32:24Z) - How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study [57.97785297481162]
We evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV, NLP, and ASR models.
We show how leveraging spot pricing enables a new cost-efficient way to train models with multiple cheap instances, trumping both more centralized and powerful hardware and even on-demand cloud offerings at competitive prices.
arXiv Detail & Related papers (2023-06-05T18:17:37Z) - MILO: Model-Agnostic Subset Selection Framework for Efficient Model
Training and Tuning [68.12870241637636]
We propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training.
Our empirical results indicate that MILO can train models 3x-10x faster and tune hyperparameters 20x-75x faster than full-dataset training or tuning, without compromising performance.
arXiv Detail & Related papers (2023-01-30T20:59:30Z) - Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints [59.39280540478479]
We propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint.
We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models significantly outperform their dense counterparts on SuperGLUE and ImageNet, respectively.
arXiv Detail & Related papers (2022-12-09T18:57:37Z) - Sampling Streaming Data with Parallel Vector Quantization -- PVQ [0.0]
We present a vector quantization-based sampling method, which substantially reduces the class imbalance in data streams.
We built models using parallel processing, batch processing, and random sample selection.
We show that the accuracy of classification models improves when the data streams are pre-processed with our method.
arXiv Detail & Related papers (2022-10-04T17:59:44Z) - Decentralized Training of Foundation Models in Heterogeneous
Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z) - Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z) - Cocktail: Leveraging Ensemble Learning for Optimized Model Serving in
Public Cloud [9.149566952446058]
We propose Cocktail, a cost-effective ensembling-based model serving framework.
A prototype implementation of Cocktail on the AWS EC2 platform and exhaustive evaluations using a variety of workloads demonstrate that Cocktail can reduce deployment cost by 1.45x.
arXiv Detail & Related papers (2021-06-09T19:23:58Z) - Characterizing and Modeling Distributed Training with Transient Cloud
GPU Servers [6.56704851092678]
We analyze distributed training performance under diverse cluster configurations using CM-DARE.
Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers.
We also demonstrate the feasibility of predicting training speed and overhead using regression-based models.
arXiv Detail & Related papers (2020-04-07T01:49:58Z)
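The last entry above mentions predicting training speed with regression-based models; the sketch below illustrates that general idea with an off-the-shelf linear regression on synthetic cluster-configuration features. It is an assumption-laden illustration, not the paper's actual features or methodology.

```python
# A minimal sketch of regression-based training-speed prediction in the spirit
# of the CM-DARE entry above; the features, numbers, and model choice are
# illustrative assumptions, not the paper's methodology.
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic measurements: [num_workers, per-worker batch size, GPU memory (GB)]
X = np.array([
    [2,  32, 16],
    [4,  32, 16],
    [8,  32, 16],
    [4,  64, 32],
    [8,  64, 32],
    [16, 64, 32],
], dtype=float)
# Observed training throughput in images/sec (made-up numbers).
y = np.array([410, 790, 1480, 980, 1850, 3400], dtype=float)

model = LinearRegression().fit(X, y)
# Predict throughput for an unseen configuration: 12 workers, batch 64, 32 GB GPUs.
print(model.predict([[12, 64, 32]]))
```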