Scavenger: A Cloud Service for Optimizing Cost and Performance of ML
Training
- URL: http://arxiv.org/abs/2303.06659v1
- Date: Sun, 12 Mar 2023 13:42:39 GMT
- Title: Scavenger: A Cloud Service for Optimizing Cost and Performance of ML
Training
- Authors: Sahil Tyagi, Prateek Sharma
- Abstract summary: We develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud.
By combining conventional parallel scaling concepts and new insights into SGD noise, our models accurately estimate the time and cost on different cluster configurations with less than 5% error.
- Score: 1.047192732651018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While the pay-as-you-go nature of cloud virtual machines (VMs) makes it easy
to spin up large clusters for training ML models, it can also lead to
ballooning costs. The hundreds of virtual machine sizes provided by cloud
platforms also make it extremely challenging to select the "right" cloud
cluster configuration for training. Furthermore, the training time and cost of
distributed model training are highly sensitive to the cluster configuration
and present a large and complex tradeoff space.
In this paper, we develop principled and practical techniques for optimizing
the training time and cost of distributed ML model training on the cloud. Our
key insight is that both parallel and statistical efficiency must be considered
when selecting the optimum job configuration parameters such as the number of
workers and the batch size. By combining conventional parallel scaling concepts
and new insights into SGD noise, our models accurately estimate the time and
cost on different cluster configurations with < 5% error. Using the repetitive
nature of training and our models, we can search for optimum cloud
configurations in a black-box, online manner. Our approach reduces training
times by 2x and costs by more than 50%. Compared to an oracle-based
approach, our performance models are accurate to within 2% such that the search
imposes an overhead of just 10%.
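To make the modeling idea above concrete, the following is a rough sketch, not the paper's actual model or code: per-iteration time is split into compute and communication terms (parallel efficiency), the number of iterations to convergence is adjusted by a gradient-noise-style term (statistical efficiency), and dollar cost follows from per-VM pricing; a brute-force search over a small candidate grid then stands in for the black-box online search described in the abstract. All function names, constants, and prices are illustrative assumptions.

```python
# A minimal, illustrative sketch (not the paper's model or code): estimate
# per-iteration time as compute + communication, scale the number of
# iterations to convergence by a gradient-noise-style statistical-efficiency
# factor, derive dollar cost from per-VM pricing, and brute-force over
# candidate configurations. All constants and prices below are assumptions.

from dataclasses import dataclass

@dataclass
class Config:
    workers: int           # number of VMs in the cluster
    batch_size: int        # global batch size
    price_per_hour: float  # assumed on-demand price per VM ($/hour)

def iter_time(cfg, t_compute_ref=1.0, t_comm_per_worker=0.02):
    """Seconds per iteration: compute work per worker shrinks as workers are
    added (for a fixed global batch), while communication cost grows."""
    compute = t_compute_ref * (cfg.batch_size / 256) / cfg.workers
    comm = t_comm_per_worker * cfg.workers
    return compute + comm

def total_iters(cfg, base_iters=100_000, noise_scale=2048):
    """Iterations to reach a target loss: larger batches need fewer steps,
    with diminishing returns once the batch exceeds the SGD noise scale
    (a simplification in the spirit of gradient-noise-scale analyses).
    Normalized so a global batch of 256 takes base_iters steps."""
    return base_iters * (1 + noise_scale / cfg.batch_size) / (1 + noise_scale / 256)

def time_and_cost(cfg):
    seconds = iter_time(cfg) * total_iters(cfg)
    dollars = seconds / 3600 * cfg.workers * cfg.price_per_hour
    return seconds, dollars

# Exhaustive search over a small candidate grid; this stands in for the
# paper's black-box online search, which refines estimates as training runs.
candidates = [Config(w, b, 0.40) for w in (2, 4, 8, 16)
              for b in (256, 512, 1024, 2048)]
cheapest = min(candidates, key=lambda c: time_and_cost(c)[1])
print(cheapest, time_and_cost(cheapest))
```

In practice, the compute and communication constants would be measured from a few profiled iterations of the actual job (exploiting the repetitive nature of training) rather than fixed up front.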
Related papers
- Effective pruning of web-scale datasets based on complexity of concept
clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
arXiv Detail & Related papers (2024-01-09T14:32:24Z) - How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study [57.97785297481162]
We evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV, NLP, and ASR models.
We show how leveraging spot pricing enables a new cost-efficient way to train models with multiple cheap instances, trumping both more centralized and powerful hardware and even on-demand cloud offerings at competitive prices.
arXiv Detail & Related papers (2023-06-05T18:17:37Z) - MILO: Model-Agnostic Subset Selection Framework for Efficient Model
Training and Tuning [68.12870241637636]
We propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training.
Our empirical results indicate that MILO can train models 3x-10x faster and tune hyperparameters 20x-75x faster than full-dataset training or tuning, without compromising performance.
arXiv Detail & Related papers (2023-01-30T20:59:30Z) - Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints [59.39280540478479]
We propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint.
We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models significantly outperform their dense counterparts on SuperGLUE and ImageNet, respectively.
arXiv Detail & Related papers (2022-12-09T18:57:37Z) - Sampling Streaming Data with Parallel Vector Quantization -- PVQ [0.0]
We present a vector quantization-based sampling method, which substantially reduces the class imbalance in data streams.
We built models using parallel processing, batch processing, and random sample selection.
We show that the accuracy of classification models improves when the data streams are pre-processed with our method.
arXiv Detail & Related papers (2022-10-04T17:59:44Z) - Decentralized Training of Foundation Models in Heterogeneous
Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z) - Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z) - Cocktail: Leveraging Ensemble Learning for Optimized Model Serving in
Public Cloud [9.149566952446058]
We propose Cocktail, a cost-effective ensembling-based model serving framework.
A prototype implementation of Cocktail on the AWS EC2 platform and exhaustive evaluations using a variety of workloads demonstrate that Cocktail can reduce deployment cost by 1.45x.
arXiv Detail & Related papers (2021-06-09T19:23:58Z) - Characterizing and Modeling Distributed Training with Transient Cloud
GPU Servers [6.56704851092678]
We analyze distributed training performance under diverse cluster configurations using CM-DARE.
Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers.
We also demonstrate the feasibility of predicting training speed and overhead using regression-based models.
arXiv Detail & Related papers (2020-04-07T01:49:58Z)
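The last entry above mentions predicting training speed with regression-based models; the sketch below illustrates that general idea with an off-the-shelf linear regression on synthetic cluster-configuration features. It is an assumption-laden illustration, not the paper's actual features or methodology.

```python
# A minimal sketch of regression-based training-speed prediction in the spirit
# of the CM-DARE entry above; the features, numbers, and model choice are
# illustrative assumptions, not the paper's methodology.
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic measurements: [num_workers, per-worker batch size, GPU memory (GB)]
X = np.array([
    [2,  32, 16],
    [4,  32, 16],
    [8,  32, 16],
    [4,  64, 32],
    [8,  64, 32],
    [16, 64, 32],
], dtype=float)
# Observed training throughput in images/sec (made-up numbers).
y = np.array([410, 790, 1480, 980, 1850, 3400], dtype=float)

model = LinearRegression().fit(X, y)
# Predict throughput for an unseen configuration: 12 workers, batch 64, 32 GB GPUs.
print(model.predict([[12, 64, 32]]))
```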