Machine Learning on Volatile Instances
- URL: http://arxiv.org/abs/2003.05649v1
- Date: Thu, 12 Mar 2020 07:47:34 GMT
- Title: Machine Learning on Volatile Instances
- Authors: Xiaoxi Zhang, Jianyu Wang, Gauri Joshi, and Carlee Joe-Wong
- Abstract summary: This work is the first to quantify how variations in the number of active worker nodes (as a result of preemption) affect SGD convergence and the time to train the model.
We propose cost-effective strategies to exploit volatile cloud instances that are cheaper than standard instances.
- Score: 40.19551148721116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the massive size of the neural network models and training datasets
used in machine learning today, it is imperative to distribute stochastic
gradient descent (SGD) by splitting up tasks such as gradient evaluation across
multiple worker nodes. However, running distributed SGD can be prohibitively
expensive because it may require specialized computing resources such as GPUs
for extended periods of time. We propose cost-effective strategies to exploit
volatile cloud instances that are cheaper than standard instances, but may be
interrupted by higher priority workloads. To the best of our knowledge, this
work is the first to quantify how variations in the number of active worker
nodes (as a result of preemption) affect SGD convergence and the time to train
the model. By understanding these trade-offs between preemption probability of
the instances, accuracy, and training time, we are able to derive practical
strategies for configuring distributed SGD jobs on volatile instances such as
Amazon EC2 spot instances and other preemptible cloud instances. Experimental
results show that our strategies achieve good training performance at
substantially lower cost.
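The key quantity in the abstract is how the number of workers that survive preemption in each iteration changes the averaged gradient, and hence convergence and training time. The toy simulation below is a hedged illustration of that setting rather than the authors' code: the least-squares task, worker count, per-iteration preemption probability, and learning rate are all assumed for the example.

```python
# Hypothetical sketch (not the paper's code): synchronous distributed SGD on a
# least-squares problem where spot-style workers are preempted at random, so
# the number of active workers -- and the effective batch size -- varies per round.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression task.
n_samples, dim = 5000, 20
X = rng.normal(size=(n_samples, dim))
w_star = rng.normal(size=dim)
y = X @ w_star + 0.1 * rng.normal(size=n_samples)

num_workers = 8        # provisioned volatile (e.g., spot) instances -- assumed
preempt_prob = 0.3     # assumed per-iteration preemption probability
batch_per_worker = 32
lr = 0.05

# Each worker holds a disjoint shard of the data.
shards = np.array_split(rng.permutation(n_samples), num_workers)

w = np.zeros(dim)
for step in range(200):
    # A worker contributes a gradient only if it was not preempted this round.
    active = rng.random(num_workers) > preempt_prob
    if not active.any():
        continue  # no surviving workers: skip the update this round
    grads = []
    for k in np.flatnonzero(active):
        idx = rng.choice(shards[k], size=batch_per_worker, replace=False)
        residual = X[idx] @ w - y[idx]
        grads.append(X[idx].T @ residual / batch_per_worker)
    # Average only over surviving workers; fewer workers means a noisier
    # gradient estimate, which is the convergence-vs-cost trade-off studied here.
    w -= lr * np.mean(grads, axis=0)

print("distance to optimum:", np.linalg.norm(w - w_star))
```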
Related papers
- FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models [35.40065954148091]
FINE is a method based on the Learngene framework for initializing downstream networks by leveraging pre-trained models.
It decomposes pre-trained knowledge into a product of matrices (i.e., $U$, $\Sigma$, and $V$), where $U$ and $V$ are shared across network blocks as "learngenes".
It consistently outperforms direct pre-training, particularly for smaller models, achieving state-of-the-art results across variable model sizes.
arXiv Detail & Related papers (2024-09-28T08:57:17Z) - Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs [18.242110417706]
This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model.
We show the optimality of this approach for fine-tuning tasks under certain conditions.
Our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour.
arXiv Detail & Related papers (2024-05-05T00:08:00Z) - Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized
Language Model Finetuning Using Shared Randomness [86.61582747039053]
Language model training in distributed settings is limited by the communication cost of gradient exchanges.
We extend recent work using shared randomness to perform distributed fine-tuning with low bandwidth.
arXiv Detail & Related papers (2023-06-16T17:59:51Z) - Value function estimation using conditional diffusion models for control [62.27184818047923]
We propose a simple algorithm called Diffused Value Function (DVF).
It learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model.
We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers.
arXiv Detail & Related papers (2023-06-09T18:40:55Z) - Unifying Synergies between Self-supervised Learning and Dynamic
Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and gated sub-network from scratch in an SSL setting.
The co-evolution of the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z) - Distributed Adversarial Training to Robustify Deep Neural Networks at
Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach, known as adversarial training (AT), has been shown to improve model robustness.
We propose a large-batch adversarial training framework implemented over multiple machines.
arXiv Detail & Related papers (2022-06-13T15:39:43Z) - Singularity: Planet-Scale, Preemptible, Elastic Scheduling of AI
Workloads [12.117736592836506]
We present Singularity, Microsoft's globally distributed scheduling service for deep learning training and inference workloads.
At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads.
We show that the resulting efficiency and reliability gains with Singularity are achieved with negligible impact on the steady-state performance.
arXiv Detail & Related papers (2022-02-16T04:02:10Z) - Accelerating Deep Learning with Dynamic Data Pruning [0.0]
Deep learning has become prohibitively costly, requiring access to powerful computing systems to train state-of-the-art networks.
Previous work, such as forget scores and GraNd/EL2N scores, identifies important samples within a full dataset and prunes the remaining samples, thereby reducing the number of iterations per epoch.
We propose two algorithms, based on reinforcement learning techniques, to dynamically prune samples and achieve even higher accuracy than the random dynamic method.
arXiv Detail & Related papers (2021-11-24T16:47:34Z) - Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z) - DaSGD: Squeezing SGD Parallelization Performance in Distributed Training
Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations while waiting for gradients to be exchanged and averaged.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead (a minimal sketch of the delayed-averaging idea appears after this list).
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
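As mentioned in the DaSGD entry above, delayed averaging lets workers keep computing while gradients are in flight. The sketch below is a hedged, single-process illustration of that general idea, not the DaSGD implementation: each replica steps with its fresh local gradient immediately and, one iteration later, swaps that stale local contribution for the stale global average. The quadratic objective, the one-step delay, and all parameter values are assumptions made for the example.

```python
# Hypothetical sketch of delayed gradient averaging (not the DaSGD code): a
# worker applies its own gradient right away and folds in the globally averaged
# gradient one step late, so communication can overlap with computation.
import numpy as np

rng = np.random.default_rng(1)
dim, num_workers, lr, steps = 10, 4, 0.1, 200
w_star = rng.normal(size=dim)  # optimum of a toy quadratic objective

def local_grad(w):
    # Gradient of ||w - w_star||^2 / 2 plus noise standing in for minibatch noise.
    return (w - w_star) + 0.05 * rng.normal(size=dim)

models = [np.zeros(dim) for _ in range(num_workers)]  # one replica per worker
prev_grads = None  # gradients whose global average is still "in flight"

for step in range(steps):
    grads = [local_grad(models[k]) for k in range(num_workers)]
    stale_avg = None if prev_grads is None else np.mean(prev_grads, axis=0)
    for k in range(num_workers):
        models[k] = models[k] - lr * grads[k]  # step immediately, no waiting
        if stale_avg is not None:
            # The previous step's average has now arrived: replace the stale
            # local gradient (applied last step) with the stale global average.
            models[k] = models[k] + lr * (prev_grads[k] - stale_avg)
    prev_grads = grads

print("replica 0 distance to optimum:", np.linalg.norm(models[0] - w_star))
```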