Dynamic backup workers for parallel machine learning
- URL: http://arxiv.org/abs/2004.14696v2
- Date: Mon, 25 Jan 2021 01:35:38 GMT
- Title: Dynamic backup workers for parallel machine learning
- Authors: Chuan Xu, Giovanni Neglia, Nicola Sebastianelli
- Abstract summary: We propose an algorithm that dynamically decides the number of backup workers during the training process to maximize the convergence speed at each iteration.
Our experiments show that DBW 1) removes the necessity to tune $b$ by preliminary time-consuming experiments, and 2) makes the training up to a factor $3$ faster than the optimal static configuration.
- Score: 10.813576865492767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The most popular framework for distributed training of machine learning
models is the (synchronous) parameter server (PS). This paradigm consists of
$n$ workers, which iteratively compute updates of the model parameters, and a
stateful PS, which waits and aggregates all updates to generate a new estimate
of model parameters and sends it back to the workers for a new iteration.
Transient computation slowdowns or transmission delays can intolerably lengthen
the time of each iteration. An efficient way to mitigate this problem is to let
the PS wait only for the fastest $n-b$ updates, before generating the new
parameters. The slowest $b$ workers are called backup workers. The optimal
number $b$ of backup workers depends on the cluster configuration and workload,
but also (as we show in this paper) on the hyper-parameters of the learning
algorithm and the current stage of the training. We propose DBW, an algorithm
that dynamically decides the number of backup workers during the training
process to maximize the convergence speed at each iteration. Our experiments
show that DBW 1) removes the necessity to tune $b$ by preliminary
time-consuming experiments, and 2) makes the training up to a factor $3$ faster
than the optimal static configuration.
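A minimal sketch of the synchronous parameter-server loop with backup workers described in the abstract, assuming a toy quadratic loss, simulated random worker delays, and hypothetical names (worker_gradient, ps_iteration); it illustrates only the "wait for the fastest $n-b$ updates" rule, not the authors' DBW rule for choosing $b$:

```python
# Sketch: synchronous parameter server (PS) that waits only for the fastest
# n - b worker updates per iteration; the b slowest are the backup workers.
# The toy objective, delay model, and names are illustrative assumptions.
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

N_WORKERS = 8   # n
BACKUP = 2      # b: updates from the b slowest workers are dropped each iteration
LR = 0.1        # learning rate

def worker_gradient(theta):
    """One worker: random compute/transmission delay, then a noisy gradient
    of the toy loss f(theta) = theta**2 (true gradient: 2 * theta)."""
    time.sleep(random.expovariate(20.0) + (0.5 if random.random() < 0.1 else 0.0))
    return 2.0 * theta + random.gauss(0.0, 0.1)

def ps_iteration(theta, pool, b):
    """PS sends theta to all n workers and aggregates the fastest n - b gradients."""
    futures = [pool.submit(worker_gradient, theta) for _ in range(N_WORKERS)]
    grads = []
    for fut in as_completed(futures):
        grads.append(fut.result())
        if len(grads) == N_WORKERS - b:  # enough updates: ignore the stragglers
            break
    return theta - LR * sum(grads) / len(grads)

if __name__ == "__main__":
    theta = 5.0
    with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
        for _ in range(30):
            theta = ps_iteration(theta, pool, BACKUP)
    print(f"final theta ~ {theta:.3f}")  # converges toward 0
```

With a static configuration, BACKUP stays fixed for the whole run; DBW would instead recompute the number of backup workers at every iteration to maximize the convergence speed.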
Related papers
- $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained
Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z) - MAP: Memory-aware Automated Intra-op Parallel Training For Foundation
Models [15.256207550970501]
We introduce MAP, a compiler built upon PyTorch to implement Memory-aware Automated Parallelization.
Compared with existing methods, MAP provides an easy-to-use symbolic profiler to generate memory and computing statistics of an arbitrary PyTorch model.
arXiv Detail & Related papers (2023-02-06T07:22:49Z) - Dimensionality Reduced Training by Pruning and Freezing Parts of a Deep
Neural Network, a Survey [69.3939291118954]
State-of-the-art deep learning models have a parameter count that reaches into the billions. Training, storing and transferring such models is energy and time consuming, thus costly.
Model compression lowers storage and transfer costs, and can further make training more efficient by decreasing the number of computations in the forward and/or backward pass.
This work is a survey on methods which reduce the number of trained weights in deep learning models throughout the training.
arXiv Detail & Related papers (2022-05-17T05:37:08Z) - Efficient Distributed Machine Learning via Combinatorial Multi-Armed
Bandits [23.289979018463406]
We consider a distributed gradient descent problem where a main node distributes gradient calculations among $n$ workers, of which at most $b \leq n$ can be utilized in parallel.
By assigning tasks to all the workers and waiting only for the $k$ fastest ones, the main node can trade off the error of the algorithm against its runtime by gradually increasing $k$ as the algorithm evolves.
This strategy, referred to as adaptive $k$-sync, can incur additional costs since it ignores the computational efforts of slow workers.
We propose a cost-efficient scheme that assigns tasks only to $k$ workers and gradually increases $k$ (a minimal schedule sketch follows the related-papers list below).
arXiv Detail & Related papers (2022-02-16T19:18:19Z) - Optimizer Fusion: Efficient Training with Better Locality and
Parallelism [11.656318345362804]
Experimental results show up to a 20% reduction in training time across various configurations.
Since our methods do not alter the algorithm, they can be used as a general "plug-in" technique to the training process.
arXiv Detail & Related papers (2021-04-01T03:44:13Z) - Straggler-Resilient Distributed Machine Learning with Dynamic Backup
Workers [9.919012793724628]
We propose a fully distributed algorithm to determine the number of backup workers for each worker.
Our algorithm achieves a linear speedup for convergence (i.e., convergence performance increases linearly with respect to the number of workers).
arXiv Detail & Related papers (2021-02-11T21:39:53Z) - Timely Communication in Federated Learning [65.1253801733098]
We consider a global learning framework in which a parameter server (PS) trains a global model by using $n$ clients without actually storing the client data centrally at a cloud server.
Under the proposed scheme, at each iteration, the PS waits for $m$ available clients and sends them the current model.
We find the average age of information experienced by each client and numerically characterize the age-optimal $m$ and $k$ values for a given $n$.
arXiv Detail & Related papers (2020-12-31T18:52:08Z) - Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z) - Tasks, stability, architecture, and compute: Training more effective
learned optimizers, and using them to train themselves [53.37905268850274]
We introduce a new, hierarchical, neural-network-parameterized learned optimizer with access to additional features such as validation loss to enable automatic regularization.
Most learned optimizers have been trained on only a single task, or a small number of tasks.
We train ours on thousands of tasks, making use of orders of magnitude more compute, resulting in optimizers that generalize better to unseen tasks.
arXiv Detail & Related papers (2020-09-23T16:35:09Z) - Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of
Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
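The adaptive $k$-sync strategy summarized in the combinatorial multi-armed bandits entry above can be sketched as a simple schedule that grows $k$ with the iteration count; the schedule below is an illustrative assumption, not the one from the cited paper. At each iteration the main node would then wait only for the $k(t)$ fastest of the $n$ workers, i.e., run the backup-worker loop sketched earlier with $b = n - k(t)$.

```python
# Illustrative adaptive k-sync schedule: start by waiting for only a few workers
# and gradually wait for more of them as training progresses (assumed schedule).
def k_schedule(iteration, n, k_min=2, ramp=100):
    """Grow k linearly from k_min to n over the first `ramp` iterations."""
    return min(n, k_min + iteration * (n - k_min) // ramp)

if __name__ == "__main__":
    for t in (0, 25, 50, 75, 100):
        print(t, k_schedule(t, n=8))  # k grows 2, 3, 5, 6, 8
```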
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.