DropCompute: simple and more robust distributed synchronous training via
compute variance reduction
- URL: http://arxiv.org/abs/2306.10598v2
- Date: Sun, 24 Sep 2023 07:15:29 GMT
- Title: DropCompute: simple and more robust distributed synchronous training via
compute variance reduction
- Authors: Niv Giladi, Shahar Gottlieb, Moran Shkolnik, Asaf Karnieli, Ron
Banner, Elad Hoffer, Kfir Yehuda Levy, Daniel Soudry
- Abstract summary: We study a typical scenario in which workers are straggling due to variability in compute time.
We propose a simple yet effective decentralized method to reduce the variation among workers and thus improve the robustness of synchronous training.
- Score: 30.46681332866494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: Distributed training is essential for large scale training of
deep neural networks (DNNs). The dominant methods for large scale DNN training
are synchronous (e.g. All-Reduce), but these require waiting for all workers in
each step. Thus, these methods are limited by the delays caused by straggling
workers. Results: We study a typical scenario in which workers are straggling
due to variability in compute time. We find an analytical relation between
compute time properties and scalability limitations, caused by such straggling
workers. With these findings, we propose a simple yet effective decentralized
method to reduce the variation among workers and thus improve the robustness of
synchronous training. This method can be integrated with the widely used
All-Reduce. Our findings are validated on large-scale training tasks using 200
Gaudi Accelerators.
Related papers
- Fast and Straggler-Tolerant Distributed SGD with Reduced Computation
Load [11.069252535469644]
Optimization procedures like stochastic gradient descent (SGD) can be leveraged to mitigate the effect of unresponsive or slow workers called stragglers.
This can be done by only waiting for a subset of the workers to finish their computation at each iteration of the algorithm.
We construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm.
arXiv Detail & Related papers (2023-04-17T20:12:18Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- Efficient Distributed Machine Learning via Combinatorial Multi-Armed
Bandits [23.289979018463406]
We consider a distributed gradient descent problem where a main node distributes gradient calculations among $n$ workers, of which at most $b \leq n$ can be utilized in parallel.
By assigning tasks to all the workers and waiting only for the $k$ fastest ones, the main node can trade off the error of the algorithm against its runtime by gradually increasing $k$ as the algorithm evolves.
This strategy, referred to as adaptive k sync, can incur additional costs since it ignores the computational efforts of slow workers.
We propose a cost-efficient scheme that assigns tasks only to $k$
arXiv Detail & Related papers (2022-02-16T19:18:19Z)
- RelaySum for Decentralized Deep Learning on Heterogeneous Data [71.36228931225362]
In decentralized machine learning, workers compute model updates on their local data.
Because the workers only communicate with few neighbors without central coordination, these updates propagate progressively over the network.
This paradigm enables distributed training on networks without all-to-all connectivity, helping to protect data privacy as well as to reduce the communication cost of distributed training in data centers.
arXiv Detail & Related papers (2021-10-08T14:55:32Z)
- Distribution Mismatch Correction for Improved Robustness in Deep Neural
Networks [86.42889611784855]
Normalization methods increase the vulnerability with respect to noise and input corruptions.
We propose an unsupervised non-parametric distribution correction method that adapts the activation distribution of each layer.
In our experiments, we empirically show that the proposed method effectively reduces the impact of intense image corruptions.
arXiv Detail & Related papers (2021-10-05T11:36:25Z)
- Distributed Optimization using Heterogeneous Compute Systems [0.0]
We consider the training of deep neural networks on a distributed system of workers with varying compute power.
A naive implementation of synchronous distributed training will result in the faster workers waiting for the slowest worker to complete processing.
We propose to dynamically adjust the data assigned for each worker during the training.
arXiv Detail & Related papers (2021-10-03T11:21:49Z)
- What training reveals about neural network complexity [80.87515604428346]
This work explores the hypothesis that the complexity of the function a deep neural network (NN) is learning can be deduced by how fast its weights change during training.
Our results support the hypothesis that good training behavior can be a useful bias towards good generalization.
arXiv Detail & Related papers (2021-06-08T08:58:00Z)
- A Low Complexity Decentralized Neural Net with Centralized Equivalence
using Layer-wise Learning [49.15799302636519]
We design a low complexity decentralized learning algorithm to train a recently proposed large neural network in distributed processing nodes (workers).
In our setup, the training data is distributed among the workers but is not shared in the training process due to privacy and security concerns.
We show that it is possible to achieve equivalent learning performance as if the data is available in a single place.
arXiv Detail & Related papers (2020-09-29T13:08:12Z)
- PSO-PS: Parameter Synchronization with Particle Swarm Optimization for
Distributed Training of Deep Neural Networks [16.35607080388805]
We propose a new algorithm that integrates Particle Swarm Optimization into the distributed training process of Deep Neural Networks (DNNs).
In the proposed algorithm, a computing work is encoded by a particle, and the weights of DNNs and the training loss are modeled by the particle attributes.
At each synchronization stage, the weights are updated by PSO from the sub weights gathered from all workers, instead of averaging the weights or the gradients.
arXiv Detail & Related papers (2020-09-06T05:18:32Z)
- DBS: Dynamic Batch Size For Distributed Deep Neural Network Training [19.766163856388694]
We propose the Dynamic Batch Size (DBS) strategy for the distributed training of Deep Neural Networks (DNNs).
Specifically, the performance of each worker is first evaluated based on the previous epoch, and then the batch size and dataset partition are dynamically adjusted.
The experimental results indicate that the proposed strategy can fully utilize the performance of the cluster, reduce the training time, and remain robust to disturbance from irrelevant tasks.
arXiv Detail & Related papers (2020-07-23T07:31:55Z)
- Straggler-aware Distributed Learning: Communication Computation Latency
Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as straggler/non-straggler.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
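Two of the entries above (Distributed Optimization using Heterogeneous Compute Systems, and DBS) describe adjusting each worker's share of the data to match its speed. The sketch below shows one simple way this could be done, under the assumption that each worker's throughput (samples per second) has been measured in a previous epoch. The function name `allocate_batches` and the proportional rule are illustrative assumptions, not the schemes from those papers.

```python
def allocate_batches(global_batch, throughputs):
    """Split `global_batch` samples across workers in proportion to throughput.

    `throughputs` holds one measured samples-per-second value per worker
    (assumed to come from the previous epoch). Returns integer batch sizes
    that sum exactly to `global_batch`.
    """
    total = sum(throughputs)
    shares = [global_batch * t / total for t in throughputs]
    sizes = [int(s) for s in shares]  # round each share down
    # Give the leftover samples to the workers with the largest fractional parts.
    remainder = global_batch - sum(sizes)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - sizes[i], reverse=True)
    for i in order[:remainder]:
        sizes[i] += 1
    return sizes

if __name__ == "__main__":
    # A worker that is twice as fast receives twice as many samples.
    print(allocate_batches(100, [1.0, 1.0, 2.0]))
```

With sizes proportional to throughput, each worker needs roughly the same time to process its share, so faster workers spend less time waiting at the synchronization point.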
This list is automatically generated from the titles and abstracts of the papers in this site.