Efficient Distributed Machine Learning via Combinatorial Multi-Armed
Bandits
- URL: http://arxiv.org/abs/2202.08302v1
- Date: Wed, 16 Feb 2022 19:18:19 GMT
- Title: Efficient Distributed Machine Learning via Combinatorial Multi-Armed
Bandits
- Authors: Maximilian Egger, Rawad Bitar, Antonia Wachter-Zeh and Deniz
G\"und\"uz
- Abstract summary: We consider a distributed gradient descent problem where a main node distributes gradient calculations among $n$ workers from which at most $b \leq n$ can be utilized in parallel.
By assigning tasks to all the workers and waiting only for the $k$ fastest ones, the main node can trade-off the error of the algorithm with its runtime by gradually increasing $k$ as the algorithm evolves.
This strategy, referred to as adaptive k sync, can incur additional costs since it ignores the computational efforts of slow workers.
We propose a cost-efficient scheme that assigns tasks only to $k$ workers and gradually increases $k$.
- Score: 23.289979018463406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the distributed stochastic gradient descent problem, where a main
node distributes gradient calculations among $n$ workers from which at most $b
\leq n$ can be utilized in parallel. By assigning tasks to all the workers and
waiting only for the $k$ fastest ones, the main node can trade-off the error of
the algorithm with its runtime by gradually increasing $k$ as the algorithm
evolves. However, this strategy, referred to as adaptive k sync, can incur
additional costs since it ignores the computational efforts of slow workers. We
propose a cost-efficient scheme that assigns tasks only to $k$ workers and
gradually increases $k$. As the response times of the available workers are
unknown to the main node a priori, we utilize a combinatorial multi-armed
bandit model to learn which workers are the fastest while assigning gradient
calculations, and to minimize the effect of slow workers. Assuming that the
mean response times of the workers are independent and exponentially
distributed with different means, we give empirical and theoretical guarantees
on the regret of our strategy, i.e., the extra time spent to learn the mean
response times of the workers. Compared to adaptive k sync, our scheme achieves
significantly lower errors for the same computational effort, at the cost of a longer runtime.
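As a rough illustration of the bandit component, the sketch below simulates a lower-confidence-bound rule that assigns each round's gradient tasks to the $k$ workers with the smallest optimistic estimates of their mean response times, under the abstract's assumption of independent exponential response times. The confidence bonus, constants, and variable names are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, T = 10, 3, 2000                   # workers, tasks assigned per round, rounds
true_means = rng.uniform(0.5, 3.0, n)   # unknown mean response times (exponential model)

counts = np.zeros(n)                    # how often each worker has been assigned a task
mean_est = np.zeros(n)                  # empirical mean response time per worker

for t in range(1, T + 1):
    # Optimistic (lower) confidence bound on the mean response time: prefer workers
    # that look fast or have not been explored enough yet.
    lcb = np.where(
        counts > 0,
        mean_est - np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1)),
        -np.inf,                        # force every worker to be tried at least once
    )
    chosen = np.argsort(lcb)[:k]        # combinatorial action: the k most promising workers

    # Observe the response times of the chosen workers (exponentially distributed).
    responses = rng.exponential(true_means[chosen])

    # Incremental update of the empirical means.
    counts[chosen] += 1
    mean_est[chosen] += (responses - mean_est[chosen]) / counts[chosen]

print("truly fastest workers:", np.argsort(true_means)[:k])
print("estimated fastest    :", np.argsort(mean_est)[:k])
```

The confidence bound is subtracted rather than added because the goal is to minimize response time, so optimism means guessing that a worker might be faster than its empirical mean suggests.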
Related papers
- DropCompute: simple and more robust distributed synchronous training via
compute variance reduction [30.46681332866494]
We study a typical scenario in which workers are straggling due to variability in compute time.
We propose a simple yet effective decentralized method to reduce the variation among workers and thus improve the robustness of synchronous training.
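A minimal sketch of the idea of bounding per-worker compute time, assuming a micro-batch setting: the worker stops accumulating gradients once a local time budget is spent and contributes what it finished. The budget rule and function names are hypothetical and only in the spirit of the method described above.

```python
import time
import numpy as np

def budgeted_worker_step(micro_batches, grad_fn, budget_s):
    """Accumulate micro-batch gradients until a local time budget runs out,
    then return the partial average instead of straggling (illustrative only)."""
    start, acc, done = time.monotonic(), 0.0, 0
    for mb in micro_batches:
        acc = acc + grad_fn(mb)          # works for scalars or numpy arrays
        done += 1
        if time.monotonic() - start > budget_s:
            break                        # drop the remaining micro-batches
    return acc / max(done, 1), done      # partial gradient + micro-batches actually used

# Toy usage: 8 micro-batches, each "gradient" is just the batch mean.
batches = [np.random.default_rng(i).standard_normal(32) for i in range(8)]
g, used = budgeted_worker_step(batches, lambda b: b.mean(), budget_s=0.01)
print(f"used {used}/8 micro-batches")
```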
arXiv Detail & Related papers (2023-06-18T16:55:31Z)
- Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load [11.069252535469644]
Optimization procedures like stochastic gradient descent (SGD) can be leveraged to mitigate the effect of unresponsive or slow workers, called stragglers.
This can be done by only waiting for a subset of the workers to finish their computation at each iteration of the algorithm.
We construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm.
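The toy simulation below illustrates the wait-for-the-$k$-fastest mechanism with a $k$ that grows over iterations; the schedule for $k$, the step size, and the quadratic objective are placeholder assumptions, and the paper's adaptation of the per-worker computation load is not modeled.

```python
import numpy as np

rng = np.random.default_rng(1)

def fastest_k_step(w, k, grad_fn, mean_delays, lr=0.1):
    """One synchronous iteration that averages only the k earliest gradients.

    Toy model: every worker computes a stochastic gradient at w, response times
    are exponential, and the iteration time is the k-th fastest response.
    """
    delays = rng.exponential(mean_delays)          # simulated per-worker response times
    fastest = np.argsort(delays)[:k]               # indices of the k quickest workers
    grads = np.stack([grad_fn(w) for _ in fastest])
    return w - lr * grads.mean(axis=0), delays[fastest].max()

# Toy objective f(w) = ||w||^2 / 2 with additive gradient noise.
grad_fn = lambda w: w + 0.05 * rng.standard_normal(w.shape)

n = 10
mean_delays = rng.uniform(0.5, 3.0, n)             # heterogeneous worker speeds
w, wall_clock = np.ones(5), 0.0
for it in range(200):
    k = min(n, 1 + it // 50)                       # gradually increase k (placeholder schedule)
    w, step_time = fastest_k_step(w, k, grad_fn, mean_delays)
    wall_clock += step_time

print("final ||w||:", np.linalg.norm(w), " simulated wall-clock:", wall_clock)
```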
arXiv Detail & Related papers (2023-04-17T20:12:18Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
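The $O(k \log(d))$ communication cost comes from sending a sketch instead of the full $d$-dimensional gradient. Below is a minimal one-row count-sketch compressor as a generic illustration of gradient sketching; the actual operator used by SketchedAMSGrad and its coupling with the Adam-type update are more involved.

```python
import numpy as np

rng = np.random.default_rng(2)

d, k = 1_000, 64                         # gradient dimension, sketch size (k << d)

# Shared randomness between workers and server: bucket and sign hash functions.
bucket = rng.integers(0, k, size=d)      # which sketch entry each coordinate maps to
sign = rng.choice([-1.0, 1.0], size=d)   # Rademacher signs

def sketch(g):
    """Compress a d-dimensional gradient into a k-dimensional count sketch."""
    s = np.zeros(k)
    np.add.at(s, bucket, sign * g)       # signed sums within each bucket
    return s

def unsketch(s):
    """Unbiased (but noisy) estimate of the gradient recovered from its sketch."""
    return sign * s[bucket]

g = rng.standard_normal(d)               # a worker's local gradient
g_hat = unsketch(sketch(g))              # server-side estimate from the k-dim message
print("relative recovery error:", np.linalg.norm(g_hat - g) / np.linalg.norm(g))
```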
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning [77.22019100456595]
We show a training algorithm for distributed computation workers with varying communication frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1})$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z)
- Coded Computation across Shared Heterogeneous Workers with Communication Delay [42.50248255900814]
We consider a multi-worker distributed computing scenario, where multiple matrix multiplication tasks are encoded and allocated to workers for parallel computation.
We propose worker assignment, resource allocation, and load allocation algorithms under both dedicated and fractional worker assignment policies, where each worker can process encoded tasks.
We show that the proposed algorithms can reduce the task completion delay compared to benchmarks, and observe that dedicated and fractional worker assignment policies have different scopes of application.
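As background, the snippet below sketches the basic MDS-coded matrix-vector multiplication that such coded-computation schemes build on: any $k$ of the $n$ coded results suffice to recover the product. The shared heterogeneous workers, communication delays, and multi-job allocation studied in the paper are not modeled.

```python
import numpy as np

rng = np.random.default_rng(3)

# Goal: compute y = A @ x with n workers while tolerating n - k stragglers.
n, k = 5, 3
A = rng.standard_normal((6, 4))          # number of rows divisible by k
x = rng.standard_normal(4)

# Split A into k row blocks and encode them with an n x k Vandermonde (MDS) matrix.
blocks = np.split(A, k)                                   # k blocks of shape (2, 4)
G = np.vander(np.arange(1, n + 1), k, increasing=True)    # any k rows are invertible
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# Each worker i returns coded[i] @ x; suppose only these k workers respond in time.
survivors = [0, 2, 4]
partial = np.stack([coded[i] @ x for i in survivors])     # shape (k, 2)

# Decode: invert the corresponding k x k submatrix of G to recover the block products.
decoded = np.linalg.solve(G[survivors], partial)          # row j equals blocks[j] @ x
y_hat = decoded.reshape(-1)

assert np.allclose(y_hat, A @ x)
```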
arXiv Detail & Related papers (2021-09-23T09:40:54Z)
- Asynchronous Stochastic Optimization Robust to Arbitrary Delays [54.61797739710608]
We consider stochastic optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for an arbitrary delay $d_t$.
Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
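A minimal toy of the delayed-gradient update model described above, assuming a fixed step size and a placeholder heavy-tailed delay distribution (the paper's algorithm instead adapts to the observed delays):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy objective f(x) = ||x||^2 / 2; the gradient oracle is evaluated at a stale iterate.
grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)

x = np.ones(5)
history = [x.copy()]                        # iterates x_0, x_1, ...
eta = 0.05

for t in range(1, 500):
    d_t = min(t - 1, int(rng.pareto(1.5)))  # arbitrary heavy-tailed delay, clipped to history
    stale_x = history[t - 1 - d_t]          # iterate from d_t steps ago
    x = x - eta * grad(stale_x)             # update with the stale gradient
    history.append(x.copy())

print("final ||x||:", np.linalg.norm(x))
```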
arXiv Detail & Related papers (2021-06-22T15:50:45Z)
- Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers.
A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers.
Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers.
We propose a novel dynamic gradient coding (GC) scheme, which assigns redundant data to workers to gain the flexibility to choose from a set of possible codes depending on the past straggling behavior.
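For context, here is a minimal static fractional-repetition gradient code, one of the constructions such schemes choose among: each group of data partitions is replicated on $s + 1$ workers, so the full gradient is decodable from any $n - s$ responses. The dynamic clustering and code selection proposed in the paper are not shown.

```python
import numpy as np

rng = np.random.default_rng(5)

# Static fractional-repetition gradient coding: n workers tolerate any s stragglers
# by replicating each group of data partitions on s + 1 workers.
n, s = 6, 2
g_parts = rng.standard_normal((n, 4))       # true partial gradients, one per data partition
num_groups = n // (s + 1)
group_of_worker = np.repeat(np.arange(num_groups), s + 1)   # workers 0-2 -> group 0, 3-5 -> group 1

def coded_msg(w):
    """Worker w returns the sum of the partial gradients of its group's partitions."""
    g = group_of_worker[w]
    return g_parts[g * (s + 1):(g + 1) * (s + 1)].sum(axis=0)

# Any s stragglers leave at least one survivor in every group of s + 1 workers.
stragglers = {1, 2}
received = {w: coded_msg(w) for w in range(n) if w not in stragglers}

# Decode: keep one message per group and add them up.
per_group = {}
for w, msg in received.items():
    per_group.setdefault(group_of_worker[w], msg)
full_gradient = sum(per_group.values())

assert np.allclose(full_gradient, g_parts.sum(axis=0))
```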
arXiv Detail & Related papers (2021-03-01T18:51:29Z)
- Straggler-Resilient Distributed Machine Learning with Dynamic Backup Workers [9.919012793724628]
We propose a fully distributed algorithm to determine the number of backup workers for each worker.
Our algorithm achieves a linear speedup for convergence (i.e., convergence performance increases linearly with respect to the number of workers).
arXiv Detail & Related papers (2021-02-11T21:39:53Z)
- Straggler-aware Distributed Learning: Communication-Computation Latency Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as either stragglers or non-stragglers.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
- Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
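Since the random-top-k operator is defined precisely only in the paper, the sketch below shows just the two standard sparsification building blocks it relates to, top-k and (unbiased) random-k selection; it is not the paper's operator.

```python
import numpy as np

rng = np.random.default_rng(7)

def top_k(g, k):
    """Keep the k largest-magnitude coordinates, zero out the rest."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def random_k(g, k):
    """Keep k uniformly random coordinates, rescaled to stay unbiased."""
    out = np.zeros_like(g)
    idx = rng.choice(g.size, size=k, replace=False)
    out[idx] = g[idx] * (g.size / k)
    return out

g = rng.standard_normal(100)
print("top-k error   :", np.linalg.norm(g - top_k(g, 10)))
print("random-k error:", np.linalg.norm(g - random_k(g, 10)))
```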
arXiv Detail & Related papers (2020-01-27T08:23:58Z)