ABS-SGD: A Delayed Synchronous Stochastic Gradient Descent Algorithm
with Adaptive Batch Size for Heterogeneous GPU Clusters
- URL: http://arxiv.org/abs/2308.15164v1
- Date: Tue, 29 Aug 2023 09:46:52 GMT
- Title: ABS-SGD: A Delayed Synchronous Stochastic Gradient Descent Algorithm
with Adaptive Batch Size for Heterogeneous GPU Clusters
- Authors: Xin Zhou, Ling Chen, Houming Wu
- Abstract summary: We propose a delayed synchronous stochastic gradient descent algorithm with adaptive batch size (ABS-SGD) for heterogeneous GPU clusters.
In ABS-SGD, workers perform global synchronization to accumulate delayed gradients and use the accumulated delayed gradients to update parameters.
Extensive experiments in three types of heterogeneous clusters demonstrate that ABS-SGD can make full use of computational resources.
- Score: 9.885668723959125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the size of models and datasets grows, it has become increasingly common
to train models in parallel. However, existing distributed stochastic gradient
descent (SGD) algorithms suffer from insufficient utilization of computational
resources and poor convergence in heterogeneous clusters. In this paper, we
propose a delayed synchronous SGD algorithm with adaptive batch size (ABS-SGD)
for heterogeneous GPU clusters. In ABS-SGD, workers perform global
synchronization to accumulate delayed gradients and use the accumulated delayed
gradients to update parameters. While workers perform global synchronization for
the delayed gradients, they compute the next batch without specifying its size in
advance; this computation lasts until the next global synchronization starts, so
computational resources are fully utilized. Since the gradient delay is only one
iteration, the stale gradient problem can be alleviated. We theoretically prove the convergence of ABS-SGD in
heterogeneous clusters. Extensive experiments in three types of heterogeneous
clusters demonstrate that ABS-SGD can make full use of computational resources
and accelerate model convergence: when training the ResNet18 network with 4
workers, ABS-SGD increases the convergence speed by 1.30x on average compared
with the best baseline algorithm.
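The abstract describes overlapping the synchronization of one-iteration-delayed gradients with the computation of the next batch, whose size is not fixed in advance but grows until the synchronization finishes. The snippet below is a minimal sketch of that idea, assuming a PyTorch-style data-parallel setup with an already initialized process group; the function `abs_sgd_step`, the chunked accumulation loop, and names such as `data_iter` are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the ABS-SGD idea (not the paper's reference code).
# Assumes torch.distributed is initialized and `model`, `loss_fn`, `data_iter`
# are provided by the caller.
import torch
import torch.distributed as dist

def abs_sgd_step(model, loss_fn, data_iter, prev_grads, lr=0.1):
    """One iteration: synchronize last iteration's (delayed) gradients while
    accumulating gradients for an adaptively sized batch."""
    # 1) Non-blocking all-reduce of the one-iteration-delayed gradients.
    handles = [dist.all_reduce(g, async_op=True) for g in prev_grads]

    # 2) While communication is in flight, keep computing gradients on small
    #    chunks; the effective batch size adapts to how long the sync takes.
    new_grads = [torch.zeros_like(p) for p in model.parameters()]
    chunks = 0
    while chunks == 0 or not all(h.is_completed() for h in handles):
        x, y = next(data_iter)                  # one mini-batch chunk
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for acc, p in zip(new_grads, model.parameters()):
            acc.add_(p.grad)
        chunks += 1

    # 3) Update parameters with the synchronized delayed gradients
    #    (staleness is exactly one iteration).
    for h in handles:
        h.wait()
    with torch.no_grad():
        for p, g in zip(model.parameters(), prev_grads):
            p.add_(g, alpha=-lr / dist.get_world_size())

    # The freshly accumulated gradients become next iteration's delayed ones.
    return [g.div_(chunks) for g in new_grads]
```

On the first iteration, `prev_grads` can simply be zero tensors (or the gradients from one ordinary synchronous step); afterwards each call feeds its return value back in as `prev_grads`.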
Related papers
- Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning.
This paper characterizes the run time and staleness of distributed SGD using stochastic delay differential equations (SDDEs) and an approximation of gradient arrivals.
Interestingly, it is shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness.
arXiv Detail & Related papers (2024-06-17T02:56:55Z)
- AsGrad: A Sharp Unified Analysis of Asynchronous-SGD Algorithms [45.90015262911875]
We analyze asynchronous-type algorithms for distributed SGD in the heterogeneous setting.
As a by-product of our analysis, we also demonstrate convergence guarantees for gradient-type algorithms such as SGD with random reshuffling.
arXiv Detail & Related papers (2023-10-31T13:44:53Z)
- Towards Understanding the Generalizability of Delayed Stochastic Gradient Descent [63.43247232708004]
Stochastic gradient descent performed in an asynchronous manner plays a crucial role in training large-scale machine learning models.
Existing generalization error bounds are rather pessimistic and cannot reveal the correlation between asynchronous delays and generalization.
Our theoretical results indicate that asynchronous delays reduce the generalization error of the delayed SGD algorithm.
arXiv Detail & Related papers (2023-08-18T10:00:27Z)
- Accelerating Parallel Stochastic Gradient Descent via Non-blocking Mini-batches [3.736244431175932]
Non-blocking SGD can address the straggler problem in a heterogeneous environment.
Non-blocking SGD takes up to 2x less time to reach the same training loss in a heterogeneous environment.
arXiv Detail & Related papers (2022-11-02T05:25:01Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We analyze a training algorithm for distributed computation workers with varying communication frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1}\right)$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z)
- Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum under the same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z)
- Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers.
A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers.
Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers.
We propose a novel dynamic GC scheme, which assigns redundant data to workers to acquire the flexibility to choose from among a set of possible codes depending on the past straggling behavior.
arXiv Detail & Related papers (2021-03-01T18:51:29Z)
- Gradient Coding with Dynamic Clustering for Straggler Mitigation [57.9123881133818]
The proposed scheme, gradient coding with dynamic clustering (GC-DC), regulates the number of straggling workers in each cluster based on the straggler behavior in the previous iteration.
We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
arXiv Detail & Related papers (2020-11-03T18:52:15Z)
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/backward propagation while waiting for gradients aggregated from all workers.
DaSGD parallelizes SGD and forward/backward propagation to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
- Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD [32.03967072200476]
We propose an algorithmic approach named Overlap-Local-SGD (and its momentum variant).
We achieve this by adding an anchor model on each node.
After multiple local updates, locally trained models are pulled back towards the anchor model rather than communicating with other workers (a minimal sketch of this pull-back step appears after this list).
arXiv Detail & Related papers (2020-02-21T20:33:49Z)
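The Overlap Local-SGD entry above describes keeping an anchor model on each node and pulling locally trained parameters back toward it after several local updates, instead of synchronizing with other workers at that point. Below is a minimal sketch of that pull-back step; the pull coefficient `alpha`, the number of local steps, and the helper names are illustrative assumptions rather than details from the paper.

```python
# Illustrative sketch of the anchor-model pull-back idea (not the authors' code).
import numpy as np

def local_sgd(params, grad_fn, lr=0.01, num_local_steps=4):
    """Run several local SGD steps without any communication."""
    for _ in range(num_local_steps):
        params = params - lr * grad_fn(params)
    return params

def pull_back(params, anchor, alpha=0.5):
    """Pull locally trained parameters toward the anchor model instead of
    synchronously averaging with the other workers."""
    return params - alpha * (params - anchor)

# Toy usage on a quadratic objective; in practice `anchor` would be refreshed
# by communication that overlaps with the local computation.
rng = np.random.default_rng(0)
grad_fn = lambda w: 2.0 * w + 0.1 * rng.standard_normal(w.shape)
anchor = np.zeros(10)
params = pull_back(local_sgd(anchor.copy(), grad_fn), anchor)
```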