Accelerating Parallel Stochastic Gradient Descent via Non-blocking
Mini-batches
- URL: http://arxiv.org/abs/2211.00889v1
- Date: Wed, 2 Nov 2022 05:25:01 GMT
- Title: Accelerating Parallel Stochastic Gradient Descent via Non-blocking
Mini-batches
- Authors: Haoze He, Parijat Dube
- Abstract summary: Non-blocking SGD can address the straggler problem in a heterogeneous environment.
Non-blocking SGD takes up to 2x less time to reach the same training loss in a heterogeneous environment.
- Score: 3.736244431175932
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: SOTA decentralized SGD algorithms can overcome the bandwidth bottleneck at
the parameter server by using communication collectives like Ring All-Reduce
for synchronization. While the parameter updates in distributed SGD may happen
asynchronously, there is still a synchronization barrier to make sure that the
local training epoch at every learner is complete before the learners can
advance to the next epoch. The delay in waiting for the slowest
learners (stragglers) remains a problem in the synchronization steps of
these state-of-the-art decentralized frameworks. In this paper, we propose
(de)centralized Non-blocking SGD, which addresses the
straggler problem in a heterogeneous environment. The main idea of Non-blocking
SGD is to split the original batch into mini-batches, then accumulate the
gradients and update the model based on finished mini-batches. The Non-blocking
idea can be implemented using decentralized algorithms including Ring
All-reduce, D-PSGD, and MATCHA to solve the straggler problem. Moreover, using
gradient accumulation to update the model also guarantees convergence and
avoids gradient staleness. A run-time analysis under random straggler delays and
varying device computational efficiency/throughput is also presented to show the
advantage of Non-blocking SGD. Experiments on a suite of datasets and deep
learning networks validate the theoretical analyses and demonstrate that
Non-blocking SGD speeds up training and accelerates convergence. Compared
with state-of-the-art decentralized asynchronous algorithms like D-PSGD and
MATCHA, Non-blocking SGD takes up to 2x less time to reach the same training
loss in a heterogeneous environment.
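To make the mechanism concrete, the following toy simulation sketches the non-blocking mini-batch idea described in the abstract: each learner splits its batch into mini-batches, and at the synchronization point the model is updated only with gradients from mini-batches that actually finished, so fast learners never block on stragglers. The quadratic objective, the timing model, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 10              # model dimension of the toy least-squares problem
N_WORKERS = 4         # number of learners
MB_PER_WORKER = 8     # each original batch is split into this many mini-batches
MB_SIZE = 32          # samples per mini-batch
LR = 0.1              # learning rate
SYNC_BUDGET = 0.75    # time allowed per round before synchronization

w = np.zeros(DIM)                           # shared model (replicated on every learner)
w_true = rng.normal(size=DIM)
speed = rng.uniform(0.5, 2.0, N_WORKERS)    # heterogeneous per-learner throughput

def minibatch_grad(w, rng):
    """Gradient of a least-squares loss on one freshly sampled mini-batch."""
    X = rng.normal(size=(MB_SIZE, DIM))
    return X.T @ (X @ w - X @ w_true) / MB_SIZE

for step in range(100):
    grad_sum, n_done = np.zeros(DIM), 0
    for k in range(N_WORKERS):
        # A learner with throughput speed[k] finishes roughly
        # speed[k] * MB_PER_WORKER mini-batches per unit of time; within the
        # synchronization budget it contributes only the ones it completed.
        finished = min(MB_PER_WORKER, int(SYNC_BUDGET * speed[k] * MB_PER_WORKER))
        for _ in range(finished):
            grad_sum += minibatch_grad(w, rng)
            n_done += 1
    # Accumulate the finished gradients and update; every contribution was
    # computed at the current model, so no stale gradients enter the update.
    if n_done:
        w -= LR * grad_sum / n_done

print("distance to optimum:", np.linalg.norm(w - w_true))
```

In the paper this accumulation is combined with decentralized communication (Ring All-Reduce, D-PSGD, MATCHA); the sketch only shows the per-round accumulation logic.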
Related papers
- Digital Twin-Assisted Federated Learning with Blockchain in Multi-tier Computing Systems [67.14406100332671]
In Industry 4.0 systems, resource-constrained edge devices engage in frequent data interactions.
This paper proposes a digital twin (DT)-assisted federated learning (FL) scheme.
The efficacy of our proposed cooperative interference-based FL process has been verified through numerical analysis.
arXiv Detail & Related papers (2024-11-04T17:48:02Z)
- Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning.
This paper analyzes the run time and staleness of distributed SGD based on stochastic delay differential equations (SDDEs) and an approximation of gradient arrivals.
Interestingly, it is shown that increasing the number of activated workers does not necessarily accelerate distributed SGD due to staleness.
arXiv Detail & Related papers (2024-06-17T02:56:55Z) - ABS-SGD: A Delayed Synchronous Stochastic Gradient Descent Algorithm
with Adaptive Batch Size for Heterogeneous GPU Clusters [9.885668723959125]
We propose a delayed synchronous distributed gradient descent algorithm with adaptive batch size (ABS-SGD) for heterogeneous GPU clusters.
In ABS-SGD, workers perform global synchronization to accumulate delayed gradients and use the accumulated delayed gradients to update parameters.
Extensive experiments in three types of heterogeneous clusters demonstrate that ABS-SGD can make full use of computational resources.
arXiv Detail & Related papers (2023-08-29T09:46:52Z) - Towards Understanding the Generalizability of Delayed Stochastic
Gradient Descent [63.43247232708004]
A gradient descent performed in an asynchronous manner plays a crucial role in training large-scale machine learning models.
Existing generalization error bounds are rather pessimistic and cannot reveal the correlation between asynchronous delays and generalization.
Our theoretical results indicate that asynchronous delays reduce the generalization error of the delayed SGD algorithm.
arXiv Detail & Related papers (2023-08-18T10:00:27Z) - Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays [8.46491234455848]
We prove much better guarantees for the same asynchronous gradient regardless of the delays in steps, depending instead just on the number of steps.
For our analysis, we introduce a novel based on "virtual steps" and delay iterations, which allow us to derive state-of-the-art guarantees for both convex non-adaptive gradients.
arXiv Detail & Related papers (2022-06-15T16:28:37Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum under the same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge
Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z) - DaSGD: Squeezing SGD Parallelization Performance in Distributed Training
Using Delayed Averaging [4.652668321425679]
Minibatch gradient descent (SGD) algorithm requires workers to halt forward/back propagations.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z) - Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
The communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows better convergence than error feedback for non-convex distributed problems.
We also propose DEFA to accelerate the generalization of DEF, which shows better bounds than DEF.
arXiv Detail & Related papers (2020-04-11T03:50:59Z) - Slow and Stale Gradients Can Win the Race [39.750046808758526]
Distributed Gradient Descent (SGD) when run in a synchronous manner, suffers from delays in runtime as it waits for the slowest workers (stragglers)
Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect the convergence error.
We present a novel theoretical characterization of the speedup offered by asynchronous methods by analyzing the trade-off between the error in the trained model and the actual training runtime.
arXiv Detail & Related papers (2020-03-23T23:27:50Z) - Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays
in Distributed SGD [32.03967072200476]
We propose an algorithmic approach named Overlap Local-SGD (and its momentum variant).
We achieve this by adding an anchor model on each node.
After multiple local updates, locally trained models will be pulled back towards the anchor model rather than communicating with others (see the toy sketch after this list).
arXiv Detail & Related papers (2020-02-21T20:33:49Z)
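Below is a minimal sketch of the delayed gradient-accumulation pattern summarized in the ABS-SGD entry above, under the assumption that each worker's batch size is scaled to its speed (so every worker finishes a round at roughly the same time) and that the globally accumulated gradient is applied one round late, which is what lets the real algorithm overlap communication with computation. The toy objective, speed model, and hyperparameters are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

DIM, N_WORKERS, LR, BASE_BATCH = 10, 4, 0.1, 16

w = np.zeros(DIM)
w_true = rng.normal(size=DIM)
speed = rng.uniform(0.5, 2.0, N_WORKERS)  # heterogeneous GPU throughput
pending = None                            # accumulated gradient from the previous round

def grad_on_batch(w, batch_size, rng):
    """Least-squares gradient on a freshly sampled batch of the given size."""
    X = rng.normal(size=(batch_size, DIM))
    return X.T @ (X @ w - X @ w_true) / batch_size

for step in range(100):
    # Adaptive batch size: a faster worker processes proportionally more
    # samples in the same wall-clock interval, so the round has no stragglers.
    grads = [grad_on_batch(w, max(1, int(BASE_BATCH * s)), rng) for s in speed]
    accumulated = np.mean(grads, axis=0)

    # Delayed synchronous update: apply the gradient accumulated in the
    # previous round (staleness of one step) while this round's gradients
    # would, in a real cluster, still be in flight.
    if pending is not None:
        w -= LR * pending
    pending = accumulated

print("distance to optimum:", np.linalg.norm(w - w_true))
```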
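The anchor-model idea from the Overlap Local-SGD entry can be sketched in the same toy setting: each node runs several local SGD steps and is then pulled back toward an anchor model instead of blocking on communication; here the anchor is simply refreshed with the average of the local models, standing in for the communication that the real algorithm overlaps with computation. The pull coefficient, refresh schedule, and objective are assumptions made for illustration, not the paper's exact hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(2)

DIM, N_NODES, LR, PULL, LOCAL_STEPS = 10, 4, 0.05, 0.5, 5

w_true = rng.normal(size=DIM)
local = [np.zeros(DIM) for _ in range(N_NODES)]   # one model per node
anchor = [np.zeros(DIM) for _ in range(N_NODES)]  # one anchor copy per node

def local_grad(w, rng):
    """Least-squares gradient on a small locally sampled batch."""
    X = rng.normal(size=(16, DIM))
    return X.T @ (X @ w - X @ w_true) / 16

for round_ in range(60):
    for k in range(N_NODES):
        for _ in range(LOCAL_STEPS):              # local updates, no communication
            local[k] -= LR * local_grad(local[k], rng)
        # Pull the locally trained model back toward the node's anchor instead
        # of synchronizing with the other nodes right away.
        local[k] -= PULL * (local[k] - anchor[k])
    # Stand-in for the overlapped communication: refresh every anchor with the
    # average of the local models once the round is over.
    avg = np.mean(local, axis=0)
    anchor = [avg.copy() for _ in range(N_NODES)]

print("distance to optimum:", np.linalg.norm(np.mean(local, axis=0) - w_true))
```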