$\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in
Decentralized Deep Learning
- URL: http://arxiv.org/abs/2306.08289v2
- Date: Wed, 6 Dec 2023 07:48:15 GMT
- Title: $\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in
Decentralized Deep Learning
- Authors: Adel Nabli (MLIA, Mila), Eugene Belilovsky (Mila), Edouard Oyallon
(MLIA)
- Abstract summary: We introduce a principled asynchronous, randomized, gossip-based optimization algorithm which works thanks to a continuous local momentum named $\textbf{A}^2\textbf{CiD}^2$.
Our theoretical analysis proves accelerated rates compared to previous asynchronous decentralized baselines.
We show consistent improvement on the ImageNet dataset using up to 64 asynchronous workers.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed training of Deep Learning models has been critical to many recent
successes in the field. Current standard methods primarily rely on synchronous
centralized algorithms which induce major communication bottlenecks and
synchronization locks at scale. Decentralized asynchronous algorithms are
emerging as a potential alternative but their practical applicability still
lags. In order to mitigate the increase in communication cost that naturally
comes with scaling the number of workers, we introduce a principled
asynchronous, randomized, gossip-based optimization algorithm which works
thanks to a continuous local momentum named $\textbf{A}^2\textbf{CiD}^2$. Our
method allows each worker to continuously process mini-batches without
stopping, and run a peer-to-peer averaging routine in parallel, reducing idle
time. In addition to inducing a significant communication acceleration at no
cost other than adding a local momentum variable, $\textbf{A}^2\textbf{CiD}^2$
requires only minimal adaptation to be incorporated into standard asynchronous
approaches. Our theoretical analysis proves accelerated rates compared to
previous asynchronous decentralized baselines, and we empirically show that
our $\textbf{A}^2\textbf{CiD}^2$ momentum significantly decreases
communication costs in poorly connected networks. In particular, we show
consistent improvement on the ImageNet dataset using up to 64 asynchronous
workers (A100 GPUs) and various communication network topologies.
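The abstract describes each worker continuously taking local mini-batch gradient steps while a randomized peer-to-peer averaging routine runs alongside, at the cost of a single extra local momentum variable. The snippet below is a minimal, single-process toy simulation of that asynchronous pattern (random pairwise gossip on a ring plus one momentum buffer per worker). The quadratic objectives, coupling constant, learning rate, and ring topology are illustrative assumptions; this is a sketch of the general idea, not the paper's actual $\textbf{A}^2\textbf{CiD}^2$ update rule.

```python
# Toy simulation of asynchronous decentralized SGD with pairwise gossip
# averaging and a local momentum-like variable. Illustrative only: the
# update rule, schedule, and hyperparameters are assumptions, not the
# authors' A^2CiD^2 algorithm.
import numpy as np

rng = np.random.default_rng(0)

n_workers = 8
dim = 10
lr = 0.05
coupling = 0.5      # assumed strength of the local momentum coupling
steps = 2000

# Each worker holds a local quadratic objective f_i(x) = 0.5 * ||x - t_i||^2,
# so the global minimizer is the mean of the targets t_i.
targets = rng.normal(size=(n_workers, dim))
x = rng.normal(size=(n_workers, dim))   # local parameters
m = np.zeros_like(x)                    # local momentum variables

# Ring topology: worker i can gossip with i-1 and i+1 (mod n).
neighbors = {i: [(i - 1) % n_workers, (i + 1) % n_workers] for i in range(n_workers)}

for t in range(steps):
    # A randomly chosen worker takes a local gradient step (asynchrony is
    # modeled by updating one worker at a time, in random order).
    i = rng.integers(n_workers)
    grad = x[i] - targets[i]
    m[i] = coupling * m[i] + grad       # accumulate into the local momentum
    x[i] = x[i] - lr * m[i]

    # In parallel (here: interleaved), a random edge performs one pairwise
    # peer-to-peer average of parameters, i.e. one randomized gossip step.
    j = rng.integers(n_workers)
    k = rng.choice(neighbors[j])
    avg = 0.5 * (x[j] + x[k])
    x[j] = avg
    x[k] = avg

consensus_error = np.mean(np.linalg.norm(x - x.mean(axis=0), axis=1))
distance_to_opt = np.linalg.norm(x.mean(axis=0) - targets.mean(axis=0))
print(f"consensus error: {consensus_error:.4f}, distance to optimum: {distance_to_opt:.4f}")
```

In an actual multi-GPU deployment the gradient loop and the averaging routine would run concurrently (separate threads or streams) so neither blocks the other; the simulation above only interleaves the two steps to keep the example self-contained.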
Related papers
- ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training [16.560270624096706]
We propose a memory-efficient optimization algorithm tailored for distributed training of Large Language Models.
Our method relies on a novel technique to mitigate the one-step delay inherent in parallel execution of gradient computations and communications.
arXiv Detail & Related papers (2024-06-03T08:23:45Z) - Shadowheart SGD: Distributed Asynchronous SGD with Optimal Time Complexity Under Arbitrary Computation and Communication Heterogeneity [85.92481138826949]
We develop a new method, Shadowheart SGD, that provably improves the time complexities of all previous centralized methods.
We also consider the bidirectional setup, where broadcasting from the server to the workers is non-negligible, and develop a corresponding method.
arXiv Detail & Related papers (2024-02-07T12:15:56Z) - Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices [0.0]
Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters.
We have framed our asynchronous SGD loss function as a block structured optimization problem with delayed updates.
arXiv Detail & Related papers (2024-01-03T13:07:07Z) - Communication-Efficient Decentralized Federated Learning via One-Bit
Compressive Sensing [52.402550431781805]
Decentralized federated learning (DFL) has gained popularity due to its practicality across various applications.
Compared to the centralized version, training a shared model among a large number of nodes in DFL is more challenging.
We develop a novel algorithm based on the framework of the inexact alternating direction method (iADM).
arXiv Detail & Related papers (2023-08-31T12:22:40Z) - Accelerating Distributed ML Training via Selective Synchronization [0.0]
SelSync is a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step.
Our system converges to the same or better accuracy than BSP while reducing training time by up to 14$\times$.
arXiv Detail & Related papers (2023-07-16T05:28:59Z) - Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z) - DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization [0.0]
DADAO is the first decentralized, accelerated, asynchronous, primal, first-order algorithm to minimize a sum of $L$-smooth and $\mu$-strongly convex functions distributed over a given network of size $n$.
We show that our algorithm requires $\mathcal{O}(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon}))$ local gradient steps and only $\mathcal{O}(n\sqrt{\chi}\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon}))$ communications.
arXiv Detail & Related papers (2022-07-26T08:47:54Z) - Locally Asynchronous Stochastic Gradient Descent for Decentralised Deep
Learning [0.0]
Local Asynchronous SGD (LASGD) is an asynchronous decentralized algorithm that relies on All Reduce for model synchronization.
We empirically validate LASGD's performance on image classification tasks on the ImageNet dataset.
arXiv Detail & Related papers (2022-03-24T14:25:15Z) - Asynchronous Parallel Incremental Block-Coordinate Descent for
Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z) - Acceleration in Distributed Optimization Under Similarity [72.54787082152278]
We study distributed (strongly convex) optimization problems over a network of agents, with no centralized nodes.
An $\varepsilon$-solution is achieved in $\tilde{\mathcal{O}}\big(\sqrt{\frac{\beta/\mu}{1-\rho}}\log 1/\varepsilon\big)$ communication steps.
This rate matches, for the first time and up to poly-log factors, the lower communication complexity bounds of distributed gossip algorithms for the class of problems of interest.
arXiv Detail & Related papers (2021-10-24T04:03:00Z) - A Linearly Convergent Algorithm for Decentralized Optimization: Sending
Less Bits for Free! [72.31332210635524]
Decentralized optimization methods enable on-device training of machine learning models without a central coordinator.
We propose a new randomized first-order method which tackles the communication bottleneck by applying randomized compression operators (a toy example of such an operator is sketched after this list).
We prove that our method can solve the problems without any increase in the number of communications compared to the baseline.
arXiv Detail & Related papers (2020-11-03T13:35:53Z)
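The last entry above reduces communication by applying randomized compression operators to the messages workers exchange. A standard example of such an operator is rand-$k$ sparsification; the snippet below is a hedged, illustrative sketch of it and is not taken from that paper, whose specific operator and scaling may differ.

```python
# Toy rand-k sparsification: a common randomized compression operator.
# Illustrative only; the operator used in the paper above may differ.
import numpy as np

def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Keep k uniformly chosen coordinates of x, rescaled by d/k so that
    the result is an unbiased estimate of x (E[rand_k(x)] = x)."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = x[idx] * (d / k)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=100)
c = rand_k(x, k=10, rng=rng)
# Only the k surviving coordinates (indices + values) need to be transmitted.
print(np.count_nonzero(c), "of", x.size, "coordinates sent")
```

In a decentralized loop, each worker would send a compressed difference such as rand_k(x_i - x_hat_i) to its neighbors instead of the full parameter vector, trading per-message bandwidth for extra variance in the updates.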
This list is automatically generated from the titles and abstracts of the papers on this site.