Improving Efficiency in Large-Scale Decentralized Distributed Training
- URL: http://arxiv.org/abs/2002.01119v1
- Date: Tue, 4 Feb 2020 04:29:09 GMT
- Title: Improving Efficiency in Large-Scale Decentralized Distributed Training
- Authors: Wei Zhang, Xiaodong Cui, Abdullah Kayi, Mingrui Liu, Ulrich Finkler,
Brian Kingsbury, George Saon, Youssef Mroueh, Alper Buyuktosunoglu, Payel
Das, David Kung, Michael Picheny
- Abstract summary: We propose techniques to accelerate (A)D-PSGD based training by improving the spectral gap while minimizing the communication cost.
We demonstrate the effectiveness of our proposed techniques by running experiments on the 2000-hour Switchboard speech recognition task and the ImageNet computer vision task.
- Score: 58.80224380923698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decentralized Parallel SGD (D-PSGD) and its asynchronous variant,
Asynchronous Decentralized Parallel SGD (AD-PSGD), form a family of distributed
learning algorithms that have been demonstrated to perform well for large-scale
deep learning tasks. One
drawback of (A)D-PSGD is that the spectral gap of the mixing matrix decreases
when the number of learners in the system increases, which hampers convergence.
In this paper, we investigate techniques to accelerate (A)D-PSGD based training
by improving the spectral gap while minimizing the communication cost. We
demonstrate the effectiveness of our proposed techniques by running experiments
on the 2000-hour Switchboard speech recognition task and the ImageNet computer
vision task. On an IBM P9 supercomputer, our system is able to train an LSTM
acoustic model in 2.28 hours with 7.5% WER on the Hub5-2000 Switchboard (SWB)
test set and 13.3% WER on the CallHome (CH) test set using 64 V100 GPUs and in
1.98 hours with 7.7% WER on SWB and 13.3% WER on CH using 128 V100 GPUs, the
fastest training time reported to date.
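
The spectral-gap bottleneck described in the abstract can be made concrete with a quick numerical check. The following is a minimal sketch, not code from the paper: it assumes a simple symmetric ring topology (each learner mixes equally with itself and its two neighbours, an illustrative assumption since the abstract does not specify the topology) and computes the spectral gap of the resulting mixing matrix, which shrinks as the number of learners grows.

```python
# Minimal sketch (not from the paper): spectral gap of a hypothetical symmetric
# ring mixing matrix, showing why the gap shrinks as the number of learners grows.
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix where each learner averages equally
    with itself and its two ring neighbours (weight 1/3 each)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1 / 3
        W[i, (i - 1) % n] = 1 / 3
        W[i, (i + 1) % n] = 1 / 3
    return W

def spectral_gap(W):
    """1 minus the second-largest eigenvalue magnitude; a larger gap means faster mixing."""
    eigvals = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return 1.0 - eigvals[1]

for n in (8, 16, 32, 64, 128):
    print(f"learners={n:4d}  spectral gap={spectral_gap(ring_mixing_matrix(n)):.4f}")
```

Topologies or mixing schedules with a larger spectral gap propagate information faster but typically cost more communication per step, which is the trade-off the paper targets.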
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Low-Latency Cooperative Spectrum Sensing via Truncated Vertical Federated Learning [51.51440623636274]
We propose a vertical federated learning (VFL) framework to exploit the distributed features across multiple secondary users (SUs) without compromising data privacy.
To accelerate the training process, we propose a truncated vertical federated learning (T-VFL) algorithm.
The convergence performance of T-VFL is established via mathematical analysis and validated by simulation results.
arXiv Detail & Related papers (2022-08-07T10:39:27Z)
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- Loss Landscape Dependent Self-Adjusting Learning Rates in Decentralized Stochastic Gradient Descent [37.52828820578212]
Distributed Deep Learning (DDL) is essential for large-scale Deep Learning (DL) training.
In a large batch setting, the learning rate must be increased to compensate for the reduced number of parameter updates.
Recently, Decentralized Parallel SGD (DPSGD) has been proposed to improve training speed.
arXiv Detail & Related papers (2021-12-02T17:23:25Z)
- Asynchronous Decentralized Distributed Training of Acoustic Models [43.34839658423581]
We study three variants of asynchronous decentralized parallel SGD (ADPSGD).
We show that ADPSGD with fixed and randomized communication patterns copes well with slow learners.
In particular, using the delay-by-one strategy, we can train the acoustic model in less than 2 hours.
arXiv Detail & Related papers (2021-10-21T15:14:58Z)
- Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks [13.552262050816616]
Kronecker-Factored Approximate Curvature (KFAC) is one of the most efficient approximation algorithms for training deep models.
Yet, when leveraging GPU clusters to train models with KFAC, it incurs extensive computation and introduces extra communication during each iteration.
We propose D-KFAC with smart parallelism of computing and communication tasks to reduce the iteration time.
arXiv Detail & Related papers (2021-07-14T08:01:07Z)
- Learning to Efficiently Sample from Diffusion Probabilistic Models [49.58748345998702]
Denoising Diffusion Probabilistic Models (DDPMs) can yield high-fidelity samples and competitive log-likelihoods across a range of domains.
We introduce an exact dynamic programming algorithm that finds the optimal discrete time schedules for any pre-trained DDPM.
arXiv Detail & Related papers (2021-06-07T17:15:07Z)
- Adaptive Periodic Averaging: A Practical Approach to Reducing Communication in Distributed Learning [6.370766463380455]
We show that the optimal averaging period in terms of convergence and communication cost is not a constant, but instead varies over the course of the execution.
We propose a practical algorithm, named ADaptive Periodic parameter averaging SGD (ADPSGD), to achieve a smaller overall variance of model parameters.
arXiv Detail & Related papers (2020-07-13T00:04:55Z)
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch gradient descent (SGD) algorithm requires workers to halt forward/backward propagation during gradient synchronization.
DaSGD parallelizes SGD with forward/backward propagation to hide 100% of the communication overhead (see the sketch after this list).
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
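
Several of the entries above (the delay-by-one ADPSGD variant, adaptive periodic averaging, and DaSGD's delayed averaging) revolve around the same idea: apply a slightly stale average so that communication can overlap with computation. The toy simulation below is a minimal, hypothetical sketch of that one-step-delay idea on a quadratic objective; it is not code from any of the listed papers, and the worker count, learning rate, and noise model are illustrative assumptions.

```python
# Hypothetical sketch of one-step-delayed gradient averaging (toy quadratic loss).
import numpy as np

rng = np.random.default_rng(0)
K, dim, lr, steps = 4, 10, 0.1, 200    # workers, model size, step size, iterations (assumed)
target = rng.normal(size=dim)          # optimum of the toy quadratic loss
w = np.zeros(dim)                      # model (kept identical across workers in this toy)
delayed_avg = np.zeros(dim)            # averaged gradient from the previous step

def local_gradient(w):
    """Gradient of 0.5 * ||w - target||^2 plus worker-specific noise."""
    return (w - target) + 0.01 * rng.normal(size=dim)

for t in range(steps):
    # Compute fresh local gradients; in a real system this forward/backward pass
    # would overlap with the all-reduce of the previous step's gradients.
    fresh_avg = np.mean([local_gradient(w) for _ in range(K)], axis=0)
    w -= lr * delayed_avg              # update with the one-step-stale average
    delayed_avg = fresh_avg            # this average "arrives" in time for the next step

print("distance to optimum after delayed-averaging SGD:", np.linalg.norm(w - target))
```

Because the update at step t uses the average computed at step t-1, synchronization can be kept off the critical path, which is the mechanism the delay-by-one and DaSGD summaries refer to.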
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.