Asynchronous Decentralized Distributed Training of Acoustic Models
- URL: http://arxiv.org/abs/2110.11199v1
- Date: Thu, 21 Oct 2021 15:14:58 GMT
- Title: Asynchronous Decentralized Distributed Training of Acoustic Models
- Authors: Xiaodong Cui, Wei Zhang, Abdullah Kayi, Mingrui Liu, Ulrich Finkler,
Brian Kingsbury, George Saon, David Kung
- Abstract summary: We study three variants of asynchronous decentralized parallel SGD (ADPSGD).
We show that the ADPSGD variants with fixed and randomized communication patterns cope well with slow learners.
In particular, using the delay-by-one strategy, we can train the acoustic model in less than 2 hours.
- Score: 43.34839658423581
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale distributed training of deep acoustic models plays an important
role in today's high-performance automatic speech recognition (ASR). In this
paper we investigate a variety of asynchronous decentralized distributed
training strategies based on data parallel stochastic gradient descent (SGD) to
show their superior performance over the commonly-used synchronous distributed
training via allreduce, especially when dealing with large batch sizes.
Specifically, we study three variants of asynchronous decentralized parallel
SGD (ADPSGD), namely, fixed and randomized communication patterns on a ring as
well as a delay-by-one scheme. We introduce a mathematical model of ADPSGD,
give its theoretical convergence rate, and compare the empirical convergence
behavior and straggler resilience properties of the three variants. Experiments
are carried out on an IBM supercomputer for training deep long short-term
memory (LSTM) acoustic models on the 2000-hour Switchboard dataset. Recognition
and speedup performance of the proposed strategies are evaluated under various
training configurations. We show that the ADPSGD variants with fixed and randomized
communication patterns cope well with slow learners. When learners are equally
fast, ADPSGD with the delay-by-one strategy has the fastest convergence with
large batches. In particular, using the delay-by-one strategy, we can train the
acoustic model in less than 2 hours using 128 V100 GPUs with competitive word
error rates.
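To make the three communication variants concrete, here is a minimal numpy simulation of decentralized SGD on a ring. The toy quadratic local losses, learner count, mixing weights, and the exact placement of the one-step delay are illustrative assumptions, not the paper's IBM implementation.

```python
# Toy numpy simulation of the three ADPSGD communication variants on a ring.
# Quadratic local losses, learner count, mixing weights, and the placement of the
# one-step delay are illustrative assumptions, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
N, D, STEPS, LR = 8, 4, 300, 0.05          # learners, model size, iterations, step size
targets = rng.normal(size=(N, D))          # each learner has its own local quadratic loss
VARIANT = "delay_by_one"                   # "fixed" | "randomized" | "delay_by_one"

def local_grads(w):
    """Per-learner gradients of the local losses ||w_i - target_i||^2 / 2."""
    return w - targets

def ring_mix(w, offset):
    """Pairwise-average each learner with the ring neighbour `offset` hops away."""
    return 0.5 * (w + np.roll(w, offset, axis=0))

w = np.zeros((N, D))
prev_grads = np.zeros((N, D))
for t in range(STEPS):
    grads = local_grads(w)
    if VARIANT == "fixed":                 # always exchange with the same neighbour
        w = ring_mix(w, 1) - LR * grads
    elif VARIANT == "randomized":          # exchange with a randomly chosen neighbour
        w = ring_mix(w, int(rng.integers(1, N))) - LR * grads
    else:                                  # delay-by-one: apply the gradient computed in the
        w = ring_mix(w, 1) - LR * prev_grads   # previous step, so compute and exchange overlap
    prev_grads = grads

consensus = w.mean(axis=0)
print("max spread across learners :", np.abs(w - consensus).max())
print("distance to global optimum :", np.linalg.norm(consensus - targets.mean(axis=0)))
```

In this sketch the fixed and randomized variants apply the gradient computed on the pre-mixing weights, while the delay-by-one variant applies the previous step's gradient so that gradient computation and weight exchange can run concurrently.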
Related papers
- Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning.
As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers.
We propose straggler-aware layer-wise federated learning (SALF) that leverages the layer-by-layer structure of backpropagation to update the global model in a layer-wise fashion.
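The layer-wise idea lends itself to a short sketch: because backpropagation produces gradients from the output layer backwards, a straggler that only finished part of its backward pass can still contribute its deepest layers, and each layer is averaged over whichever devices reached it. The device count, layer shapes, and averaging rule below are assumptions for illustration, not the SALF algorithm as specified in the paper.

```python
# Hedged sketch of layer-wise, straggler-aware aggregation: each device reports
# gradients only for the layers it completed before the deadline (backprop order:
# last layer first), and the server averages each layer over its contributors.
import numpy as np

NUM_LAYERS, NUM_DEVICES, DIM = 6, 5, 3
rng = np.random.default_rng(1)

# grads[d] maps layer index -> gradient, holding only the layers device d completed.
grads = []
for d in range(NUM_DEVICES):
    completed = rng.integers(1, NUM_LAYERS + 1)   # how many layers this device finished
    grads.append({l: rng.normal(size=DIM)
                  for l in range(NUM_LAYERS - 1, NUM_LAYERS - 1 - completed, -1)})

global_update = {}
for layer in range(NUM_LAYERS):
    contributions = [g[layer] for g in grads if layer in g]
    if contributions:                             # average over devices that reached this layer
        global_update[layer] = np.mean(contributions, axis=0)

for layer, upd in sorted(global_update.items()):
    print(f"layer {layer}: averaged over {sum(layer in g for g in grads)} devices, "
          f"update norm {np.linalg.norm(upd):.3f}")
```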
arXiv Detail & Related papers (2024-03-27T09:14:36Z) - Efficient Diffusion Training via Min-SNR Weighting Strategy [78.5801305960993]
We treat the diffusion training as a multi-task learning problem and introduce a simple yet effective approach referred to as Min-SNR-$\gamma$.
Our results demonstrate a significant improvement in convergence speed, 3.4$\times$ faster than previous weighting strategies.
It is also more effective, achieving a new record FID score of 2.06 on the ImageNet $256\times256$ benchmark using smaller architectures than those employed in previous state-of-the-art methods.
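For reference, the clipped-SNR weight is simple to compute. The sketch below assumes a standard DDPM-style linear beta schedule and an illustrative $\gamma$; neither is taken from the paper.

```python
# Sketch of a clipped-SNR loss weighting of the Min-SNR kind:
# w_t = min(SNR(t), gamma), with SNR(t) = alpha_bar_t / (1 - alpha_bar_t).
# The linear beta schedule and gamma value below are illustrative assumptions.
import numpy as np

T, GAMMA = 1000, 5.0
betas = np.linspace(1e-4, 0.02, T)          # DDPM-style linear schedule (assumption)
alphas_bar = np.cumprod(1.0 - betas)
snr = alphas_bar / (1.0 - alphas_bar)

w_x0  = np.minimum(snr, GAMMA)              # weight when the network predicts x_0
w_eps = np.minimum(snr, GAMMA) / snr        # equivalent weight for an epsilon-prediction loss

for t in (0, 250, 500, 999):
    print(f"t={t:4d}  SNR={snr[t]:9.3f}  w_x0={w_x0[t]:.3f}  w_eps={w_eps[t]:.3f}")
```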
arXiv Detail & Related papers (2023-03-16T17:59:56Z) - Design and Prototyping Distributed CNN Inference Acceleration in Edge
Computing [85.74517957717363]
HALP accelerates inference by orchestrating seamless collaboration among edge devices (EDs) in edge computing.
Experiments show that distributed inference with HALP achieves 1.7x inference acceleration for VGG-16.
It is shown that model selection combined with HALP distributed inference can significantly improve service reliability.
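One way to picture such device collaboration is receptive-field-aware partitioning of the input, where each edge device receives a strip of the image plus enough halo rows to compute its share of the output independently. The sketch below is a generic partitioning illustration under assumed layer sizes, not the HALP system itself.

```python
# Hedged sketch of receptive-field-aware partitioning for collaborative CNN inference:
# each device takes a horizontal strip of the input plus a halo wide enough to compute
# its share of a 3x3, stride-1, same-padded conv layer on its own. Image height,
# device count, and kernel size are assumptions, not HALP's actual configuration.
H = 224                                   # input height (rows)
NUM_DEVICES = 4
KERNEL = 3
HALO = KERNEL // 2                        # extra rows of context needed on each side

rows_per_device = H // NUM_DEVICES
for d in range(NUM_DEVICES):
    out_start = d * rows_per_device
    out_end = H if d == NUM_DEVICES - 1 else (d + 1) * rows_per_device
    in_start = max(0, out_start - HALO)   # extend by the halo, clipped at the image border
    in_end = min(H, out_end + HALO)
    print(f"device {d}: input rows [{in_start}, {in_end}) -> output rows [{out_start}, {out_end})")
```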
arXiv Detail & Related papers (2022-11-24T19:48:30Z) - Distributed Adversarial Training to Robustify Deep Neural Networks at
Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach, known as adversarial training (AT), has been shown to improve model robustness.
We propose a large-batch adversarial training framework implemented over multiple machines.
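A minimal sketch of the overall loop, assuming a toy logistic-regression model and an in-process stand-in for the cross-machine gradient averaging (a real deployment would use an allreduce): each worker runs a few PGD steps to craft adversarial examples on its shard, and the resulting gradients are averaged before the model update.

```python
# Hedged sketch of distributed adversarial training on a toy linear model.
# The model, loss, PGD settings, and in-process "allreduce" are assumptions.
import numpy as np

rng = np.random.default_rng(0)
WORKERS, N_PER_WORKER, D = 4, 64, 10
EPS, PGD_STEPS, PGD_LR, LR = 0.1, 3, 0.05, 0.1

w = np.zeros(D)
shards = [(rng.normal(size=(N_PER_WORKER, D)), rng.choice([-1.0, 1.0], N_PER_WORKER))
          for _ in range(WORKERS)]

def grad_logistic(w, X, y):
    """Gradient of the mean logistic loss log(1 + exp(-y * Xw)) w.r.t. w."""
    coeff = -y / (1.0 + np.exp(y * (X @ w)))
    return (coeff[:, None] * X).mean(axis=0)

for step in range(50):
    worker_grads = []
    for X, y in shards:
        X_adv = X.copy()
        for _ in range(PGD_STEPS):                       # inner maximization: PGD on the inputs
            dL_dx = (-y / (1.0 + np.exp(y * (X_adv @ w))))[:, None] * w[None, :]
            X_adv = np.clip(X_adv + PGD_LR * np.sign(dL_dx), X - EPS, X + EPS)
        worker_grads.append(grad_logistic(w, X_adv, y))  # outer minimization gradient
    w -= LR * np.mean(worker_grads, axis=0)              # stand-in for an allreduce average

print("trained weight norm:", np.linalg.norm(w))
```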
arXiv Detail & Related papers (2022-06-13T15:39:43Z) - DBS: Dynamic Batch Size For Distributed Deep Neural Network Training [19.766163856388694]
We propose the Dynamic Batch Size (DBS) strategy for the distributed training of Deep Neural Networks (DNNs).
Specifically, the performance of each worker is first evaluated based on measurements from the previous epoch, and then the batch size and dataset partition are dynamically adjusted.
The experimental results indicate that the proposed strategy can fully utilize the performance of the cluster, reduce the training time, and remain robust to disturbance from irrelevant tasks.
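A minimal sketch of the re-partitioning step, with made-up throughput numbers standing in for the per-worker measurements the paper collects:

```python
# Hedged sketch of a dynamic-batch-size split: the global batch is re-divided in
# proportion to each worker's measured throughput, so faster workers get more samples.
GLOBAL_BATCH = 1024

def rebalance(throughputs, global_batch=GLOBAL_BATCH):
    """Split the global batch proportionally to per-worker samples/sec."""
    total = sum(throughputs)
    sizes = [max(1, round(global_batch * t / total)) for t in throughputs]
    sizes[-1] += global_batch - sum(sizes)      # fix rounding so sizes sum exactly
    return sizes

# e.g. worker 2 was slowed down by an interfering job during the last epoch (assumed numbers)
measured = [410.0, 395.0, 150.0, 402.0]         # samples/sec observed last epoch
print(rebalance(measured))                      # the slower worker receives a smaller local batch
```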
arXiv Detail & Related papers (2020-07-23T07:31:55Z) - Adaptive Periodic Averaging: A Practical Approach to Reducing
Communication in Distributed Learning [6.370766463380455]
We show that the optimal averaging period in terms of convergence and communication cost is not a constant, but instead varies over the course of the execution.
We propose a practical algorithm, named ADaptive Periodic parameter averaging SGD (ADPSGD), to achieve a smaller overall variance of model parameters.
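As a deliberately simplified illustration, the averaging period can be made a function of training progress rather than a constant; the linear-growth schedule below is an assumption for illustration and is not the schedule derived in the paper.

```python
# Hedged sketch of a non-constant model-averaging period: the interval between
# parameter-averaging rounds changes over the run instead of staying fixed.
def averaging_period(epoch, p_min=1, p_max=16, total_epochs=20):
    """Averaging period grows linearly from p_min to p_max over the run (assumed schedule)."""
    frac = min(1.0, epoch / max(1, total_epochs - 1))
    return int(round(p_min + frac * (p_max - p_min)))

for epoch in range(0, 20, 4):
    print(f"epoch {epoch:2d}: average local models every {averaging_period(epoch)} local steps")
```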
arXiv Detail & Related papers (2020-07-13T00:04:55Z) - DaSGD: Squeezing SGD Parallelization Performance in Distributed Training
Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagation while gradients are synchronized.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
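A small sketch of how a one-step delay lets the averaging run concurrently with the next forward/backward pass; the thread-based timing model and numbers are assumptions, not DaSGD's implementation.

```python
# Hedged sketch of hiding communication behind computation with delayed averaging:
# the averaging of step t's weights runs in a background thread while step t+1's
# "compute" proceeds, and its one-step-stale result is applied afterwards.
import time
from concurrent.futures import ThreadPoolExecutor

COMPUTE_SEC, COMM_SEC, STEPS = 0.05, 0.04, 5

def compute_gradients(step):
    time.sleep(COMPUTE_SEC)                     # stand-in for a forward/backward pass
    return f"grads@{step}"

def average_models(step):
    time.sleep(COMM_SEC)                        # stand-in for the averaging communication
    return f"avg@{step}"

start = time.time()
with ThreadPoolExecutor(max_workers=1) as comm:
    pending = comm.submit(average_models, -1)   # warm-up average already "in flight"
    for step in range(STEPS):
        grads = compute_gradients(step)         # overlaps with the pending average
        stale_avg = pending.result()            # ready by now if COMM_SEC <= COMPUTE_SEC
        print(f"step {step}: apply {stale_avg} then {grads}")
        pending = comm.submit(average_models, step)
    pending.result()                            # drain the last averaging round
elapsed = time.time() - start
print(f"wall time ~{elapsed:.2f}s vs {STEPS * (COMPUTE_SEC + COMM_SEC):.2f}s if serialized")
```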
arXiv Detail & Related papers (2020-05-31T05:43:50Z) - Distributed Training of Deep Neural Network Acoustic Models for
Automatic Speech Recognition [33.032361181388886]
We provide an overview of distributed training techniques for deep neural network acoustic models for ASR.
Experiments are carried out on a popular public benchmark to study the convergence, speedup and recognition performance of the investigated strategies.
arXiv Detail & Related papers (2020-02-24T19:31:50Z) - Improving Efficiency in Large-Scale Decentralized Distributed Training [58.80224380923698]
We propose techniques to accelerate (A)D-PSGD based training by improving the spectral gap while minimizing the communication cost.
We demonstrate the effectiveness of our proposed techniques by running experiments on the 2000-hour Switchboard speech recognition task and the ImageNet computer vision task.
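The spectral gap in question is a property of the mixing (averaging) matrix and controls how fast the averaging steps drive the learners toward consensus. A minimal sketch, assuming a symmetric ring where each learner averages equally with itself and its two neighbours, shows how the gap shrinks as the ring grows:

```python
# Spectral gap of a simple ring mixing matrix (1 minus the second-largest
# eigenvalue magnitude). The equal 1/3 weights are an illustrative assumption.
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix: 1/3 self, 1/3 each ring neighbour."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

for n in (4, 8, 16, 32):
    eigvals = np.sort(np.abs(np.linalg.eigvals(ring_mixing_matrix(n))))[::-1]
    print(f"{n:2d} learners on a ring: spectral gap = {1.0 - eigvals[1]:.4f}")
```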
arXiv Detail & Related papers (2020-02-04T04:29:09Z) - Elastic Consistency: A General Consistency Model for Distributed
Stochastic Gradient Descent [28.006781039853575]
A key element behind the progress of machine learning in recent years has been the ability to train machine learning models in large-scale distributed-memory environments.
In this paper, we introduce a general consistency model for the distributed SGD methods used in practice to train large-scale machine learning models.
Our framework, called elastic consistency, enables us to derive convergence bounds for a variety of distributed SGD methods.
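For orientation, a consistency condition of this kind is typically stated as a bound on how far each process's possibly stale view can drift from the reference iterate of the analysis. The display below is a hedged paraphrase of that shape; the norm, moment, and constants are chosen for illustration and may differ from the paper's exact definition.

```latex
% Hedged paraphrase of an elastic-consistency-style condition (not the paper's exact statement):
% process i evaluates its stochastic gradient at a possibly stale view \tilde{x}_t^{\,i},
% which must stay within a bounded distance of the reference iterate x_t.
\[
  \mathbb{E}\,\bigl\| x_t - \tilde{x}_t^{\,i} \bigr\| \;\le\; B
  \qquad \text{for every process } i \text{ and step } t,
\]
```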
arXiv Detail & Related papers (2020-01-16T16:10:58Z)