Asynchronous Decentralized Distributed Training of Acoustic Models
- URL: http://arxiv.org/abs/2110.11199v1
- Date: Thu, 21 Oct 2021 15:14:58 GMT
- Title: Asynchronous Decentralized Distributed Training of Acoustic Models
- Authors: Xiaodong Cui, Wei Zhang, Abdullah Kayi, Mingrui Liu, Ulrich Finkler,
Brian Kingsbury, George Saon, David Kung
- Abstract summary: We study three variants of asynchronous decentralized parallel SGD (ADPSGD).
We show that the ADPSGD variants with fixed and randomized communication patterns cope well with slow learners.
In particular, using the delay-by-one strategy, we can train the acoustic model in less than 2 hours.
- Score: 43.34839658423581
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale distributed training of deep acoustic models plays an important
role in today's high-performance automatic speech recognition (ASR). In this
paper we investigate a variety of asynchronous decentralized distributed
training strategies based on data parallel stochastic gradient descent (SGD) to
show their superior performance over the commonly-used synchronous distributed
training via allreduce, especially when dealing with large batch sizes.
Specifically, we study three variants of asynchronous decentralized parallel
SGD (ADPSGD), namely, fixed and randomized communication patterns on a ring as
well as a delay-by-one scheme. We introduce a mathematical model of ADPSGD,
give its theoretical convergence rate, and compare the empirical convergence
behavior and straggler resilience properties of the three variants. Experiments
are carried out on an IBM supercomputer for training deep long short-term
memory (LSTM) acoustic models on the 2000-hour Switchboard dataset. Recognition
and speedup performance of the proposed strategies are evaluated under various
training configurations. We show that the ADPSGD variants with fixed and randomized
communication patterns cope well with slow learners. When learners are equally
fast, ADPSGD with the delay-by-one strategy has the fastest convergence with
large batches. In particular, using the delay-by-one strategy, we can train the
acoustic model in less than 2 hours using 128 V100 GPUs with competitive word
error rates.
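To make the three communication variants concrete, here is a minimal numpy simulation of decentralized SGD on a ring. The toy quadratic local losses, learner count, mixing weights, and the exact placement of the one-step delay are illustrative assumptions, not the paper's IBM implementation.

```python
# Toy numpy simulation of the three ADPSGD communication variants on a ring.
# Quadratic local losses, learner count, mixing weights, and the placement of the
# one-step delay are illustrative assumptions, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
N, D, STEPS, LR = 8, 4, 300, 0.05          # learners, model size, iterations, step size
targets = rng.normal(size=(N, D))          # each learner has its own local quadratic loss
VARIANT = "delay_by_one"                   # "fixed" | "randomized" | "delay_by_one"

def local_grads(w):
    """Per-learner gradients of the local losses ||w_i - target_i||^2 / 2."""
    return w - targets

def ring_mix(w, offset):
    """Pairwise-average each learner with the ring neighbour `offset` hops away."""
    return 0.5 * (w + np.roll(w, offset, axis=0))

w = np.zeros((N, D))
prev_grads = np.zeros((N, D))
for t in range(STEPS):
    grads = local_grads(w)
    if VARIANT == "fixed":                 # always exchange with the same neighbour
        w = ring_mix(w, 1) - LR * grads
    elif VARIANT == "randomized":          # exchange with a randomly chosen neighbour
        w = ring_mix(w, int(rng.integers(1, N))) - LR * grads
    else:                                  # delay-by-one: apply the gradient computed in the
        w = ring_mix(w, 1) - LR * prev_grads   # previous step, so compute and exchange overlap
    prev_grads = grads

consensus = w.mean(axis=0)
print("max spread across learners :", np.abs(w - consensus).max())
print("distance to global optimum :", np.linalg.norm(consensus - targets.mean(axis=0)))
```

In this sketch the fixed and randomized variants apply the gradient computed on the pre-mixing weights, while the delay-by-one variant applies the previous step's gradient so that gradient computation and weight exchange can run concurrently.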
Related papers
- Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning.
As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers.
We propose straggler-aware layer-wise federated learning (SALF) that leverages the layer-by-layer structure of backpropagation to update the global model in a layer-wise fashion.
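The layer-wise idea lends itself to a short sketch: because backpropagation produces gradients from the output layer backwards, a straggler that only finished part of its backward pass can still contribute its deepest layers, and each layer is averaged over whichever devices reached it. The device count, layer shapes, and averaging rule below are assumptions for illustration, not the SALF algorithm as specified in the paper.

```python
# Hedged sketch of layer-wise, straggler-aware aggregation: each device reports
# gradients only for the layers it completed before the deadline (backprop order:
# last layer first), and the server averages each layer over its contributors.
import numpy as np

NUM_LAYERS, NUM_DEVICES, DIM = 6, 5, 3
rng = np.random.default_rng(1)

# grads[d] maps layer index -> gradient, holding only the layers device d completed.
grads = []
for d in range(NUM_DEVICES):
    completed = rng.integers(1, NUM_LAYERS + 1)   # how many layers this device finished
    grads.append({l: rng.normal(size=DIM)
                  for l in range(NUM_LAYERS - 1, NUM_LAYERS - 1 - completed, -1)})

global_update = {}
for layer in range(NUM_LAYERS):
    contributions = [g[layer] for g in grads if layer in g]
    if contributions:                             # average over devices that reached this layer
        global_update[layer] = np.mean(contributions, axis=0)

for layer, upd in sorted(global_update.items()):
    print(f"layer {layer}: averaged over {sum(layer in g for g in grads)} devices, "
          f"update norm {np.linalg.norm(upd):.3f}")
```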
arXiv Detail & Related papers (2024-03-27T09:14:36Z) - Efficient Diffusion Training via Min-SNR Weighting Strategy [78.5801305960993]
We treat the diffusion training as a multi-task learning problem and introduce a simple yet effective approach referred to as Min-SNR-$\gamma$.
Our results demonstrate a significant improvement in convergence speed, 3.4$\times$ faster than previous weighting strategies.
It is also more effective, achieving a new record FID score of 2.06 on the ImageNet $256\times256$ benchmark using smaller architectures than those employed in previous state-of-the-art methods.
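For reference, the clipped-SNR weight is simple to compute. The sketch below assumes a standard DDPM-style linear beta schedule and an illustrative $\gamma$; neither is taken from the paper.

```python
# Sketch of a clipped-SNR loss weighting of the Min-SNR kind:
# w_t = min(SNR(t), gamma), with SNR(t) = alpha_bar_t / (1 - alpha_bar_t).
# The linear beta schedule and gamma value below are illustrative assumptions.
import numpy as np

T, GAMMA = 1000, 5.0
betas = np.linspace(1e-4, 0.02, T)          # DDPM-style linear schedule (assumption)
alphas_bar = np.cumprod(1.0 - betas)
snr = alphas_bar / (1.0 - alphas_bar)

w_x0  = np.minimum(snr, GAMMA)              # weight when the network predicts x_0
w_eps = np.minimum(snr, GAMMA) / snr        # equivalent weight for an epsilon-prediction loss

for t in (0, 250, 500, 999):
    print(f"t={t:4d}  SNR={snr[t]:9.3f}  w_x0={w_x0[t]:.3f}  w_eps={w_eps[t]:.3f}")
```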
arXiv Detail & Related papers (2023-03-16T17:59:56Z) - Design and Prototyping Distributed CNN Inference Acceleration in Edge
Computing [85.74517957717363]
HALP accelerates inference by orchestrating seamless collaboration among edge devices (EDs) in edge computing.
Experiments show that distributed inference with HALP achieves 1.7x inference acceleration for VGG-16.
It is shown that model selection combined with HALP distributed inference can significantly improve service reliability.
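One way to picture such device collaboration is receptive-field-aware partitioning of the input, where each edge device receives a strip of the image plus enough halo rows to compute its share of the output independently. The sketch below is a generic partitioning illustration under assumed layer sizes, not the HALP system itself.

```python
# Hedged sketch of receptive-field-aware partitioning for collaborative CNN inference:
# each device takes a horizontal strip of the input plus a halo wide enough to compute
# its share of a 3x3, stride-1, same-padded conv layer on its own. Image height,
# device count, and kernel size are assumptions, not HALP's actual configuration.
H = 224                                   # input height (rows)
NUM_DEVICES = 4
KERNEL = 3
HALO = KERNEL // 2                        # extra rows of context needed on each side

rows_per_device = H // NUM_DEVICES
for d in range(NUM_DEVICES):
    out_start = d * rows_per_device
    out_end = H if d == NUM_DEVICES - 1 else (d + 1) * rows_per_device
    in_start = max(0, out_start - HALO)   # extend by the halo, clipped at the image border
    in_end = min(H, out_end + HALO)
    print(f"device {d}: input rows [{in_start}, {in_end}) -> output rows [{out_start}, {out_end})")
```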
arXiv Detail & Related papers (2022-11-24T19:48:30Z) - Distributed Adversarial Training to Robustify Deep Neural Networks at
Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach, known as adversarial training (AT), has been shown to improve model robustness.
We propose a large-batch adversarial training framework implemented over multiple machines.
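A minimal sketch of the overall loop, assuming a toy logistic-regression model and an in-process stand-in for the cross-machine gradient averaging (a real deployment would use an allreduce): each worker runs a few PGD steps to craft adversarial examples on its shard, and the resulting gradients are averaged before the model update.

```python
# Hedged sketch of distributed adversarial training on a toy linear model.
# The model, loss, PGD settings, and in-process "allreduce" are assumptions.
import numpy as np

rng = np.random.default_rng(0)
WORKERS, N_PER_WORKER, D = 4, 64, 10
EPS, PGD_STEPS, PGD_LR, LR = 0.1, 3, 0.05, 0.1

w = np.zeros(D)
shards = [(rng.normal(size=(N_PER_WORKER, D)), rng.choice([-1.0, 1.0], N_PER_WORKER))
          for _ in range(WORKERS)]

def grad_logistic(w, X, y):
    """Gradient of the mean logistic loss log(1 + exp(-y * Xw)) w.r.t. w."""
    coeff = -y / (1.0 + np.exp(y * (X @ w)))
    return (coeff[:, None] * X).mean(axis=0)

for step in range(50):
    worker_grads = []
    for X, y in shards:
        X_adv = X.copy()
        for _ in range(PGD_STEPS):                       # inner maximization: PGD on the inputs
            dL_dx = (-y / (1.0 + np.exp(y * (X_adv @ w))))[:, None] * w[None, :]
            X_adv = np.clip(X_adv + PGD_LR * np.sign(dL_dx), X - EPS, X + EPS)
        worker_grads.append(grad_logistic(w, X_adv, y))  # outer minimization gradient
    w -= LR * np.mean(worker_grads, axis=0)              # stand-in for an allreduce average

print("trained weight norm:", np.linalg.norm(w))
```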
arXiv Detail & Related papers (2022-06-13T15:39:43Z) - DBS: Dynamic Batch Size For Distributed Deep Neural Network Training [19.766163856388694]
We propose the Dynamic Batch Size (DBS) strategy for the distributed training of Deep Neural Networks (DNNs).
Specifically, the performance of each worker is first evaluated based on measurements from the previous epoch, and then the batch size and dataset partition are dynamically adjusted.
The experimental results indicate that the proposed strategy can fully utilize the performance of the cluster, reduce the training time, and remain robust to disturbance from irrelevant tasks.
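A minimal sketch of the re-partitioning step, with made-up throughput numbers standing in for the per-worker measurements the paper collects:

```python
# Hedged sketch of a dynamic-batch-size split: the global batch is re-divided in
# proportion to each worker's measured throughput, so faster workers get more samples.
GLOBAL_BATCH = 1024

def rebalance(throughputs, global_batch=GLOBAL_BATCH):
    """Split the global batch proportionally to per-worker samples/sec."""
    total = sum(throughputs)
    sizes = [max(1, round(global_batch * t / total)) for t in throughputs]
    sizes[-1] += global_batch - sum(sizes)      # fix rounding so sizes sum exactly
    return sizes

# e.g. worker 2 was slowed down by an interfering job during the last epoch (assumed numbers)
measured = [410.0, 395.0, 150.0, 402.0]         # samples/sec observed last epoch
print(rebalance(measured))                      # the slower worker receives a smaller local batch
```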
arXiv Detail & Related papers (2020-07-23T07:31:55Z) - Adaptive Periodic Averaging: A Practical Approach to Reducing
Communication in Distributed Learning [6.370766463380455]
We show that the optimal averaging period in terms of convergence and communication cost is not a constant, but instead varies over the course of the execution.
We propose a practical algorithm, named ADaptive Periodic parameter averaging SGD (ADPSGD), to achieve a smaller overall variance of model parameters.
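As a deliberately simplified illustration, the averaging period can be made a function of training progress rather than a constant; the linear-growth schedule below is an assumption for illustration and is not the schedule derived in the paper.

```python
# Hedged sketch of a non-constant model-averaging period: the interval between
# parameter-averaging rounds changes over the run instead of staying fixed.
def averaging_period(epoch, p_min=1, p_max=16, total_epochs=20):
    """Averaging period grows linearly from p_min to p_max over the run (assumed schedule)."""
    frac = min(1.0, epoch / max(1, total_epochs - 1))
    return int(round(p_min + frac * (p_max - p_min)))

for epoch in range(0, 20, 4):
    print(f"epoch {epoch:2d}: average local models every {averaging_period(epoch)} local steps")
```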
arXiv Detail & Related papers (2020-07-13T00:04:55Z) - DaSGD: Squeezing SGD Parallelization Performance in Distributed Training
Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagation while gradients are synchronized.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
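A small sketch of how a one-step delay lets the averaging run concurrently with the next forward/backward pass; the thread-based timing model and numbers are assumptions, not DaSGD's implementation.

```python
# Hedged sketch of hiding communication behind computation with delayed averaging:
# the averaging of step t's weights runs in a background thread while step t+1's
# "compute" proceeds, and its one-step-stale result is applied afterwards.
import time
from concurrent.futures import ThreadPoolExecutor

COMPUTE_SEC, COMM_SEC, STEPS = 0.05, 0.04, 5

def compute_gradients(step):
    time.sleep(COMPUTE_SEC)                     # stand-in for a forward/backward pass
    return f"grads@{step}"

def average_models(step):
    time.sleep(COMM_SEC)                        # stand-in for the averaging communication
    return f"avg@{step}"

start = time.time()
with ThreadPoolExecutor(max_workers=1) as comm:
    pending = comm.submit(average_models, -1)   # warm-up average already "in flight"
    for step in range(STEPS):
        grads = compute_gradients(step)         # overlaps with the pending average
        stale_avg = pending.result()            # ready by now if COMM_SEC <= COMPUTE_SEC
        print(f"step {step}: apply {stale_avg} then {grads}")
        pending = comm.submit(average_models, step)
    pending.result()                            # drain the last averaging round
elapsed = time.time() - start
print(f"wall time ~{elapsed:.2f}s vs {STEPS * (COMPUTE_SEC + COMM_SEC):.2f}s if serialized")
```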
arXiv Detail & Related papers (2020-05-31T05:43:50Z) - Distributed Training of Deep Neural Network Acoustic Models for
Automatic Speech Recognition [33.032361181388886]
We provide an overview of distributed training techniques for deep neural network acoustic models for ASR.
Experiments are carried out on a popular public benchmark to study the convergence, speedup and recognition performance of the investigated strategies.
arXiv Detail & Related papers (2020-02-24T19:31:50Z) - Improving Efficiency in Large-Scale Decentralized Distributed Training [58.80224380923698]
We propose techniques to accelerate (A)D-PSGD based training by improving the spectral gap while minimizing the communication cost.
We demonstrate the effectiveness of our proposed techniques by running experiments on the 2000-hour Switchboard speech recognition task and the ImageNet computer vision task.
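The spectral gap in question is a property of the mixing (averaging) matrix and controls how fast the averaging steps drive the learners toward consensus. A minimal sketch, assuming a symmetric ring where each learner averages equally with itself and its two neighbours, shows how the gap shrinks as the ring grows:

```python
# Spectral gap of a simple ring mixing matrix (1 minus the second-largest
# eigenvalue magnitude). The equal 1/3 weights are an illustrative assumption.
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix: 1/3 self, 1/3 each ring neighbour."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

for n in (4, 8, 16, 32):
    eigvals = np.sort(np.abs(np.linalg.eigvals(ring_mixing_matrix(n))))[::-1]
    print(f"{n:2d} learners on a ring: spectral gap = {1.0 - eigvals[1]:.4f}")
```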
arXiv Detail & Related papers (2020-02-04T04:29:09Z) - Elastic Consistency: A General Consistency Model for Distributed
Stochastic Gradient Descent [28.006781039853575]
A key element behind the progress of machine learning in recent years has been the ability to train machine learning models in large-scale distributed-memory environments.
In this paper, we introduce a general consistency model for the distributed SGD methods used in practice to train large-scale machine learning models.
Our framework, called elastic consistency, enables us to derive convergence bounds for a variety of distributed SGD methods.
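For orientation, a consistency condition of this kind is typically stated as a bound on how far each process's possibly stale view can drift from the reference iterate of the analysis. The display below is a hedged paraphrase of that shape; the norm, moment, and constants are chosen for illustration and may differ from the paper's exact definition.

```latex
% Hedged paraphrase of an elastic-consistency-style condition (not the paper's exact statement):
% process i evaluates its stochastic gradient at a possibly stale view \tilde{x}_t^{\,i},
% which must stay within a bounded distance of the reference iterate x_t.
\[
  \mathbb{E}\,\bigl\| x_t - \tilde{x}_t^{\,i} \bigr\| \;\le\; B
  \qquad \text{for every process } i \text{ and step } t,
\]
```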
arXiv Detail & Related papers (2020-01-16T16:10:58Z)