4-bit Quantization of LSTM-based Speech Recognition Models
- URL: http://arxiv.org/abs/2108.12074v1
- Date: Fri, 27 Aug 2021 00:59:52 GMT
- Title: 4-bit Quantization of LSTM-based Speech Recognition Models
- Authors: Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang,
Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang,
Zoltán Tüske, Kailash Gopalakrishnan
- Abstract summary: We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition.
We show that minimal accuracy loss is achievable with an appropriate choice of quantizers and initializations.
- Score: 40.614677908909705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the impact of aggressive low-precision representations of
weights and activations in two families of large LSTM-based architectures for
Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden
Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers
(RNN-Ts). Using a 4-bit integer representation, a naïve quantization approach
applied to the LSTM portion of these models results in significant Word Error
Rate (WER) degradation. On the other hand, we show that minimal accuracy loss
is achievable with an appropriate choice of quantizers and initializations. In
particular, we customize quantization schemes depending on the local properties
of the network, improving recognition performance while limiting computational
time. We demonstrate our solution on the Switchboard (SWB) and CallHome (CH)
test sets of the NIST Hub5-2000 evaluation. DBLSTM-HMMs trained with 300 or
2000 hours of SWB data achieve <0.5% and <1% average WER degradation,
respectively. On the more challenging RNN-T models, our quantization strategy
limits degradation in 4-bit inference to 1.3%.
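To make the contrast above concrete, here is a minimal sketch of the kind of naïve per-tensor 4-bit quantizer the abstract refers to; the function name and the symmetric max-scaled scheme are illustrative assumptions, not the authors' exact quantizers.

```python
import numpy as np

def fake_quantize_symmetric(w: np.ndarray, num_bits: int = 4) -> np.ndarray:
    """Quantize-dequantize a tensor with a naive symmetric uniform quantizer.

    The scale is set from the absolute maximum of the whole tensor, so a few
    outlier weights directly widen the quantization step for everything else.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 7 for 4-bit signed integers
    scale = np.abs(w).max() / qmax + 1e-12    # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)   # integer codes in [-8, 7]
    return q * scale                          # dequantized values for simulated inference

# Toy example: the mean absolute error grows with the spread of the weights.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(1024, 1024)).astype(np.float32)
w_q = fake_quantize_symmetric(w, num_bits=4)
print("mean |w - w_q|:", float(np.abs(w - w_q).mean()))
```

The paper's argument is that replacing such a one-size-fits-all scheme with quantizers and initializations matched to the local statistics of each part of the network recovers most of the lost accuracy.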
Related papers
- LSTM-QGAN: Scalable NISQ Generative Adversarial Network [3.596166341956192]
Current quantum generative adversarial networks (QGANs) still struggle with practical-sized data.
We propose LSTM-QGAN, a QGAN architecture that eliminates preprocessing and integrates quantum long short-term memory (QLSTM) to ensure scalable performance.
Our experiments show that LSTM-QGAN significantly enhances both performance and scalability over state-of-the-art QGAN models.
arXiv Detail & Related papers (2024-09-03T18:27:15Z)
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
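As a hedged illustration of one common way weight uncertainty is modelled in neural LMs (a mean-field variational, Bayes-by-backprop-style layer; the class name and the standard-normal prior are assumptions, not necessarily the paper's formulation):

```python
import torch

class BayesLinear(torch.nn.Module):
    """Linear layer with a mean-field Gaussian posterior over its weights."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(out_features, in_features))
        self.log_sigma = torch.nn.Parameter(torch.full((out_features, in_features), -5.0))

    def forward(self, x):
        # Reparameterization trick: sample one weight realization per forward pass.
        w = self.mu + self.log_sigma.exp() * torch.randn_like(self.mu)
        return torch.nn.functional.linear(x, w)

    def kl(self):
        # KL divergence of the posterior to a standard normal prior,
        # added to the training loss as a regularizer.
        sigma2 = (2 * self.log_sigma).exp()
        return 0.5 * (sigma2 + self.mu ** 2 - 1 - 2 * self.log_sigma).sum()

layer = BayesLinear(64, 32)
loss = layer(torch.randn(8, 64)).pow(2).mean() + 1e-3 * layer.kl()
loss.backward()
```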
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
- Low-bit Shift Network for End-to-End Spoken Language Understanding [7.851607739211987]
We propose the use of power-of-two quantization, which quantizes continuous parameters into low-bit power-of-two values.
This reduces computational complexity by removing expensive multiplication operations and by using low-bit weights.
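A minimal sketch of such a power-of-two weight quantizer, assuming a simple round-to-nearest-exponent rule (the function name and the exponent-clipping details are illustrative, not taken from the paper):

```python
import numpy as np

def power_of_two_quantize(w: np.ndarray, num_bits: int = 4) -> np.ndarray:
    """Round each weight to the nearest signed power of two.

    Once weights are powers of two, a multiply x * w reduces to a bit shift
    of x, which is where the computational savings come from.
    """
    sign = np.sign(w)
    exp = np.round(np.log2(np.abs(w) + 1e-12))            # nearest exponent
    num_levels = 2 ** (num_bits - 1) - 1                   # exponent codes, keeping one code for zero
    exp_max = exp.max()
    exp = np.clip(exp, exp_max - num_levels + 1, exp_max)
    w_q = sign * np.exp2(exp)
    w_q[np.abs(w) < 2.0 ** (exp_max - num_levels)] = 0.0   # tiny weights snap to zero
    return w_q

w = np.random.default_rng(1).normal(scale=0.05, size=(256, 256))
print(np.unique(np.abs(power_of_two_quantize(w)))[:8])     # only zero and powers of two
```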
arXiv Detail & Related papers (2022-07-15T14:34:22Z)
- Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization [35.198615417316056]
We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T).
We use a 4-bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model.
We show that customized quantization schemes that are tailored to the local properties of the network are essential to achieve good performance.
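A minimal PyTorch sketch of 4-bit quantization-aware training with a straight-through estimator and a per-tensor clipping value, which stands in for the per-tensor customization the summary mentions (the class names and the fixed clip value are illustrative assumptions, not the paper's exact scheme):

```python
import torch

class FakeQuant4bit(torch.autograd.Function):
    """Simulated 4-bit symmetric quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, x, clip_val):
        ctx.save_for_backward(x, clip_val)
        scale = clip_val / 7.0                         # 4-bit signed codes in [-8, 7]
        return torch.clamp(torch.round(x / scale), -8, 7) * scale

    @staticmethod
    def backward(ctx, grad_out):
        x, clip_val = ctx.saved_tensors
        # STE: pass gradients through inside the clipping range, zero them outside.
        inside = (x.abs() <= clip_val).to(grad_out.dtype)
        return grad_out * inside, None

class QuantLinear(torch.nn.Linear):
    """Linear layer whose weights are fake-quantized to 4 bits during training."""

    def __init__(self, in_features, out_features, clip_val=1.0):
        super().__init__(in_features, out_features)
        # The clip value models per-tensor customization, e.g. tighter clipping
        # for weight matrices with heavy-tailed distributions.
        self.register_buffer("clip_val", torch.tensor(clip_val))

    def forward(self, x):
        w_q = FakeQuant4bit.apply(self.weight, self.clip_val)
        return torch.nn.functional.linear(x, w_q, self.bias)

layer = QuantLinear(64, 32, clip_val=0.5)
loss = layer(torch.randn(8, 64)).pow(2).mean()
loss.backward()    # gradients reach the full-precision weights via the STE
```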
arXiv Detail & Related papers (2022-06-16T02:17:49Z)
- Improving Generalization of Deep Neural Network Acoustic Models with Length Perturbation and N-best Based Label Smoothing [49.82147684491619]
We introduce two techniques to improve generalization of deep neural network (DNN) acoustic models for automatic speech recognition (ASR).
Length perturbation is a data augmentation algorithm that randomly drops and inserts frames of an utterance to alter the length of the speech feature sequence.
N-best based label smoothing randomly injects noise into ground-truth labels during training to avoid overfitting; the noisy labels are generated from n-best hypotheses.
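A minimal sketch of length perturbation as described above, assuming independent per-frame drop and duplicate decisions (the probabilities and the duplicate-the-current-frame rule are illustrative choices, not the paper's exact algorithm):

```python
import numpy as np

def length_perturb(features: np.ndarray, drop_prob: float = 0.05,
                   insert_prob: float = 0.05, rng=None) -> np.ndarray:
    """Randomly drop and insert frames of an utterance (frames x feature_dim).

    Dropping shortens the feature sequence; inserting duplicates the current
    frame and lengthens it, so the acoustic model sees varied sequence lengths.
    """
    rng = rng or np.random.default_rng()
    out = []
    for frame in features:
        if rng.random() < drop_prob:
            continue                       # drop this frame
        out.append(frame)
        if rng.random() < insert_prob:
            out.append(frame)              # insert a duplicate of this frame
    return np.stack(out) if out else features

feats = np.random.default_rng(2).normal(size=(300, 40))   # 300 frames of 40-dim features
print(length_perturb(feats).shape)                        # sequence length varies per call
```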
arXiv Detail & Related papers (2022-03-29T01:40:22Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods use uniform precision and fail to account for the varying sensitivity of different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
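As a toy illustration of sensitivity-driven mixed-precision assignment (the greedy rule, the reconstruction-error proxy, and the bit budget below are assumptions for illustration, not the paper's method):

```python
import numpy as np

def quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantize-dequantize at a given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def assign_bitwidths(layers: dict, low_bits: int = 4, high_bits: int = 8,
                     budget: float = 6.0) -> dict:
    """Give low precision to the layers that tolerate it best.

    Sensitivity proxy: normalized weight reconstruction error at low_bits.
    Layers are moved to low_bits from least to most sensitive until the
    average bit-width meets the budget.
    """
    sens = {name: float(np.mean((w - quantize(w, low_bits)) ** 2) / np.var(w))
            for name, w in layers.items()}
    bits = {name: high_bits for name in layers}
    for name in sorted(layers, key=lambda n: sens[n]):
        bits[name] = low_bits
        if np.mean(list(bits.values())) <= budget:
            break
    return bits

rng = np.random.default_rng(3)
layers = {f"lstm_{i}": rng.normal(scale=0.1 * (i + 1), size=(128, 128)) for i in range(4)}
print(assign_bitwidths(layers))
```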
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
- Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
- Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks [61.76338096980383]
A range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs).
These include the DARTS method integrating architecture selection with lattice-free MMI (LF-MMI) TDNN training.
Experiments conducted on a 300-hour Switchboard corpus suggest the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems.
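For reference, the DARTS method mentioned above relaxes the discrete architecture choice into a softmax-weighted mixture of candidate operations whose mixing weights are learned by gradient descent; a minimal sketch (the candidate operations and class name are placeholders, not the paper's TDNN search space):

```python
import torch

class MixedOp(torch.nn.Module):
    """DARTS-style softmax-weighted mixture over candidate operations."""

    def __init__(self, candidates):
        super().__init__()
        self.ops = torch.nn.ModuleList(candidates)
        self.alpha = torch.nn.Parameter(torch.zeros(len(candidates)))  # architecture parameters

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Two placeholder candidates standing in for, e.g., different context widths.
mixed = MixedOp([torch.nn.Linear(16, 16),
                 torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                                     torch.nn.Linear(32, 16))])
y = mixed(torch.randn(4, 16))
y.sum().backward()   # gradients flow to both operation weights and architecture parameters
```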
arXiv Detail & Related papers (2020-07-17T08:32:11Z)