Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition
- URL: http://arxiv.org/abs/2112.11438v1
- Date: Mon, 29 Nov 2021 12:24:02 GMT
- Title: Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition
- Authors: Junhao Xu, Jianwei Yu, Shoukang Hu, Xunying Liu, Helen Meng
- Abstract summary: State-of-the-art language models (LMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
- Score: 67.95996816744251
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: State-of-the-art language models (LMs) represented by long short-term
memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming
increasingly complex and expensive for practical applications. Low-bit neural
network quantization provides a powerful solution to dramatically reduce their
model size. Current quantization methods are based on uniform precision and
fail to account for the varying performance sensitivity at different parts of
LMs to quantization errors. To this end, novel mixed precision neural network
LM quantization methods are proposed in this paper. The optimal local precision
choices for LSTM-RNN and Transformer based neural LMs are automatically learned
using three techniques. The first two approaches are based on quantization
sensitivity metrics in the form of either the KL-divergence measured between
full precision and quantized LMs, or Hessian trace weighted quantization
perturbation that can be approximated efficiently using matrix free techniques.
The third approach is based on mixed precision neural architecture search. To
overcome the difficulty of directly estimating discrete quantized weights with
gradient descent, the alternating direction method of multipliers (ADMM) is
used to efficiently train quantized LMs. Experiments
were conducted on state-of-the-art LF-MMI CNN-TDNN systems featuring speed
perturbation, i-Vector and learning hidden unit contribution (LHUC) based
speaker adaptation on two tasks: Switchboard telephone speech and AMI meeting
transcription. The proposed mixed precision quantization techniques achieved
"lossless" quantization on both tasks, by producing model size compression
ratios of up to approximately 16 times over the full precision LSTM and
Transformer baseline LMs, while incurring no statistically significant word
error rate increase.
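As a rough, hypothetical illustration of the KL-divergence based sensitivity metric described above, the PyTorch sketch below quantizes one weight tensor at a time to a candidate bit-width, measures the KL divergence between the output distributions of the full-precision LM and the locally quantized copy on a calibration batch, and keeps the lowest bit-width whose divergence stays under a tolerance. The symmetric uniform quantizer, the candidate bit-widths, the tolerance, and the assumption that the model maps a batch directly to vocabulary logits are all illustrative choices, not details taken from the paper.

```python
# Hypothetical sketch: rank weight tensors by quantization sensitivity using the
# KL divergence between full-precision and quantized LM output distributions,
# then assign each tensor the lowest bit-width it tolerates.
import copy

import torch
import torch.nn.functional as F


def uniform_quantize(w: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Symmetric uniform quantization of a weight tensor to n_bits."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp_min(1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale


@torch.no_grad()
def layer_kl_sensitivity(model, layer_name, n_bits, batch):
    """KL(full || quantized) on a calibration batch after quantizing one tensor."""
    ref = F.log_softmax(model(batch), dim=-1)          # full-precision log-probs
    quant_model = copy.deepcopy(model)
    param = dict(quant_model.named_parameters())[layer_name]
    param.copy_(uniform_quantize(param, n_bits))       # quantize only this tensor
    hyp = F.log_softmax(quant_model(batch), dim=-1)    # quantized log-probs
    return F.kl_div(hyp, ref, log_target=True, reduction="batchmean").item()


@torch.no_grad()
def assign_bit_widths(model, batch, candidate_bits=(2, 4, 8), tolerance=1e-3):
    """Pick, per weight tensor, the lowest bit-width whose KL stays under tolerance."""
    table = {}
    for name, _ in model.named_parameters():
        if not name.endswith("weight"):
            continue
        for n_bits in sorted(candidate_bits):
            if layer_kl_sensitivity(model, name, n_bits, batch) < tolerance:
                table[name] = n_bits
                break
        else:
            table[name] = max(candidate_bits)           # keep the highest candidate
    return table
```

Under this reading, the Hessian-trace weighted variant mentioned in the abstract would only change the scoring function, replacing the KL term with a curvature-weighted measure of the quantization perturbation, while the mixed precision architecture search approach would learn the precision choices jointly with training rather than by post-hoc ranking.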
Related papers
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z)
- Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search [7.392278887917975]
Mixed-precision quantization allows different tensors to be quantized to varying levels of numerical precision.
We evaluate our method for computer vision and natural language processing and demonstrate latency reductions of up to 27.59% and 34.31%, respectively.
arXiv Detail & Related papers (2023-02-02T19:30:00Z)
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
- LG-LSQ: Learned Gradient Linear Symmetric Quantization [3.6816597150770387]
Deep neural networks with lower precision weights have advantages in terms of memory cost and accelerator power.
The main challenge associated with the quantization algorithm is maintaining accuracy at low bit-widths.
We propose learned gradient linear symmetric quantization (LG-LSQ) as a method for quantizing weights and activation functions to low bit-widths.
arXiv Detail & Related papers (2022-02-18T03:38:12Z)
- Mixed Precision Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments were conducted on the Penn Treebank (PTB) corpus and a Switchboard-trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
- Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers [67.688697838109]
This paper presents a novel method to train quantized RNNLMs from scratch using alternating direction methods of multipliers (ADMM).
Experiments on two tasks suggest the proposed ADMM quantization achieved a model size compression factor of up to 31 times over the full precision baseline RNNLMs.
(A minimal sketch of one ADMM round appears after this list.)
arXiv Detail & Related papers (2021-11-29T09:30:06Z)
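As a companion to the ADMM-based training described in the abstract above and in the last related paper, here is a minimal, hypothetical sketch of one ADMM round under the usual augmented-Lagrangian splitting: the full-precision weights W are trained against the task loss plus a quadratic penalty tying them to quantized auxiliary variables Q, Q is then refreshed by projecting W + U onto the quantized set, and the scaled dual variables U accumulate the residual. The penalty weight rho, the bit-width, the step budget, and the quantizer are illustrative assumptions rather than the papers' settings.

```python
# Hypothetical sketch of one ADMM round for training a quantized LM.
import torch


def quantize_proj(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Projection of a tensor onto a symmetric uniform n_bits grid."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp_min(1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale


def admm_round(model, loss_fn, loader, Q, U, rho=1e-3, n_bits=4,
               w_steps=100, lr=1e-3):
    """One round: W-update by SGD, Q-update by projection, scaled dual update."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr)
    # 1) W-update: minimise task loss + (rho / 2) * sum_i ||W_i - Q_i + U_i||^2
    for step, (inputs, targets) in enumerate(loader):
        if step >= w_steps:
            break
        loss = loss_fn(model(inputs), targets)
        for p, q, u in zip(params, Q, U):
            loss = loss + 0.5 * rho * (p - q + u).pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        for i, (p, u) in enumerate(zip(params, U)):
            Q[i] = quantize_proj(p + u, n_bits)   # 2) Q-update: project W + U
            U[i] = u + p - Q[i]                   # 3) dual update: U <- U + W - Q
    return Q, U
```

In use, Q would typically be initialised as the projection of each trainable tensor (in the same order as model.parameters()) and U as zeros of the same shapes; after the final round the model weights would be replaced by Q so that inference runs on the low-bit values.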