Sub-8-bit quantization for on-device speech recognition: a
regularization-free approach
- URL: http://arxiv.org/abs/2210.09188v1
- Date: Mon, 17 Oct 2022 15:42:26 GMT
- Title: Sub-8-bit quantization for on-device speech recognition: a
regularization-free approach
- Authors: Kai Zhen, Martin Radfar, Hieu Duy Nguyen, Grant P. Strimel, Nathan
Susanj, Athanasios Mouchtaris
- Abstract summary: General Quantizer (GQ) is a regularization-free, "soft-to-hard" compression mechanism with self-adjustable centroids.
GQ can compress both RNN-T and Conformer into sub-8-bit, and for some RNN-T layers, to 1-bit for fast and accurate inference.
- Score: 19.84792318335999
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For on-device automatic speech recognition (ASR), quantization aware training
(QAT) is ubiquitous to achieve the trade-off between model predictive
performance and efficiency. Among existing QAT methods, one major drawback is
that the quantization centroids have to be predetermined and fixed. To overcome
this limitation, we introduce a regularization-free, "soft-to-hard" compression
mechanism with self-adjustable centroids in a mu-Law constrained space,
resulting in a simpler yet more versatile quantization scheme, called General
Quantizer (GQ). We apply GQ to ASR tasks using Recurrent Neural Network
Transducer (RNN-T) and Conformer architectures on both LibriSpeech and
de-identified far-field datasets. Without accuracy degradation, GQ can compress
both RNN-T and Conformer into sub-8-bit, and for some RNN-T layers, to 1-bit
for fast and accurate inference. We observe a 30.73% memory footprint saving
and 31.75% user-perceived latency reduction compared to 8-bit QAT via physical
device benchmarking.
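To make the abstract's "soft-to-hard" mechanism concrete, below is a minimal NumPy sketch of annealed soft quantization with self-adjustable centroids in a mu-law constrained space. The companding constant, centroid count, temperature schedule, and weighted-mean centroid update are illustrative assumptions, not the exact GQ formulation from the paper.

```python
# Minimal NumPy sketch of a "soft-to-hard" quantizer with self-adjustable
# centroids in a mu-law constrained space. MU, the centroid count, the
# temperature schedule, and the weighted-mean update are illustrative
# assumptions, not the exact GQ recipe.
import numpy as np

MU = 255.0  # mu-law companding constant (assumed)


def mu_law_compress(w, w_max):
    """Map weights into the mu-law constrained space [-1, 1]."""
    return np.sign(w) * np.log1p(MU * np.abs(w) / w_max) / np.log1p(MU)


def mu_law_expand(c, w_max):
    """Map companded values back to the original weight scale."""
    return np.sign(c) * (w_max / MU) * np.expm1(np.abs(c) * np.log1p(MU))


def soft_to_hard_quantize(weights, n_bits=4, n_steps=16):
    w_max = np.max(np.abs(weights)) + 1e-12
    x = mu_law_compress(weights, w_max)               # companded weights
    centroids = np.linspace(-1.0, 1.0, 2 ** n_bits)   # initial centroids

    for step in range(n_steps):
        beta = 2.0 ** step                            # anneal soft -> hard
        d2 = (x[:, None] - centroids[None, :]) ** 2
        d2 -= d2.min(axis=1, keepdims=True)           # numerical stability
        p = np.exp(-beta * d2)
        p /= p.sum(axis=1, keepdims=True)             # soft assignments
        # Self-adjust centroids toward the assignment-weighted means.
        mass = p.sum(axis=0)
        new_c = (p * x[:, None]).sum(axis=0) / np.maximum(mass, 1e-12)
        centroids = np.where(mass > 1e-6, new_c, centroids)

    # After annealing, assignments are effectively hard (nearest centroid).
    idx = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
    return mu_law_expand(centroids[idx], w_max)


w = np.random.randn(10000) * 0.1
w_q = soft_to_hard_quantize(w, n_bits=4)
print("mean abs quantization error:", np.mean(np.abs(w - w_q)))
```

As the temperature grows, the soft assignments collapse onto the nearest centroid, so training transitions from soft to hard quantization without an explicit regularization term.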
Related papers
- Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip [0.9187138676564589]
We present High Granularity Quantization (HGQ), an innovative quantization-aware training method.
HGQ fine-tunes the per-weight and per-activation precision by making them optimizable through gradient descent.
This approach enables ultra-low latency and low power neural networks on hardware capable of performing arithmetic operations.
arXiv Detail & Related papers (2024-05-01T17:18:46Z)
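As a rough illustration of making precision itself trainable, here is a toy PyTorch sketch in which each weight carries its own bitwidth parameter, rounded with a straight-through estimator and penalized in the loss. The parameterization and the bit penalty are assumptions for illustration, not HGQ's implementation.

```python
# Toy PyTorch sketch of gradient-optimizable per-weight precision using
# straight-through estimators (STE). The parameterization and the bit
# penalty are assumptions for illustration, not HGQ's implementation.
import torch


class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad):
        return grad  # straight-through: pass the gradient unchanged


def quantize(w, bits):
    """Fixed-point quantization of w with trainable (fractional) bitwidths."""
    b = RoundSTE.apply(bits).clamp(min=2.0)        # >= 2 bits, STE gradient
    scale = 2.0 ** (b - 1) - 1.0
    return RoundSTE.apply(w * scale) / scale


class HGQLinear(torch.nn.Module):
    """Linear layer whose per-weight bitwidths are trained by gradient descent."""

    def __init__(self, n_in, n_out, init_bits=8.0):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(n_out, n_in) * 0.1)
        self.bits = torch.nn.Parameter(torch.full((n_out, n_in), init_bits))

    def forward(self, x):
        return x @ quantize(self.weight, self.bits).t()

    def bit_penalty(self):
        return self.bits.clamp(min=2.0).sum()       # crude resource proxy


layer = HGQLinear(16, 4)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x, target = torch.randn(32, 16), torch.randn(32, 4)
for _ in range(100):
    loss = torch.nn.functional.mse_loss(layer(x), target)
    loss = loss + 1e-4 * layer.bit_penalty()        # accuracy vs. bits trade-off
    opt.zero_grad()
    loss.backward()
    opt.step()
print("learned bitwidths (min/mean):", layer.bits.min().item(), layer.bits.mean().item())
```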
- Weight Re-Mapping for Variational Quantum Algorithms [54.854986762287126]
We introduce the concept of weight re-mapping for variational quantum circuits (VQCs).
We employ seven distinct weight re-mapping functions to assess their impact on eight classification datasets.
Our results indicate that weight re-mapping can enhance the convergence speed of the VQC.
arXiv Detail & Related papers (2023-06-09T09:42:21Z)
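Weight re-mapping essentially passes unbounded trainable weights through a bounded function before they are used as rotation angles. The sketch below shows the idea with a few example mappings; these are assumptions and not necessarily the seven functions evaluated in the paper.

```python
# Illustrative sketch of weight re-mapping for variational quantum circuits:
# unbounded trainable weights are passed through a bounded function before
# being used as rotation angles. These example mappings are assumptions and
# not necessarily the seven functions evaluated in the paper.
import numpy as np

REMAPPINGS = {
    "identity": lambda w: w,
    "tanh": lambda w: np.pi * np.tanh(w),        # squashes into [-pi, pi]
    "arctan": lambda w: 2.0 * np.arctan(w),      # squashes into (-pi, pi)
    "clip": lambda w: np.clip(w, -np.pi, np.pi),
}


def circuit_angles(weights, remap="tanh"):
    """Turn raw trainable VQC weights into bounded rotation angles."""
    return REMAPPINGS[remap](np.asarray(weights, dtype=float))


raw = np.random.randn(8) * 3.0          # unbounded raw weights
print(circuit_angles(raw, "tanh"))      # bounded angles fed to e.g. RY gates
```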
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
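The lookup-table idea can be illustrated in a few lines: with 2-bit weights and 8-bit activations, every possible product fits in a small table, so the inner loop of a matrix multiply becomes gathers and additions. The codebook and table layout below are assumptions; DeepGEMM's actual SIMD kernels are considerably more involved.

```python
# Toy NumPy sketch of lookup-table (LUT) based low-precision matrix multiply:
# with 2-bit weights and 8-bit activations, every possible product can be
# precomputed once, so the inner loop becomes gathers and additions. The
# codebook below is assumed; DeepGEMM's SIMD kernels are more involved.
import numpy as np

W_LEVELS = np.array([-2, -1, 0, 1], dtype=np.int32)   # assumed 2-bit codebook


def build_lut(act_levels=256):
    """LUT[w_code, a] = weight_value * activation for all uint8 activations."""
    a = np.arange(act_levels, dtype=np.int32)
    return W_LEVELS[:, None] * a[None, :]               # shape (4, 256)


def lut_matvec(w_codes, acts, lut):
    """w_codes: (out, in) 2-bit weight codes; acts: (in,) uint8 activations."""
    return lut[w_codes, acts[None, :]].sum(axis=1)       # lookups, no multiplies


lut = build_lut()
w_codes = np.random.randint(0, 4, size=(8, 64))
acts = np.random.randint(0, 256, size=64)
reference = (W_LEVELS[w_codes] * acts[None, :]).sum(axis=1)
assert np.array_equal(lut_matvec(w_codes, acts, lut), reference)
print("LUT matvec matches the direct computation")
```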
- AutoQNN: An End-to-End Framework for Automatically Quantizing Neural Networks [6.495218751128902]
We propose an end-to-end framework named AutoQNN, for automatically quantizing different layers utilizing different schemes and bitwidths without any human labor.
QPL is the first method to learn mixed-precision policies by reparameterizing the bitwidths of quantizing schemes.
QAG is designed to convert arbitrary architectures into corresponding quantized ones without manual intervention.
arXiv Detail & Related papers (2023-04-07T11:14:21Z)
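One common way to reparameterize the bitwidth choice, in the spirit of QPL, is to keep a trainable softmax over candidate bitwidths and mix the correspondingly quantized weights, so the mixed-precision policy receives gradients. The sketch below is a generic illustration of that idea, not AutoQNN's code.

```python
# Toy PyTorch sketch of reparameterizing the bitwidth choice: a trainable
# softmax over candidate bitwidths mixes the correspondingly quantized
# weights, making the mixed-precision policy differentiable. This is a
# generic illustration, not AutoQNN's QPL/QAG code.
import torch

CANDIDATE_BITS = [2, 4, 8]


def uniform_quantize(w, bits):
    """Symmetric uniform quantization with a straight-through estimator."""
    scale = (2 ** (bits - 1) - 1) / w.abs().max().clamp(min=1e-8)
    w_q = torch.round(w * scale) / scale
    return w + (w_q - w).detach()           # STE: forward w_q, backward identity


class MixedPrecisionLinear(torch.nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(n_out, n_in) * 0.1)
        # One logit per candidate bitwidth, trained jointly with the weights.
        self.logits = torch.nn.Parameter(torch.zeros(len(CANDIDATE_BITS)))

    def forward(self, x):
        probs = torch.softmax(self.logits, dim=0)
        w_mix = sum(p * uniform_quantize(self.weight, b)
                    for p, b in zip(probs, CANDIDATE_BITS))
        return x @ w_mix.t()

    def chosen_bits(self):
        return CANDIDATE_BITS[int(torch.argmax(self.logits))]


layer = MixedPrecisionLinear(16, 4)
out = layer(torch.randn(32, 16))
print(out.shape, "current policy:", layer.chosen_bits(), "bits")
```

In a real search, a resource penalty (for example, expected bits times parameter count) would be added to the task loss so that the learned policy favors lower precision wherever accuracy allows.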
- Convolutional Neural Networks Quantization with Attention [1.0312968200748118]
We propose a double-stage Squeeze-and-Threshold (double-stage ST) method.
It uses the attention mechanism to quantize networks and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-09-30T08:48:31Z)
- Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition [19.949933989959682]
We present a novel sub-8-bit quantization-aware training (S8BQAT) scheme for 8-bit neural network accelerators.
We are able to increase the model parameter size to reduce the word error rate by 4-16% relative, while still improving latency by 5%.
arXiv Detail & Related papers (2022-06-30T16:52:07Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
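A simple way to act on varying quantization sensitivity is to probe each layer, measure how much low-bit quantization degrades the loss, and assign fewer bits to the least sensitive layers. The greedy sketch below illustrates that idea only; it is not the specific mixed-precision method proposed in the paper.

```python
# Toy sketch of sensitivity-aware mixed-precision allocation: quantize one
# layer at a time, measure how much the loss degrades, and give the least
# sensitive layers fewer bits. This greedy heuristic is a generic
# illustration, not the methods proposed in the paper.
import copy
import torch


def quantize_module_(module, bits):
    """In-place symmetric uniform quantization of a module's parameters."""
    with torch.no_grad():
        for p in module.parameters():
            scale = (2 ** (bits - 1) - 1) / p.abs().max().clamp(min=1e-8)
            p.copy_(torch.round(p * scale) / scale)


def allocate_bits(model, layer_names, eval_loss, low=4, high=8, n_low=None):
    """Return a {layer_name: bits} map; least sensitive layers get `low` bits."""
    base = eval_loss(model)
    sensitivity = {}
    for name in layer_names:
        probe = copy.deepcopy(model)
        quantize_module_(dict(probe.named_modules())[name], low)
        sensitivity[name] = eval_loss(probe) - base
    ranked = sorted(layer_names, key=lambda n: sensitivity[n])
    n_low = len(layer_names) // 2 if n_low is None else n_low
    return {name: (low if i < n_low else high) for i, name in enumerate(ranked)}


# Example usage on a tiny stand-in model: eval_loss would normally be
# perplexity on held-out data; here it is a fixed-batch cross-entropy.
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 10))
x, y = torch.randn(128, 16), torch.randint(0, 10, (128,))
loss_fn = lambda m: torch.nn.functional.cross_entropy(m(x), y).item()
print(allocate_bits(model, ["0", "2"], loss_fn))
```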
- Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with a CTC-like lattice achieves better results than standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z)
- Quantization-Guided Training for Compact TinyML Models [8.266286436571887]
We propose a Quantization Guided Training (QGT) method to guide DNN training towards optimized low-bit-precision targets.
QGT uses customized regularization to encourage weight values towards a distribution that maximizes accuracy while reducing quantization errors.
arXiv Detail & Related papers (2021-03-10T18:06:05Z)
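The regularization idea can be sketched directly: add a penalty on each weight's distance to its nearest quantization level, so the learned weight distribution becomes easy to quantize. The grid definition and penalty weight below are assumptions; QGT's customized regularizer may differ.

```python
# Minimal PyTorch sketch of quantization-guided regularization: penalize the
# distance of each weight to its nearest quantization level during training.
# The grid and penalty weight are assumptions, not QGT's exact regularizer.
import torch


def quantization_penalty(model, bits=4):
    """Mean squared distance from each weight to its nearest quantization level."""
    penalty = 0.0
    for p in model.parameters():
        with torch.no_grad():
            scale = (2 ** (bits - 1) - 1) / p.abs().max().clamp(min=1e-8)
            nearest = torch.round(p * scale) / scale   # nearest grid point
        penalty = penalty + ((p - nearest) ** 2).mean()
    return penalty


model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 4))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))
for _ in range(50):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss = loss + 0.1 * quantization_penalty(model, bits=4)
    opt.zero_grad()
    loss.backward()
    opt.step()
```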
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
- APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [49.3037538647714]
We present APQ for efficient deep learning inference on resource-constrained hardware.
Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner.
With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
arXiv Detail & Related papers (2020-06-15T16:09:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.