Sub-8-bit quantization for on-device speech recognition: a
regularization-free approach
- URL: http://arxiv.org/abs/2210.09188v1
- Date: Mon, 17 Oct 2022 15:42:26 GMT
- Title: Sub-8-bit quantization for on-device speech recognition: a
regularization-free approach
- Authors: Kai Zhen, Martin Radfar, Hieu Duy Nguyen, Grant P. Strimel, Nathan
Susanj, Athanasios Mouchtaris
- Abstract summary: General Quantizer (GQ) is a regularization-free, "soft-to-hard" compression mechanism with self-adjustable centroids.
GQ can compress both RNN-T and Conformer into sub-8-bit, and for some RNN-T layers, to 1-bit for fast and accurate inference.
- Score: 19.84792318335999
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For on-device automatic speech recognition (ASR), quantization aware training
(QAT) is ubiquitous to achieve the trade-off between model predictive
performance and efficiency. Among existing QAT methods, one major drawback is
that the quantization centroids have to be predetermined and fixed. To overcome
this limitation, we introduce a regularization-free, "soft-to-hard" compression
mechanism with self-adjustable centroids in a mu-Law constrained space,
resulting in a simpler yet more versatile quantization scheme, called General
Quantizer (GQ). We apply GQ to ASR tasks using Recurrent Neural Network
Transducer (RNN-T) and Conformer architectures on both LibriSpeech and
de-identified far-field datasets. Without accuracy degradation, GQ can compress
both RNN-T and Conformer into sub-8-bit, and for some RNN-T layers, to 1-bit
for fast and accurate inference. We observe a 30.73% memory footprint saving
and 31.75% user-perceived latency reduction compared to 8-bit QAT via physical
device benchmarking.
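To make the abstract's "soft-to-hard" mechanism concrete, below is a minimal NumPy sketch of annealed soft quantization with self-adjustable centroids in a mu-law constrained space. The companding constant, centroid count, temperature schedule, and weighted-mean centroid update are illustrative assumptions, not the exact GQ formulation from the paper.

```python
# Minimal NumPy sketch of a "soft-to-hard" quantizer with self-adjustable
# centroids in a mu-law constrained space. MU, the centroid count, the
# temperature schedule, and the weighted-mean update are illustrative
# assumptions, not the exact GQ recipe.
import numpy as np

MU = 255.0  # mu-law companding constant (assumed)


def mu_law_compress(w, w_max):
    """Map weights into the mu-law constrained space [-1, 1]."""
    return np.sign(w) * np.log1p(MU * np.abs(w) / w_max) / np.log1p(MU)


def mu_law_expand(c, w_max):
    """Map companded values back to the original weight scale."""
    return np.sign(c) * (w_max / MU) * np.expm1(np.abs(c) * np.log1p(MU))


def soft_to_hard_quantize(weights, n_bits=4, n_steps=16):
    w_max = np.max(np.abs(weights)) + 1e-12
    x = mu_law_compress(weights, w_max)               # companded weights
    centroids = np.linspace(-1.0, 1.0, 2 ** n_bits)   # initial centroids

    for step in range(n_steps):
        beta = 2.0 ** step                            # anneal soft -> hard
        d2 = (x[:, None] - centroids[None, :]) ** 2
        d2 -= d2.min(axis=1, keepdims=True)           # numerical stability
        p = np.exp(-beta * d2)
        p /= p.sum(axis=1, keepdims=True)             # soft assignments
        # Self-adjust centroids toward the assignment-weighted means.
        mass = p.sum(axis=0)
        new_c = (p * x[:, None]).sum(axis=0) / np.maximum(mass, 1e-12)
        centroids = np.where(mass > 1e-6, new_c, centroids)

    # After annealing, assignments are effectively hard (nearest centroid).
    idx = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
    return mu_law_expand(centroids[idx], w_max)


w = np.random.randn(10000) * 0.1
w_q = soft_to_hard_quantize(w, n_bits=4)
print("mean abs quantization error:", np.mean(np.abs(w - w_q)))
```

As the temperature grows, the soft assignments collapse onto the nearest centroid, so training transitions from soft to hard quantization without an explicit regularization term.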
Related papers
- Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip [0.9187138676564589]
We present High Granularity Quantization (HGQ), an innovative quantization-aware training method.
HGQ fine-tunes the per-weight and per-activation precision by making them optimizable through gradient descent.
This approach enables ultra-low latency and low power neural networks on hardware capable of performing arithmetic operations.
arXiv Detail & Related papers (2024-05-01T17:18:46Z)
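As a rough illustration of making precision itself trainable, here is a toy PyTorch sketch in which each weight carries its own bitwidth parameter, rounded with a straight-through estimator and penalized in the loss. The parameterization and the bit penalty are assumptions for illustration, not HGQ's implementation.

```python
# Toy PyTorch sketch of gradient-optimizable per-weight precision using
# straight-through estimators (STE). The parameterization and the bit
# penalty are assumptions for illustration, not HGQ's implementation.
import torch


class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad):
        return grad  # straight-through: pass the gradient unchanged


def quantize(w, bits):
    """Fixed-point quantization of w with trainable (fractional) bitwidths."""
    b = RoundSTE.apply(bits).clamp(min=2.0)        # >= 2 bits, STE gradient
    scale = 2.0 ** (b - 1) - 1.0
    return RoundSTE.apply(w * scale) / scale


class HGQLinear(torch.nn.Module):
    """Linear layer whose per-weight bitwidths are trained by gradient descent."""

    def __init__(self, n_in, n_out, init_bits=8.0):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(n_out, n_in) * 0.1)
        self.bits = torch.nn.Parameter(torch.full((n_out, n_in), init_bits))

    def forward(self, x):
        return x @ quantize(self.weight, self.bits).t()

    def bit_penalty(self):
        return self.bits.clamp(min=2.0).sum()       # crude resource proxy


layer = HGQLinear(16, 4)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x, target = torch.randn(32, 16), torch.randn(32, 4)
for _ in range(100):
    loss = torch.nn.functional.mse_loss(layer(x), target)
    loss = loss + 1e-4 * layer.bit_penalty()        # accuracy vs. bits trade-off
    opt.zero_grad()
    loss.backward()
    opt.step()
print("learned bitwidths (min/mean):", layer.bits.min().item(), layer.bits.mean().item())
```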
- Weight Re-Mapping for Variational Quantum Algorithms [54.854986762287126]
We introduce the concept of weight re-mapping for variational quantum circuits (VQCs).
We employ seven distinct weight re-mapping functions to assess their impact on eight classification datasets.
Our results indicate that weight re-mapping can enhance the convergence speed of the VQC.
arXiv Detail & Related papers (2023-06-09T09:42:21Z)
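Weight re-mapping essentially passes unbounded trainable weights through a bounded function before they are used as rotation angles. The sketch below shows the idea with a few example mappings; these are assumptions and not necessarily the seven functions evaluated in the paper.

```python
# Illustrative sketch of weight re-mapping for variational quantum circuits:
# unbounded trainable weights are passed through a bounded function before
# being used as rotation angles. These example mappings are assumptions and
# not necessarily the seven functions evaluated in the paper.
import numpy as np

REMAPPINGS = {
    "identity": lambda w: w,
    "tanh": lambda w: np.pi * np.tanh(w),        # squashes into [-pi, pi]
    "arctan": lambda w: 2.0 * np.arctan(w),      # squashes into (-pi, pi)
    "clip": lambda w: np.clip(w, -np.pi, np.pi),
}


def circuit_angles(weights, remap="tanh"):
    """Turn raw trainable VQC weights into bounded rotation angles."""
    return REMAPPINGS[remap](np.asarray(weights, dtype=float))


raw = np.random.randn(8) * 3.0          # unbounded raw weights
print(circuit_angles(raw, "tanh"))      # bounded angles fed to e.g. RY gates
```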
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
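The lookup-table idea can be illustrated in a few lines: with 2-bit weights and 8-bit activations, every possible product fits in a small table, so the inner loop of a matrix multiply becomes gathers and additions. The codebook and table layout below are assumptions; DeepGEMM's actual SIMD kernels are considerably more involved.

```python
# Toy NumPy sketch of lookup-table (LUT) based low-precision matrix multiply:
# with 2-bit weights and 8-bit activations, every possible product can be
# precomputed once, so the inner loop becomes gathers and additions. The
# codebook below is assumed; DeepGEMM's SIMD kernels are more involved.
import numpy as np

W_LEVELS = np.array([-2, -1, 0, 1], dtype=np.int32)   # assumed 2-bit codebook


def build_lut(act_levels=256):
    """LUT[w_code, a] = weight_value * activation for all uint8 activations."""
    a = np.arange(act_levels, dtype=np.int32)
    return W_LEVELS[:, None] * a[None, :]               # shape (4, 256)


def lut_matvec(w_codes, acts, lut):
    """w_codes: (out, in) 2-bit weight codes; acts: (in,) uint8 activations."""
    return lut[w_codes, acts[None, :]].sum(axis=1)       # lookups, no multiplies


lut = build_lut()
w_codes = np.random.randint(0, 4, size=(8, 64))
acts = np.random.randint(0, 256, size=64)
reference = (W_LEVELS[w_codes] * acts[None, :]).sum(axis=1)
assert np.array_equal(lut_matvec(w_codes, acts, lut), reference)
print("LUT matvec matches the direct computation")
```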
- AutoQNN: An End-to-End Framework for Automatically Quantizing Neural Networks [6.495218751128902]
We propose an end-to-end framework named AutoQNN, for automatically quantizing different layers utilizing different schemes and bitwidths without any human labor.
QPL is the first method to learn mixed-precision policies by reparameterizing the bitwidths of quantizing schemes.
QAG is designed to convert arbitrary architectures into corresponding quantized ones without manual intervention.
arXiv Detail & Related papers (2023-04-07T11:14:21Z)
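One common way to reparameterize the bitwidth choice, in the spirit of QPL, is to keep a trainable softmax over candidate bitwidths and mix the correspondingly quantized weights, so the mixed-precision policy receives gradients. The sketch below is a generic illustration of that idea, not AutoQNN's code.

```python
# Toy PyTorch sketch of reparameterizing the bitwidth choice: a trainable
# softmax over candidate bitwidths mixes the correspondingly quantized
# weights, making the mixed-precision policy differentiable. This is a
# generic illustration, not AutoQNN's QPL/QAG code.
import torch

CANDIDATE_BITS = [2, 4, 8]


def uniform_quantize(w, bits):
    """Symmetric uniform quantization with a straight-through estimator."""
    scale = (2 ** (bits - 1) - 1) / w.abs().max().clamp(min=1e-8)
    w_q = torch.round(w * scale) / scale
    return w + (w_q - w).detach()           # STE: forward w_q, backward identity


class MixedPrecisionLinear(torch.nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(n_out, n_in) * 0.1)
        # One logit per candidate bitwidth, trained jointly with the weights.
        self.logits = torch.nn.Parameter(torch.zeros(len(CANDIDATE_BITS)))

    def forward(self, x):
        probs = torch.softmax(self.logits, dim=0)
        w_mix = sum(p * uniform_quantize(self.weight, b)
                    for p, b in zip(probs, CANDIDATE_BITS))
        return x @ w_mix.t()

    def chosen_bits(self):
        return CANDIDATE_BITS[int(torch.argmax(self.logits))]


layer = MixedPrecisionLinear(16, 4)
out = layer(torch.randn(32, 16))
print(out.shape, "current policy:", layer.chosen_bits(), "bits")
```

In a real search, a resource penalty (for example, expected bits times parameter count) would be added to the task loss so that the learned policy favors lower precision wherever accuracy allows.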
- Convolutional Neural Networks Quantization with Attention [1.0312968200748118]
We propose a double-stage Squeeze-and-Threshold (double-stage ST) method.
It uses the attention mechanism to quantize networks and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-09-30T08:48:31Z)
- Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition [19.949933989959682]
We present a novel sub-8-bit quantization-aware training (S8BQAT) scheme for 8-bit neural network accelerators.
We are able to increase the model parameter size to reduce the word error rate by 4-16% relative, while still improving latency by 5%.
arXiv Detail & Related papers (2022-06-30T16:52:07Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
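A simple way to act on varying quantization sensitivity is to probe each layer, measure how much low-bit quantization degrades the loss, and assign fewer bits to the least sensitive layers. The greedy sketch below illustrates that idea only; it is not the specific mixed-precision method proposed in the paper.

```python
# Toy sketch of sensitivity-aware mixed-precision allocation: quantize one
# layer at a time, measure how much the loss degrades, and give the least
# sensitive layers fewer bits. This greedy heuristic is a generic
# illustration, not the methods proposed in the paper.
import copy
import torch


def quantize_module_(module, bits):
    """In-place symmetric uniform quantization of a module's parameters."""
    with torch.no_grad():
        for p in module.parameters():
            scale = (2 ** (bits - 1) - 1) / p.abs().max().clamp(min=1e-8)
            p.copy_(torch.round(p * scale) / scale)


def allocate_bits(model, layer_names, eval_loss, low=4, high=8, n_low=None):
    """Return a {layer_name: bits} map; least sensitive layers get `low` bits."""
    base = eval_loss(model)
    sensitivity = {}
    for name in layer_names:
        probe = copy.deepcopy(model)
        quantize_module_(dict(probe.named_modules())[name], low)
        sensitivity[name] = eval_loss(probe) - base
    ranked = sorted(layer_names, key=lambda n: sensitivity[n])
    n_low = len(layer_names) // 2 if n_low is None else n_low
    return {name: (low if i < n_low else high) for i, name in enumerate(ranked)}


# Example usage on a tiny stand-in model: eval_loss would normally be
# perplexity on held-out data; here it is a fixed-batch cross-entropy.
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 10))
x, y = torch.randn(128, 16), torch.randint(0, 10, (128,))
loss_fn = lambda m: torch.nn.functional.cross_entropy(m(x), y).item()
print(allocate_bits(model, ["0", "2"], loss_fn))
```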
- Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with a CTC-like lattice achieves better results than standard RNN-T.
arXiv Detail & Related papers (2021-11-01T21:51:42Z)
- Quantization-Guided Training for Compact TinyML Models [8.266286436571887]
We propose a Quantization Guided Training (QGT) method to guide DNN training towards optimized low-bit-precision targets.
QGT uses customized regularization to encourage weight values towards a distribution that maximizes accuracy while reducing quantization errors.
arXiv Detail & Related papers (2021-03-10T18:06:05Z)
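The regularization idea can be sketched directly: add a penalty on each weight's distance to its nearest quantization level, so the learned weight distribution becomes easy to quantize. The grid definition and penalty weight below are assumptions; QGT's customized regularizer may differ.

```python
# Minimal PyTorch sketch of quantization-guided regularization: penalize the
# distance of each weight to its nearest quantization level during training.
# The grid and penalty weight are assumptions, not QGT's exact regularizer.
import torch


def quantization_penalty(model, bits=4):
    """Mean squared distance from each weight to its nearest quantization level."""
    penalty = 0.0
    for p in model.parameters():
        with torch.no_grad():
            scale = (2 ** (bits - 1) - 1) / p.abs().max().clamp(min=1e-8)
            nearest = torch.round(p * scale) / scale   # nearest grid point
        penalty = penalty + ((p - nearest) ** 2).mean()
    return penalty


model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 4))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))
for _ in range(50):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss = loss + 0.1 * quantization_penalty(model, bits=4)
    opt.zero_grad()
    loss.backward()
    opt.step()
```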
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
- APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [49.3037538647714]
We present APQ for efficient deep learning inference on resource-constrained hardware.
Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner.
With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
arXiv Detail & Related papers (2020-06-15T16:09:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.