Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech
Recognition
- URL: http://arxiv.org/abs/2103.16827v1
- Date: Wed, 31 Mar 2021 06:05:40 GMT
- Title: Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech
Recognition
- Authors: Sehoon Kim, Amir Gholami, Zhewei Yao, Aniruddha Nrusimha, Bohan Zhai,
Tianren Gao, Michael W. Mahoney, Kurt Keutzer
- Abstract summary: We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
- Score: 65.7040645560855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end neural network models achieve improved performance on various
automatic speech recognition (ASR) tasks. However, these models perform poorly
on edge hardware due to large memory and computation requirements. While
quantizing model weights and/or activations to low-precision can be a promising
solution, previous research on quantizing ASR models is limited. Most
quantization approaches use floating-point arithmetic during inference; and
thus they cannot fully exploit integer processing units, which use less power
than their floating-point counterparts. Moreover, they require
training/validation data during quantization for finetuning or calibration;
however, this data may not be available due to security/privacy concerns. To
address these limitations, we propose Q-ASR, an integer-only, zero-shot
quantization scheme for ASR models. In particular, we generate synthetic data
whose runtime statistics resemble the real data, and we use it to calibrate
models during quantization. We then apply Q-ASR to quantize QuartzNet-15x5 and
JasperDR-10x5 without any training data, and we show negligible WER change as
compared to the full-precision baseline models. For INT8-only quantization, we
observe a very modest WER degradation of up to 0.29%, while we achieve up to
2.44x speedup on a T4 GPU. Furthermore, Q-ASR exhibits a large compression rate
of more than 4x with small WER degradation.
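The zero-shot part of the method rests on data-free calibration: synthetic inputs are optimized so that the statistics they induce at each BatchNorm layer match that layer's stored running statistics, and the resulting activation ranges then fix the integer-only quantization scales. Below is a minimal PyTorch-style sketch of this idea; the loss, optimizer settings, input shape, and symmetric INT8 scale rule are illustrative assumptions rather than Q-ASR's exact recipe.

```python
import torch

def synthesize_calibration_batch(model, input_shape, steps=500, lr=0.1):
    """Optimize random inputs so their BatchNorm batch statistics match the
    model's stored running statistics (data-free calibration sketch)."""
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)

    # Illustrative input shape, e.g. (batch, mel_bins, frames) for QuartzNet-style models.
    x = torch.randn(input_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)

    bn_layers = [m for m in model.modules()
                 if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d))]

    # Record the input of every BatchNorm layer with forward hooks.
    feats = {}
    hooks = [bn.register_forward_hook(
                 lambda mod, inp, out, key=i: feats.__setitem__(key, inp[0]))
             for i, bn in enumerate(bn_layers)]

    for _ in range(steps):
        opt.zero_grad()
        model(x)
        loss = x.new_zeros(())
        for i, bn in enumerate(bn_layers):
            # Per-channel batch statistics of the BatchNorm input.
            f = feats[i].transpose(0, 1).reshape(feats[i].size(1), -1)
            loss = loss + ((f.mean(dim=1) - bn.running_mean) ** 2).mean() \
                        + ((f.var(dim=1) - bn.running_var) ** 2).mean()
        loss.backward()
        opt.step()

    for h in hooks:
        h.remove()
    return x.detach()

def int8_scale(t):
    """Symmetric per-tensor INT8 scale from a calibrated activation range."""
    return t.abs().max().clamp(min=1e-8) / 127.0
```

A batch produced this way can be passed through the model once to record per-tensor ranges before deploying the fully integer pipeline.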
Related papers
- SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models [58.5019443418822]
Diffusion models have been proven highly effective at generating high-quality images.
As these models grow larger, they require significantly more memory and suffer from higher latency.
In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits.
arXiv Detail & Related papers (2024-11-07T18:59:58Z)
- MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization [16.83403134551842]
Recent few-step diffusion models reduce inference time by reducing the number of denoising steps.
Post-Training Quantization (PTQ) replaces high bit-width floating-point representations with low-bit integer values.
However, when applied to few-step diffusion models, existing quantization methods face challenges in preserving both image quality and text alignment.
arXiv Detail & Related papers (2024-05-28T06:50:58Z)
- EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models [21.17675493267517]
Post-training quantization (PTQ) and quantization-aware training (QAT) are two main approaches to compress and accelerate diffusion models.
We introduce a data-free and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to achieve QAT-level performance with PTQ-like efficiency.
Our method significantly outperforms previous PTQ-based diffusion models while maintaining similar time and data efficiency.
arXiv Detail & Related papers (2023-10-05T02:51:53Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. A minimal sketch of the dense-and-sparse idea appears after this list.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models [14.07649230604283]
We propose low-complexity changes to the quantization-aware training (QAT) process to improve model accuracy.
The improved accuracy opens up the possibility of exploiting some of the other benefits of noise-based QAT.
arXiv Detail & Related papers (2023-05-24T19:45:56Z)
- QLoRA: Efficient Finetuning of Quantized LLMs [66.58009990713134]
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU.
QLoRA backpropagates through a frozen, 4-bit quantized pretrained language model into Low-Rank Adapters (LoRA); a minimal sketch of this layer structure appears after this list.
Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark.
arXiv Detail & Related papers (2023-05-23T17:50:33Z)
- A High-Performance Adaptive Quantization Approach for Edge CNN Applications [0.225596179391365]
Recent convolutional neural network (CNN) development continues to advance the state-of-the-art model accuracy for various applications.
The enhanced accuracy comes at the cost of substantial memory bandwidth and storage requirements.
In this paper, we introduce an adaptive high-performance quantization method to resolve the issue of biased activation.
arXiv Detail & Related papers (2021-07-18T07:49:18Z)
- Pareto-Optimal Quantized ResNet Is Mostly 4-bit [3.83996783171716]
We use ResNet as a case study to investigate the effects of quantization on inference compute cost-quality tradeoff curves.
Our results suggest that for each bfloat16 ResNet model, there are quantized models with lower cost and higher accuracy.
We achieve state-of-the-art results on ImageNet for 4-bit ResNet-50 with quantization-aware training, obtaining a top-1 eval accuracy of 77.09%.
arXiv Detail & Related papers (2021-05-07T23:28:37Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework; a sketch of the dyadic rescaling it relies on appears after this list.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely low computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in the original full-precision network to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
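As referenced in the SqueezeLLM entry above, a Dense-and-Sparse decomposition keeps a small fraction of large-magnitude weights exactly in a sparse structure and quantizes the remaining dense part. The NumPy sketch below is a hedged illustration: the outlier fraction, the uniform 3-bit grid, and the helper names are assumptions, since SqueezeLLM itself uses sensitivity-based non-uniform codebooks.

```python
import numpy as np

def dense_and_sparse(W, outlier_frac=0.005, bits=3):
    """Split W into a low-bit dense part and an exact sparse outlier part."""
    k = max(1, int(outlier_frac * W.size))
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    outlier_mask = np.abs(W) >= thresh

    # Exact storage for the few outlier weights (COO-style index/value lists).
    out_idx = np.flatnonzero(outlier_mask.ravel())
    out_val = W.ravel()[out_idx].astype(np.float32)

    # Uniform low-bit quantization of the outlier-free dense remainder.
    dense = np.where(outlier_mask, 0.0, W)
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(dense).max() / qmax, 1e-12)
    q = np.clip(np.round(dense / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale, out_idx, out_val

def reconstruct(q, scale, out_idx, out_val, shape):
    """Dequantized dense part plus the exactly stored sparse outliers."""
    W_hat = (q.astype(np.float32) * scale).ravel()
    W_hat[out_idx] = out_val
    return W_hat.reshape(shape)
```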
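The QLoRA entry above describes backpropagating through a frozen, 4-bit quantized base model into low-rank adapters. The PyTorch sketch below shows one such linear layer under stated assumptions: a plain uniform 4-bit grid stands in for QLoRA's NF4 data type, and the rank, scaling, and class name are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedLoRALinear(nn.Module):
    """Frozen low-bit base weight plus trainable low-rank adapters (illustrative)."""
    def __init__(self, weight, rank=8, alpha=16, bits=4):
        super().__init__()
        qmax = 2 ** (bits - 1) - 1
        scale = (weight.abs().max() / qmax).clamp(min=1e-8)
        q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax).to(torch.int8)
        # Frozen quantized base: stored as an int8 buffer plus a scale, no gradients.
        self.register_buffer("q_weight", q)
        self.register_buffer("scale", scale)
        out_features, in_features = weight.shape
        # Trainable low-rank adapters: the only parameters that receive gradients.
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        w = self.q_weight.float() * self.scale             # dequantize on the fly
        base = F.linear(x, w)                              # no gradient reaches the base weight
        return base + F.linear(x, self.lora_b @ self.lora_a) * self.scaling
```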
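The HAWQV3 entry above is built around dyadic arithmetic: the floating-point rescaling between layers is approximated by a dyadic number b / 2**c, so inference needs only integer multiplies and bit shifts. The plain-Python sketch below is a hedged illustration; the 16-bit approximation depth and function names are assumptions.

```python
def to_dyadic(scale, shift_bits=16):
    """Approximate a real-valued scale as b / 2**shift_bits with integer b."""
    b = round(scale * (1 << shift_bits))
    return b, shift_bits

def requantize(int32_acc, scale_in, scale_w, scale_out):
    """Rescale an INT32 accumulator to the next layer's INT8 using only integers."""
    b, c = to_dyadic(scale_in * scale_w / scale_out)
    y = (int32_acc * b) >> c                 # integer multiply followed by a right shift
    return max(-128, min(127, y))            # saturate to the INT8 range

# Example with illustrative scales: true value 12345 * 0.001 = 12.345 -> 12.
print(requantize(12345, scale_in=0.05, scale_w=0.002, scale_out=0.1))
```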
This list is automatically generated from the titles and abstracts of the papers indexed on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.