RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models
- URL: http://arxiv.org/abs/2305.15536v1
- Date: Wed, 24 May 2023 19:45:56 GMT
- Title: RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models
- Authors: David Qiu, David Rim, Shaojin Ding, Oleg Rybakov, Yanzhang He
- Abstract summary: We propose low complexity changes to the quantization aware training (QAT) process to improve model accuracy.
The improved accuracy opens up the possibility of exploiting some of the other benefits of noise based QAT.
- Score: 14.07649230604283
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid increase in the size of neural networks, model compression has
become an important area of research. Quantization is an effective technique for
decreasing the model size, memory access, and compute load of large models.
Despite recent advances in quantization aware training (QAT) techniques, most
papers present evaluations that are focused on computer vision tasks, which
have different training dynamics compared to sequence tasks. In this paper, we
first benchmark the impact of popular techniques such as straight through
estimator, pseudo-quantization noise, learnable scale parameter, clipping, etc.
on 4-bit seq2seq models across a suite of speech recognition datasets ranging
from 1,000 hours to 1 million hours, as well as one machine translation dataset
to illustrate the applicability of these techniques outside of speech.
Through the experiments, we report that noise based QAT suffers when there is
insufficient regularization signal flowing back to the quantization scale. We
propose low complexity changes to the QAT process to improve model accuracy
(outperforming popular learnable scale and clipping methods). The improved
accuracy opens up the possibility of exploiting some of the other benefits of
noise based QAT: 1) training a single model that performs well in mixed
precision mode and 2) improved generalization on long form speech recognition.
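The abstract names the ingredients (pseudo-quantization noise, a quantization scale, and a norm-decay-style regularizer) without spelling out the method, so the sketch below only illustrates the noise-based QAT baseline the paper builds on plus a hypothetical norm-decay penalty; `pqn_fake_quant`, `norm_decay`, and the coefficient value are illustrative assumptions, not the paper's implementation.

```python
import torch

def pqn_fake_quant(w: torch.Tensor, scale: torch.Tensor, bits: int = 4,
                   training: bool = True) -> torch.Tensor:
    """Pseudo-quantization-noise QAT: during training, model the rounding error
    with additive uniform noise instead of a hard round, keeping gradients smooth."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    if training:
        noise = torch.rand_like(w) - 0.5              # U(-0.5, 0.5): one step of rounding error
        w_int = (w / scale + noise).clamp(qmin, qmax)
    else:
        w_int = torch.round(w / scale).clamp(qmin, qmax)
    return w_int * scale

# Hypothetical "norm decay" style regularizer (an assumption, not the paper's exact
# formulation): tie the weight norm to the quantization scale so a stronger
# regularization signal flows back to `scale` during noise-based QAT.
def norm_decay(w: torch.Tensor, scale: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    return coeff * w.norm() / (scale.abs() + 1e-8)
```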
Related papers
- GWQ: Gradient-Aware Weight Quantization for Large Language Models [61.17678373122165]
Gradient-aware weight quantization (GWQ) is the first low-bit weight quantization approach that leverages gradients to localize outliers.
GWQ preferentially retains the weights corresponding to the top 1% of outliers at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format.
In the zero-shot task, GWQ quantized models have higher accuracy compared to other quantization methods.
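A minimal sketch of the outlier-splitting idea described above, assuming a gradient-magnitude criterion for picking the top 1% of weights; GWQ's actual localization rule and storage format may differ.

```python
import torch

def outlier_aware_quantize(w: torch.Tensor, grad: torch.Tensor,
                           bits: int = 4, outlier_frac: float = 0.01) -> torch.Tensor:
    """Keep the ~1% of weights flagged as outliers in FP16 and quantize the rest
    to a low-bit grid (the gradient-magnitude criterion is an assumption)."""
    k = max(1, int(outlier_frac * w.numel()))
    idx = grad.abs().flatten().topk(k).indices
    mask = torch.zeros(w.numel(), dtype=torch.bool, device=w.device)
    mask[idx] = True
    mask = mask.view_as(w)

    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = w[~mask].abs().max() / qmax                  # per-tensor scale from non-outliers
    w_low = torch.round(w / scale).clamp(qmin, qmax) * scale
    return torch.where(mask, w.half().float(), w_low)    # FP16 outliers, low-bit elsewhere
```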
arXiv Detail & Related papers (2024-10-30T11:16:04Z)
- Resource-Efficient Speech Quality Prediction through Quantization Aware Training and Binary Activation Maps [4.002057316863807]
We investigate binary activation maps (BAMs) for speech quality prediction on a convolutional architecture based on DNSMOS.
We show that the binary activation model with quantization aware training matches the predictive performance of the baseline model.
Our approach results in a 25-fold memory reduction during inference, while replacing almost all dot products with summations.
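A minimal sketch of why binary activation maps turn dot products into summations, assuming a {0, 1} binarization; the paper's convolutional DNSMOS-based architecture and training details are not reproduced here.

```python
import torch

def binary_activation_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """With activations binarized to {0, 1}, each output of a dense layer reduces
    to a plain sum over the weights selected by the active units."""
    x_bin = (x > 0).to(weight.dtype)    # hard binarization of the activation map
    return x_bin @ weight.t()           # equivalent to summing the selected weight entries
```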
arXiv Detail & Related papers (2024-07-05T15:15:00Z)
- SQUAT: Stateful Quantization-Aware Training in Recurrent Spiking Neural Networks [1.0923877073891446]
Spiking neural networks (SNNs) share the goal of enhancing efficiency, but adopt an 'event-driven' approach to reduce the power consumption of neural network inference.
This paper introduces two QAT schemes for stateful neurons: (i) a uniform quantization strategy, an established method for weight quantization, and (ii) threshold-centered quantization.
Our results show that increasing the density of quantization levels around the firing threshold improves accuracy across several benchmark datasets.
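A small sketch of what a threshold-centered grid could look like; the cubic level spacing and the helper names are assumptions for illustration, not SQUAT's actual scheme.

```python
import torch

def threshold_centered_grid(theta: float, bits: int = 4, half_range: float = 2.0) -> torch.Tensor:
    """Build a non-uniform quantization grid whose levels are densest around the
    firing threshold `theta` (the cubic warping is an illustrative assumption)."""
    u = torch.linspace(-1.0, 1.0, 2 ** bits)
    return theta + half_range * u.pow(3)      # levels cluster near u == 0, i.e. near theta

def quantize_to_grid(v: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    """Snap membrane potentials (or weights) to the nearest grid level."""
    idx = (v.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return levels[idx]
```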
arXiv Detail & Related papers (2024-04-15T03:07:16Z)
- Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech [50.95292368372455]
We propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE).
The training of VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted.
We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training.
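A minimal sketch of the quality signal described above, assuming plain L2 distances to the nearest codebook entry; VQScore's exact distance measure and normalization are not given in this summary.

```python
import torch

def vq_quantization_error(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Mean distance between encoder frames and their nearest codebook entries.
    Because the VQ-VAE is trained on clean speech only, distorted inputs tend to
    fall farther from the codebook, so a larger error suggests lower quality."""
    # z: (num_frames, dim) encoder outputs; codebook: (num_codes, dim)
    dists = torch.cdist(z, codebook)          # pairwise L2 distances to every code
    return dists.min(dim=1).values.mean()     # average nearest-code distance
```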
arXiv Detail & Related papers (2024-02-26T06:01:38Z)
- Post-Training Quantization for Re-parameterization via Coarse & Fine Weight Splitting [13.270381125055275]
We propose a coarse & fine weight splitting (CFWS) method to reduce the quantization error of weights.
We develop an improved KL metric to determine optimal quantization scales for activation.
For example, the quantized RepVGG-A1 model exhibits a mere 0.3% accuracy loss.
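The summary does not define the split, so the sketch below shows one generic reading of a coarse & fine weight split (quantize, then quantize the residual with a finer scale); CFWS's actual rule and its KL-based activation calibration are not reproduced.

```python
import torch

def coarse_fine_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Generic two-pass split: quantize the weights with a coarse per-tensor scale,
    then quantize the residual with a finer scale and recombine the two parts.
    This is an illustrative guess, not CFWS's exact rule."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

    coarse_scale = w.abs().max() / qmax
    w_coarse = torch.round(w / coarse_scale).clamp(qmin, qmax) * coarse_scale

    residual = w - w_coarse
    fine_scale = residual.abs().max() / qmax + 1e-12
    w_fine = torch.round(residual / fine_scale).clamp(qmin, qmax) * fine_scale
    return w_coarse + w_fine
```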
arXiv Detail & Related papers (2023-12-17T02:31:20Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
In order to achieve a better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to achieve accuracy as good as models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
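A hypothetical reading of the DQInit idea above, assuming the queries are produced by pooling the input features and projecting them; the module name `DQInitSketch` and the pooling-plus-projection recipe are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class DQInitSketch(nn.Module):
    """Illustrative dynamic query initialization: derive decoder queries from the
    input features instead of using fixed learned embeddings (details assumed)."""
    def __init__(self, feat_dim: int, num_queries: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_queries * feat_dim)
        self.num_queries = num_queries

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, tokens, feat_dim) -> (batch, num_queries, feat_dim)
        pooled = feats.mean(dim=1)
        return self.proj(pooled).view(feats.size(0), self.num_queries, feats.size(-1))
```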
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
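A minimal sketch of the integer-only inference style Q-ASR targets: int8 operands, int32 accumulation, and a single requantization back to int8. The float requantization multiplier used here is a simplification for illustration, not Q-ASR's fixed-point scheme or its zero-shot calibration.

```python
import numpy as np

def integer_only_linear(x_q: np.ndarray, w_q: np.ndarray,
                        x_scale: float, w_scale: float, out_scale: float) -> np.ndarray:
    """Integer-only dense layer: int8 inputs/weights, int32 accumulator, then one
    requantization step back to int8."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T   # int32 accumulation
    multiplier = (x_scale * w_scale) / out_scale          # folded scales
    out = np.round(acc * multiplier)
    return np.clip(out, -128, 127).astype(np.int8)
```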
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
- BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction [29.040991149922615]
We study the challenging task of neural network quantization without end-to-end retraining, called post-training quantization (PTQ).
We propose a novel PTQ framework, dubbed BRECQ, which pushes the limits of bitwidth in PTQ down to INT2 for the first time.
For the first time, we prove that, without bells and whistles, PTQ can attain 4-bit ResNet and MobileNetV2 accuracy comparable with QAT, while producing quantized models 240 times faster.
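A minimal sketch of block-wise reconstruction in the spirit of BRECQ, assuming a plain MSE objective over a small calibration batch; the adaptive-rounding and Hessian-related refinements of the actual method are omitted.

```python
import torch
import torch.nn.functional as F

def reconstruct_block(fp_block: torch.nn.Module, q_block: torch.nn.Module,
                      calib_x: torch.Tensor, steps: int = 200, lr: float = 1e-3) -> None:
    """Tune only the quantized block's parameters so its output matches the
    full-precision block on calibration data."""
    with torch.no_grad():
        target = fp_block(calib_x)                        # full-precision reference output
    opt = torch.optim.Adam(q_block.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(q_block(calib_x), target)       # block output reconstruction error
        loss.backward()
        opt.step()
```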
arXiv Detail & Related papers (2021-02-10T13:46:16Z)
- KDLSQ-BERT: A Quantized Bert Combining Knowledge Distillation with Learned Step Size Quantization [1.9786767260073905]
Transformer-based language models such as BERT have shown tremendous performance improvements across a range of natural language processing tasks.
We propose a novel quantization method named KDLSQ-BERT that combines knowledge distillation (KD) with learned step size quantization (LSQ) for language model quantization.
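A minimal sketch of the two ingredients named above, learned step size quantization and knowledge distillation; the gradient scaling LSQ applies to its step parameter is omitted, and the loss weighting and temperature in `kd_loss` are assumptions, not KDLSQ-BERT's exact objective.

```python
import torch
import torch.nn.functional as F

class LSQQuantizer(torch.nn.Module):
    """Minimal learned-step-size quantizer: the step `s` is a trainable parameter
    and rounding passes gradients via a straight-through estimator."""
    def __init__(self, init_step: float = 0.1, bits: int = 8):
        super().__init__()
        self.s = torch.nn.Parameter(torch.tensor(init_step))
        self.qmin, self.qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_s = torch.clamp(x / self.s, self.qmin, self.qmax)
        x_q = (x_s.round() - x_s).detach() + x_s          # straight-through rounding
        return x_q * self.s

def kd_loss(student_logits, teacher_logits, labels, alpha: float = 0.5, T: float = 2.0):
    """Illustrative distillation objective to pair with LSQ (weights/temperature assumed)."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return (1.0 - alpha) * ce + alpha * kd
```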
arXiv Detail & Related papers (2021-01-15T02:21:28Z)
- Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of the two sides.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
Then a bit-inheritance scheme is introduced to transfer the quantized models to the lower bit, which further reduces the time cost and improves the quantization accuracy.
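A small sketch of one plausible bit-inheritance rule, assuming the lower-bit grid keeps the clipping range learned at the higher bit width and only widens the step; the paper's exact inheritance rule may differ.

```python
def inherit_step_size(step_high: float, bits_high: int, bits_low: int) -> float:
    """Stretch the step so the lower-bit grid spans the same clipping range as the
    higher-bit grid it inherits from (assumed rule for illustration)."""
    levels_high = 2 ** bits_high - 1
    levels_low = 2 ** bits_low - 1
    return step_high * levels_high / levels_low

# Example: an 8-bit step of 0.01 becomes a 4-bit step of 0.01 * 255 / 15 = 0.17.
print(inherit_step_size(0.01, bits_high=8, bits_low=4))
```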
arXiv Detail & Related papers (2020-10-09T03:52:16Z)
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [133.93803565077337]
Retrieval-augmented generation (RAG) models combine pre-trained parametric and non-parametric memory for language generation.
We show that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
arXiv Detail & Related papers (2020-05-22T21:34:34Z)