Oh! We Freeze: Improving Quantized Knowledge Distillation via Signal Propagation Analysis for Large Language Models
- URL: http://arxiv.org/abs/2403.18159v2
- Date: Thu, 28 Mar 2024 08:22:31 GMT
- Title: Oh! We Freeze: Improving Quantized Knowledge Distillation via Signal Propagation Analysis for Large Language Models
- Authors: Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Kyunggeun Lee, Jun Ma, Harris Teague
- Abstract summary: Large generative models such as large language models (LLMs) and diffusion models have revolutionized the fields of NLP and computer vision, respectively.
In this study, we propose a lightweight quantization-aware fine-tuning technique using knowledge distillation (KD-QAT) to improve the performance of 4-bit weight-quantized LLMs.
We show that ov-freeze achieves near-floating-point-precision performance, i.e., less than 0.7% accuracy loss on Commonsense Reasoning benchmarks.
- Score: 5.69541128149828
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large generative models such as large language models (LLMs) and diffusion models have revolutionized the fields of NLP and computer vision, respectively. However, their slow inference and high computation and memory requirements make them challenging to deploy on edge devices. In this study, we propose a lightweight quantization-aware fine-tuning technique using knowledge distillation (KD-QAT) to improve the performance of 4-bit weight-quantized LLMs using commonly available datasets, targeting a popular language use case: on-device chat applications. To improve this fine-tuning paradigm, as our main contributions, we provide insights into the stability of KD-QAT by empirically studying gradient propagation during training, to better understand the vulnerability of KD-QAT-based approaches to low-bit quantization errors. Based on these insights, we propose ov-freeze, a simple technique to stabilize the KD-QAT process. Finally, we experiment with the popular 7B LLaMAv2-Chat model at the 4-bit quantization level and demonstrate that ov-freeze achieves near-floating-point-precision performance, i.e., less than 0.7% accuracy loss on Commonsense Reasoning benchmarks.
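The abstract does not spell out which tensors ov-freeze holds fixed. Below is a minimal, hypothetical PyTorch sketch of KD-QAT in which the attention output (o) and value (v) projection weights are frozen while the rest of the 4-bit weight-quantized student is distilled against a full-precision teacher. The module names (`o_proj`, `v_proj`, following LLaMA naming conventions), the fake-quantization helper, and the training step are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of KD-QAT with "ov-freeze" (assumed: freeze o/v projection weights).
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize_weight(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric per-output-channel fake quantization with a straight-through estimator."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # forward: quantized values; backward: identity

class QuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized to 4 bits on every forward pass."""
    def forward(self, x):
        return F.linear(x, fake_quantize_weight(self.weight), self.bias)

def ov_freeze(student: nn.Module) -> None:
    """Freeze the attention output- and value-projection weights of the student."""
    for name, param in student.named_parameters():
        if "o_proj" in name or "v_proj" in name:
            param.requires_grad = False

def kd_qat_step(teacher, student, input_ids, optimizer, T: float = 1.0) -> float:
    """One KD-QAT step: match the quantized student's logits to the teacher's soft targets.
    Assumes both models map token ids directly to logits."""
    with torch.no_grad():
        t_logits = teacher(input_ids)
    s_logits = student(input_ids)
    loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the optimizer would be built from `filter(lambda p: p.requires_grad, student.parameters())` after calling `ov_freeze`, so the frozen projections receive neither gradient updates nor optimizer state.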
Related papers
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [52.31791050376249]
Quantization can accelerate large language model (LLM) inference.
Existing INT4 quantization methods suffer from significant runtime overhead when dequantizing weights or partial sums.
We introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache.
QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100 and 1.4x on L40S, and of Qwen1.5-72B by 2.4x on A100 and 3.5x on L40S.
arXiv Detail & Related papers (2024-05-07T17:59:30Z) - DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z) - L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models [5.304907804008533]
Post-training quantization (PTQ) is more commonly used in previous works than quantization-aware training (QAT).
By design, L4Q allows quantization parameters to reflect weight updates, while weight updates reduce quantization errors.
Our experiments demonstrate that this coupled quantization and fine-tuning approach yields superior accuracy compared to decoupled fine-tuning schemes in sub-4-bit quantization.
arXiv Detail & Related papers (2024-02-07T14:35:05Z) - Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z) - RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models [14.07649230604283]
We propose low complexity changes to the quantization aware training (QAT) process to improve model accuracy.
The improved accuracy opens up the possibility of exploiting some of the other benefits of noise-based QAT.
arXiv Detail & Related papers (2023-05-24T19:45:56Z) - Fixed-point quantization aware training for on-device keyword-spotting [4.4488246947396695]
We propose a novel method to train and obtain fixed-point (FXP) convolutional keyword-spotting (KWS) models.
We combine our methodology with two quantization-aware-training (QAT) techniques.
We demonstrate that we can reduce execution time by 68% without compromising the KWS model's predictive performance.
arXiv Detail & Related papers (2023-03-04T01:06:16Z) - Efficient training of lightweight neural networks using Online Self-Acquired Knowledge Distillation [51.66271681532262]
Online Self-Acquired Knowledge Distillation (OSAKD) is proposed, aiming to improve the performance of any deep neural model in an online manner.
We utilize a k-NN non-parametric density estimation technique to estimate the unknown probability distributions of the data samples in the output feature space.
arXiv Detail & Related papers (2021-08-26T14:01:04Z) - Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z) - KDLSQ-BERT: A Quantized Bert Combining Knowledge Distillation with Learned Step Size Quantization [1.9786767260073905]
Transformer-based language models such as BERT have shown tremendous performance improvements for a range of natural language processing tasks.
We propose a novel quantization method named KDLSQ-BERT that combines knowledge distillation (KD) with learned step size quantization (LSQ) for language model quantization; a minimal LSQ sketch is included after this list.
arXiv Detail & Related papers (2021-01-15T02:21:28Z) - DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer from a severe performance drop in ultra-low precision of 4 or lower bit-widths, or require a heavy fine-tuning process to recover the performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
arXiv Detail & Related papers (2020-12-21T10:19:42Z) - A Statistical Framework for Low-bitwidth Training of Deep Neural Networks [70.77754244060384]
Fully quantized training (FQT) uses low-bitwidth hardware by quantizing the activations, weights, and gradients of a neural network model.
One major challenge with FQT is the lack of theoretical understanding, in particular of how gradient quantization impacts convergence properties.
arXiv Detail & Related papers (2020-10-27T13:57:33Z)
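As noted in the KDLSQ-BERT entry above, learned step size quantization (LSQ) treats each quantizer's step size as a trainable parameter optimized jointly with the network weights. The following is a minimal, self-contained PyTorch illustration of weight-only LSQ following the published LSQ formulation; the 4-bit default, the layer wrapper, and the initialize-on-first-forward behavior are assumptions for illustration and are not taken from the KDLSQ-BERT code.

```python
# Hedged sketch of learned step size quantization (LSQ) for linear weights.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_scale(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Forward: x unchanged; backward: gradient multiplied by `scale`.
    return (x - x * scale).detach() + x * scale

def round_ste(x: torch.Tensor) -> torch.Tensor:
    # Forward: round(x); backward: identity (straight-through estimator).
    return (x.round() - x).detach() + x

class LSQQuantizer(nn.Module):
    """Quantize a weight tensor with a learnable step size."""
    def __init__(self, n_bits: int = 4):
        super().__init__()
        self.qn = -(2 ** (n_bits - 1))      # e.g. -8 for 4 bits
        self.qp = 2 ** (n_bits - 1) - 1     # e.g. +7 for 4 bits
        self.step = nn.Parameter(torch.tensor(1.0))
        self.initialized = False

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        if not self.initialized:
            # LSQ initialization: 2 * mean(|w|) / sqrt(qp)
            self.step.data.fill_(2 * w.abs().mean().item() / math.sqrt(self.qp))
            self.initialized = True
        g = 1.0 / math.sqrt(w.numel() * self.qp)  # gradient scale from the LSQ paper
        s = grad_scale(self.step, g)
        return round_ste(torch.clamp(w / s, self.qn, self.qp)) * s

class LSQLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized with a learned step size."""
    def __init__(self, in_features: int, out_features: int, bias: bool = True, n_bits: int = 4):
        super().__init__(in_features, out_features, bias)
        self.quantizer = LSQQuantizer(n_bits)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.quantizer(self.weight), self.bias)
```

In a KD-QAT setting such as KDLSQ-BERT, layers like this replace the model's linear layers, and the step sizes are trained alongside the weights under the distillation loss.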