Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization
- URL: http://arxiv.org/abs/2404.03605v2
- Date: Mon, 26 Aug 2024 20:48:19 GMT
- Title: Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization
- Authors: Aniruddha Nrusimha, Mayank Mishra, Naigang Wang, Dan Alistarh, Rameswar Panda, Yoon Kim
- Abstract summary: It is known that language models contain outlier channels whose values on average are orders of magnitude higher than other channels.
We propose a strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization.
We show that regularizing both the inputs and outputs is crucial for preventing a model from "migrating" the difficulty of input quantization to the weights.
- Score: 62.15918574997175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of accurate quantization for language models, where both the weights and activations are uniformly quantized to 4 bits per parameter, the lowest bitwidth format natively supported by GPU hardware. In this context, the key challenge is activation quantization: it is known that language models contain outlier channels whose values on average are orders of magnitude higher than other channels, which prevents accurate low-bitwidth quantization with known techniques. We systematically study this phenomenon and find that these outlier channels emerge early in training, and that they occur more frequently in layers with residual streams. We then propose a simple strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization. We show that regularizing both the inputs and outputs is crucial for preventing a model from "migrating" the difficulty of input quantization to the weights, which makes post-training quantization (PTQ) of weights more difficult. When combined with weight PTQ, we show that our approach can obtain a W4A4 model that performs competitively with the standard-precision W16A16 baseline.
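As a rough illustration of the output-side regularizer, the sketch below penalizes the kurtosis of a layer's activations during training. The target value (3.0, the kurtosis of a Gaussian), the 0.1 loss weight, and the function names are illustrative assumptions; the paper's exact formulation and placement may differ.

```python
import torch

def kurtosis(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Sample kurtosis E[(x - mu)^4] / (E[(x - mu)^2])^2 over the last dim."""
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False)
    return ((x - mu) ** 4).mean(dim=-1) / (var + eps) ** 2

def kurtosis_penalty(acts: torch.Tensor, target: float = 3.0) -> torch.Tensor:
    # Penalize deviation of each row's kurtosis from the target; heavy-tailed
    # outlier channels drive kurtosis far above the Gaussian value of 3.
    return ((kurtosis(acts) - target) ** 2).mean()

# Toy usage: add the penalty to the task loss during training.
acts = torch.randn(8, 512, requires_grad=True)
task_loss = acts.pow(2).mean()                     # stand-in for the LM loss
loss = task_loss + 0.1 * kurtosis_penalty(acts)    # 0.1 is an assumed weight
loss.backward()
```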
Related papers
- GWQ: Gradient-Aware Weight Quantization for Large Language Models [61.17678373122165]
Gradient-aware weight quantization (GWQ) is the first low-bit weight quantization approach that leverages gradients to localize outliers.
GWQ preferentially retains the weights corresponding to the top 1% of outliers at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format.
On zero-shot tasks, GWQ-quantized models achieve higher accuracy than other quantization methods.
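A minimal sketch of this mixed-precision idea, assuming gradient magnitude as the outlier score and symmetric absmax int4 for the non-outlier weights; GWQ's actual scoring and grouping may differ.

```python
import torch

def gwq_style_split(w: torch.Tensor, grad: torch.Tensor, keep_frac: float = 0.01):
    """Keep the top-1% of weights (ranked by gradient magnitude) in FP16
    and fake-quantize the rest to 4 bits."""
    score = grad.abs().flatten()
    k = max(1, int(keep_frac * score.numel()))
    mask = torch.zeros_like(score, dtype=torch.bool)
    mask[score.topk(k).indices] = True             # outlier positions
    mask = mask.view_as(w)

    # Symmetric 4-bit absmax quantization of the non-outlier weights.
    scale = w[~mask].abs().max() / 7.0             # int4 range: [-8, 7]
    w_deq = torch.clamp((w / scale).round(), -8, 7) * scale
    w_deq[mask] = w[mask]                          # restore FP16 outliers
    return w_deq, mask

w = torch.randn(256, 256)
g = torch.randn_like(w)
w_q, outlier_mask = gwq_style_split(w, g)
```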
arXiv Detail & Related papers (2024-10-30T11:16:04Z)
- PTQ4DiT: Post-training Quantization for Diffusion Transformers [52.902071948957186]
Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint.
We propose PTQ4DiT, a specifically designed PTQ method for DiTs.
We demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision while preserving comparable generation ability.
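The summary does not spell out PTQ4DiT's calibration scheme; as a generic illustration of post-training quantization, the sketch below picks a per-tensor int8 scale from a few calibration batches. Absmax calibration is an assumption here, not PTQ4DiT's method.

```python
import torch

def calibrate_int8_scale(calib_acts: list[torch.Tensor]) -> float:
    """Pick a per-tensor scale from calibration batches (simple absmax)."""
    absmax = max(a.abs().max().item() for a in calib_acts)
    return absmax / 127.0

def quantize_int8(x: torch.Tensor, scale: float) -> torch.Tensor:
    q = torch.clamp((x / scale).round(), -128, 127)
    return q * scale  # fake-quantize: return the dequantized value

calib = [torch.randn(4, 64) for _ in range(8)]
s = calibrate_int8_scale(calib)
x_q = quantize_int8(torch.randn(4, 64), s)
```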
arXiv Detail & Related papers (2024-05-25T02:02:08Z)
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers [38.23587031169402]
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values.
Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions.
Our method, for the first time, can quantize both weights and activations in LLaMA-13B to only 4 bits and achieves an average score of 63.1.
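A toy illustration of FP4 fake quantization, assuming one common E2M1 value grid; LLM-FP4 itself searches the exponent bias and format per layer/channel, so this shows only the basic idea.

```python
import torch

# A common E2M1 "FP4" grid: sign x {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_fake_quant(x: torch.Tensor) -> torch.Tensor:
    scale = x.abs().max() / FP4_GRID.max()   # map the tensor range onto the grid
    mag = (x / scale).abs()
    # Snap each magnitude to the nearest representable FP4 value.
    idx = (mag.unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return torch.sign(x) * FP4_GRID[idx] * scale

x = torch.randn(16, 16)
x_q = fp4_fake_quant(x)
```

Unlike integer grids, the FP4 grid spaces its levels non-uniformly, which is why the abstract notes that FP quantization better fits long-tail distributions.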
arXiv Detail & Related papers (2023-10-25T17:59:32Z)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
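A sketch of the hybrid numerics, with assumed absmax int4 fake quantization and a hand-picked set of outlier columns; real QUIK relies on fused GPU kernels and a more careful outlier selection.

```python
import torch

def quik_style_linear(x, w, outlier_cols):
    """Int4 matmul for most feature columns, full precision for the few
    outlier columns; this mimics the numerics, not the kernels."""
    base = torch.ones(x.shape[-1], dtype=torch.bool)
    base[outlier_cols] = False

    def int4_fake(t):
        s = t.abs().max() / 7.0
        return torch.clamp((t / s).round(), -8, 7) * s

    y = int4_fake(x[:, base]) @ int4_fake(w[base, :])   # 4-bit path
    y += x[:, outlier_cols] @ w[outlier_cols, :]        # full-precision path
    return y

x, w = torch.randn(4, 64), torch.randn(64, 64)
cols = torch.topk(x.abs().mean(0), 4).indices  # pick 4 "outlier" columns
y = quik_style_linear(x, w, cols)
```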
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
- Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models [7.485068491216164]
Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks.
Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outliers.
We propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel.
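A sketch of per-input-channel (per-IC) grouping under an assumed group layout: scales are computed within each column of the weight matrix, so a salient input channel cannot inflate the quantization scale of its neighbors.

```python
import torch

def per_ic_quantize(w: torch.Tensor, group_size: int = 128, bits: int = 4):
    """Fake-quantize W[out, in] with groups laid out within each input
    channel (column), spanning output features."""
    qmax = 2 ** (bits - 1) - 1
    out_f, in_f = w.shape
    # Treat each column (input channel) as the grouping axis.
    wg = w.T.reshape(in_f, out_f // group_size, group_size)
    scale = wg.abs().amax(dim=-1, keepdim=True) / qmax
    q = torch.clamp((wg / scale).round(), -qmax - 1, qmax)
    return (q * scale).reshape(in_f, out_f).T

w = torch.randn(256, 512)
w_q = per_ic_quantize(w)
```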
arXiv Detail & Related papers (2023-09-27T09:48:31Z)
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes low-bit LLM quantization more practical for real-world applications.
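A sketch of the norm-tweaking idea under simple assumptions (a toy block, MSE matching loss, Adam): after quantization, only the normalization parameters are tuned so the quantized block tracks the float block on calibration data.

```python
import torch
import torch.nn as nn

float_block = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 64))
quant_block = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 64))
quant_block.load_state_dict(float_block.state_dict())
with torch.no_grad():  # crude stand-in for weight-quantization error
    quant_block[1].weight.add_(0.01 * torch.randn_like(quant_block[1].weight))

for p in quant_block[1].parameters():
    p.requires_grad_(False)                 # weights stay quantized/frozen
opt = torch.optim.Adam(quant_block[0].parameters(), lr=1e-3)

for _ in range(100):
    x = torch.randn(32, 64)                 # calibration batch
    loss = (quant_block(x) - float_block(x).detach()).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```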
arXiv Detail & Related papers (2023-09-06T06:51:15Z)
- FPTQ: Fine-grained Post-Training Quantization for Large Language Models [28.11564378745513]
We propose a novel W4A8 post-training quantization method for available open-source LLMs.
We obtain the state-of-the-art W4A8 quantized performance on BLOOM, LLaMA, and LLaMA-2 on standard benchmarks.
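A minimal W4A8 sketch with assumed granularity (per-token 8-bit activations, per-output-channel 4-bit weights); FPTQ's actual scheme is finer-grained than this.

```python
import torch

def fake_quant(t, bits, dim=None):
    """Symmetric fake quantization, with per-`dim` scales if dim is given."""
    qmax = 2 ** (bits - 1) - 1
    amax = t.abs().amax(dim=dim, keepdim=True) if dim is not None else t.abs().max()
    scale = amax / qmax
    return torch.clamp((t / scale).round(), -qmax - 1, qmax) * scale

def w4a8_linear(x, w):
    # Per-token (row) 8-bit activations, per-output-channel 4-bit weights.
    return fake_quant(x, bits=8, dim=-1) @ fake_quant(w, bits=4, dim=0)

x, w = torch.randn(4, 64), torch.randn(64, 64)
y = w4a8_linear(x, w)
```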
arXiv Detail & Related papers (2023-08-30T12:18:18Z)
- Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling [44.60348333479704]
Post-training quantization of transformer language models faces challenges due to the existence of detrimental outliers in activations.
We propose the Outlier Suppression+ (OS+) framework, which performs channel-wise shifting to handle asymmetry and channel-wise scaling to handle the concentration of outliers.
We show that these operations can be seamlessly migrated into subsequent modules while maintaining equivalence.
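A sketch of the shift-and-scale idea with simple statistics standing in for OS+'s optimal choices; folding the inverse transform into the next Linear keeps the network output unchanged, as the assert checks.

```python
import torch
import torch.nn as nn

def os_plus_style_fold(linear: nn.Linear, x_calib: torch.Tensor):
    """Shift each channel to be symmetric and scale down outlier channels,
    then absorb the inverse transform into the next Linear for equivalence."""
    z = (x_calib.amax(0) + x_calib.amin(0)) / 2         # channel-wise shift
    s = (x_calib - z).abs().amax(0).clamp(min=1e-5)     # channel-wise scale
    with torch.no_grad():
        linear.bias += linear.weight @ z                # absorb the shift
        linear.weight *= s                              # absorb the scale
    return lambda x: (x - z) / s                        # now easy to quantize

lin = nn.Linear(64, 64)
xc = torch.randn(128, 64)
y_before = lin(xc[:1])                                  # output before folding
f = os_plus_style_fold(lin, xc)
assert torch.allclose(lin(f(xc[:1])), y_before, atol=1e-4)  # equivalence holds
```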
arXiv Detail & Related papers (2023-04-18T17:34:23Z)
- Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
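The entry does not detail Q-ASR's scheme; below is a generic integer-only linear layer (gemmlowp-style fixed-point requantization) to illustrate what "integer-only" means. The scales and the shift amount are illustrative assumptions, and Q-ASR's zero-shot calibration-data generation is omitted.

```python
import torch

def integer_only_linear(x_q, w_q, s_x, s_w, s_y):
    """Int8 inputs, wide integer accumulation, then a single fixed-point
    rescale to the output scale -- no floating point at inference time."""
    acc = x_q.to(torch.int64) @ w_q.to(torch.int64)   # wide accumulator
    m = s_x * s_w / s_y                               # real-valued multiplier
    shift = 15                                        # dyadic approximation of m
    m_int = round(m * (1 << shift))
    y = (acc * m_int) >> shift                        # integer multiply + shift
    return torch.clamp(y, -128, 127).to(torch.int8)

x_q = torch.randint(-128, 128, (4, 64), dtype=torch.int8)
w_q = torch.randint(-128, 128, (64, 64), dtype=torch.int8)
y_q = integer_only_linear(x_q, w_q, 0.02, 0.01, 0.05)
```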
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
- Direct Quantization for Training Highly Accurate Low Bit-width Deep Neural Networks [73.29587731448345]
This paper proposes two novel techniques to train deep convolutional neural networks with low bit-width weights and activations.
First, to obtain low bit-width weights, most existing methods derive the quantized weights by quantizing the full-precision network weights.
Second, to obtain low bit-width activations, existing works treat all channels equally.
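For context, here is a sketch of the standard straight-through-estimator (STE) baseline that such existing methods build on: quantize the full-precision weights in the forward pass, pass the gradient through unchanged in the backward pass. The paper's "direct" scheme is an alternative to this, not shown here.

```python
import torch

class STEQuant(torch.autograd.Function):
    """Straight-through estimator for symmetric low-bit weight quantization."""
    @staticmethod
    def forward(ctx, w, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.clamp((w / scale).round(), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None        # identity gradient w.r.t. w

w = torch.randn(64, 64, requires_grad=True)
y = STEQuant.apply(w, 4).sum()
y.backward()                          # w.grad is all ones, via the STE
```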
arXiv Detail & Related papers (2020-12-26T15:21:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.