Understanding the Impact of Post-Training Quantization on Large Language
Models
- URL: http://arxiv.org/abs/2309.05210v3
- Date: Sun, 17 Sep 2023 05:38:46 GMT
- Title: Understanding the Impact of Post-Training Quantization on Large Language
Models
- Authors: Somnath Roy
- Abstract summary: The study identifies nf4 as displaying greater resilience to temperature variations for the Llama2 series of models at lower temperatures.
Int8 quantization is associated with significantly slower inference speeds, whereas unquantized bfloat16 models consistently yield the fastest inference speeds across models of all sizes.
- Score: 0.38073142980732994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are rapidly increasing in size, with the number
of parameters becoming a key factor in the success of many commercial models,
such as ChatGPT, Claude, and Bard. Even the recently released publicly
accessible models for commercial usage, such as Falcon and Llama2, come
equipped with billions of parameters. This significant increase in the number
of parameters makes deployment and operation very costly. The remarkable
progress in the field of quantization for large neural networks in general and
LLMs in particular, has made these models more accessible by enabling them to
be deployed on consumer-grade GPUs. Quantized models generally demonstrate
comparable performance levels to their unquantized base counterparts.
Nonetheless, there exists a notable gap in our comprehensive understanding of
how these quantized models respond to hyperparameters, such as temperature, max
new tokens, and top-k, particularly for next-word prediction. The present
analysis reveals that nf4 and fp4 are equally proficient 4-bit quantization
techniques, characterized by similar attributes such as inference speed, memory
consumption, and the quality of generated content. The study identifies nf4 as
displaying greater resilience to temperature variations in the case of the
Llama2 series of models at lower temperatures, while fp4 and fp4-dq prove to be
a more suitable choice for the Falcon series of models. It is noteworthy that, in
general, 4-bit quantized models of varying sizes exhibit higher sensitivity to
temperature in the range of 0.5 to 0.8, unlike their unquantized counterparts.
Additionally, int8 quantization is associated with significantly slower
inference speeds, whereas unquantized bfloat16 models consistently yield the
fastest inference speeds across models of all sizes.
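A minimal sketch of the setup the abstract describes, assuming the Hugging Face transformers and bitsandbytes libraries rather than the authors' released code: it loads a checkpoint in nf4 (or fp4) 4-bit precision and samples next tokens under the hyperparameters studied here (temperature, top-k, max new tokens). The model name and hyperparameter values are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; any Llama2 or Falcon model fits the study

# 4-bit quantization config: switch bnb_4bit_quant_type between "nf4" and "fp4";
# setting bnb_4bit_use_double_quant=True gives the nf4-dq / fp4-dq variants.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,   # omit for the unquantized bfloat16 baseline,
    torch_dtype=torch.bfloat16,       # or use BitsAndBytesConfig(load_in_8bit=True) for int8
    device_map="auto",
)

# Next-word prediction with the hyperparameters varied in the study.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,      # within the 0.5-0.8 range where 4-bit models are reported to be most sensitive
    top_k=50,
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Swapping the quantization config (nf4, fp4, their -dq variants, int8, or none) while holding the prompt and sampling hyperparameters fixed is the kind of comparison the paper reports.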
Related papers
- Matryoshka Quantization [19.46665026740268]
Quantizing model weights is critical for reducing the communication and inference costs of large models.
Matryoshka Quantization (MatQuant) is a novel multi-scale quantization technique that addresses the challenge of needing multiple quantized models.
MatQuant can be up to 10% more accurate than standard int2 quantization.
arXiv Detail & Related papers (2025-02-10T18:59:10Z)
- ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization [58.84018707089315]
We present a unified framework for rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings.
We show that ternary, 2-bit, and 3-bit quantization maintains comparable performance in the size-accuracy trade-off.
Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.
arXiv Detail & Related papers (2025-02-04T18:59:26Z)
- FP=xINT: A Low-Bit Series Expansion Algorithm for Post-Training Quantization [3.560046736432574]
Post-Training Quantization (PTQ) converts pre-trained Full-Precision (FP) models into quantized versions without training.
Existing methods significantly degrade performance and quantization efficiency at extremely low bit settings due to quantization noise.
We introduce a deep model series expansion framework to address this issue, enabling rapid and accurate approximation of unquantized models without calibration sets or fine-tuning.
arXiv Detail & Related papers (2024-12-09T08:50:28Z)
- QLoRA: Efficient Finetuning of Quantized LLMs [66.58009990713134]
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU.
QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA); a minimal sketch of this pattern appears after the list below.
Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark.
arXiv Detail & Related papers (2023-05-23T17:50:33Z)
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
- Characterizing and Understanding the Behavior of Quantized Models for Reliable Deployment [32.01355605506855]
Quantization-aware training can produce more stable models than standard, adversarial, and Mixup training.
Disagreements often have closer top-1 and top-2 output probabilities, and Margin is a better indicator than the other uncertainty metrics for distinguishing disagreements.
We open-source our code and models as a new benchmark for further study of quantized models.
arXiv Detail & Related papers (2022-04-08T11:19:16Z)
- Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator.
We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
arXiv Detail & Related papers (2021-04-20T14:14:03Z)
- Quaternion Factorization Machines: A Lightweight Solution to Intricate Feature Interaction Modelling [76.89779231460193]
The factorization machine (FM) is capable of automatically learning high-order interactions among features to make predictions without the need for manual feature engineering.
We propose the quaternion factorization machine (QFM) and quaternion neural factorization machine (QNFM) for sparse predictive analytics.
arXiv Detail & Related papers (2021-04-05T00:02:36Z) - Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech
Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
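As a companion to the QLoRA entry above, here is a rough sketch of the pattern its summary describes, assuming the peft and bitsandbytes libraries rather than the authors' released code: LoRA adapters are attached to a frozen, nf4-quantized base model so that only the low-rank matrices receive gradient updates. The model name, rank, and target modules are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Frozen 4-bit (nf4, double-quantized) base model: the memory-saving half of the recipe.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",                 # assumed base checkpoint
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)    # freeze quantized weights, prepare for training

# Trainable low-rank adapters; gradients flow through the frozen 4-bit weights into these.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],        # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()              # only the adapter parameters are updated

Keeping the base model in 4 bits during both the forward and backward pass is what lets QLoRA report finetuning a 65B-parameter model on a single 48GB GPU.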