The case for 4-bit precision: k-bit Inference Scaling Laws
- URL: http://arxiv.org/abs/2212.09720v1
- Date: Mon, 19 Dec 2022 18:48:33 GMT
- Title: The case for 4-bit precision: k-bit Inference Scaling Laws
- Authors: Tim Dettmers, Luke Zettlemoyer
- Abstract summary: Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
- Score: 75.4335600212427
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantization methods reduce the number of bits required to represent each
parameter in a model, trading accuracy for smaller memory footprints and
inference latencies. However, the final model size depends on both the number
of parameters of the original model and the rate of compression. For example, a
30B 8-bit model and a 60B 4-bit model have the same number of bits but may have
very different zero-shot accuracies. In this work, we study this trade-off by
developing inference scaling laws of zero-shot performance in Large Language
Models (LLMs) to determine the bit-precision and model size that maximizes
zero-shot performance. We run more than 35,000 zero-shot experiments with
16-bit inputs and k-bit parameters to examine which quantization methods
improve scaling for 3 to 8-bit precision at scales of 19M to 66B parameters
across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. We find that it is
challenging to improve the bit-level scaling trade-off, with the only
improvements being the use of a small block size -- splitting the parameters
into small independently quantized blocks -- and the quantization data type
being used (e.g., Int vs Float). Overall, our findings show that 4-bit
precision is almost universally optimal for total model bits and zero-shot
accuracy.
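
To make the two knobs studied in the abstract concrete, here is a minimal sketch (an illustration, not the authors' released code) of block-wise absmax integer quantization with a configurable block size and bit width, plus the total-bits arithmetic behind the 30B 8-bit vs. 60B 4-bit comparison. The function name and the choice of absmax scaling are assumptions made for this example.

```python
import numpy as np

def blockwise_absmax_quantize(w, bits=4, block_size=64):
    """Quantize a 1D weight vector into independently scaled blocks.

    Each block stores `bits`-bit signed integer codes plus one scale value,
    so smaller blocks cost a little extra memory but bound how much a single
    outlier can distort its neighbours.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for int4
    n_blocks = int(np.ceil(len(w) / block_size))
    padded = np.zeros(n_blocks * block_size, dtype=np.float32)
    padded[: len(w)] = w
    blocks = padded.reshape(n_blocks, block_size)

    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12  # per-block absmax
    q = np.clip(np.round(blocks / scales * qmax), -qmax, qmax)  # integer codes
    dequant = (q / qmax) * scales                                # reconstruction
    return q.astype(np.int8), scales, dequant.reshape(-1)[: len(w)]

# Total-bits arithmetic behind the abstract's example: a 30B-parameter model
# at 8 bits and a 60B-parameter model at 4 bits use the same number of
# weight bits, yet can have very different zero-shot accuracy.
print(30e9 * 8 == 60e9 * 4)  # True

w = np.random.randn(1000).astype(np.float32)
q, scales, w_hat = blockwise_absmax_quantize(w, bits=4, block_size=64)
print("mean abs error:", np.abs(w - w_hat).mean())
```

In the paper's terms, shrinking `block_size` and changing the quantization data type (integer vs. float grids) are the only two changes found to improve the bit-level scaling trade-off.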
Related papers
- Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs [39.410068572891475]
Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead.
Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference.
We present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model.
arXiv Detail & Related papers (2023-11-21T05:27:16Z) - Memory Efficient Optimizers with 4-bit States [22.605392665667136]
- Memory Efficient Optimizers with 4-bit States [22.605392665667136]
We push optimizer state bitwidth down to 4-bit through a detailed empirical analysis of first and second moments.
We use a smaller block size and propose to utilize both row-wise and column-wise information for better quantization.
Our 4-bit optimizers are evaluated on a wide variety of benchmarks including natural language understanding, machine translation, image classification, and instruction tuning.
arXiv Detail & Related papers (2023-09-04T10:27:17Z) - QLoRA: Efficient Finetuning of Quantized LLMs [66.58009990713134]
- QLoRA: Efficient Finetuning of Quantized LLMs [66.58009990713134]
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU.
QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA).
Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark.
arXiv Detail & Related papers (2023-05-23T17:50:33Z) - DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural
Network Inference [28.912023025671868]
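
The QLoRA entry above can be sketched as a linear layer whose pretrained weight is stored 4-bit quantized and frozen, while only small low-rank adapter matrices receive gradients. The quantizer below is plain per-row absmax rather than QLoRA's NormalFloat with double quantization, and the class and argument names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def absmax_quantize_4bit(W):
    """Frozen base weight stored as int4 codes plus a per-row scale
    (simplified stand-in for QLoRA's NF4 + double quantization)."""
    s = np.abs(W).max(axis=1, keepdims=True) / 7 + 1e-12
    return np.clip(np.round(W / s), -7, 7).astype(np.int8), s

class LoRALinear:
    def __init__(self, W, rank=8, alpha=16):
        self.codes, self.scale = absmax_quantize_4bit(W)    # frozen, no gradients
        d_out, d_in = W.shape
        self.A = rng.normal(0, 0.01, (rank, d_in))           # trainable
        self.B = np.zeros((d_out, rank))                     # trainable, starts at 0
        self.scaling = alpha / rank

    def forward(self, x):
        W_hat = self.codes.astype(np.float32) * self.scale   # dequantize on the fly
        return x @ W_hat.T + self.scaling * (x @ self.A.T) @ self.B.T

layer = LoRALinear(rng.normal(size=(64, 32)))
print(layer.forward(rng.normal(size=(4, 32))).shape)  # (4, 64)
```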
- DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference [28.912023025671868]
This work targets an adaptive data representation with variable-length encoding called DyBit.
We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy and speedup.
Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization.
arXiv Detail & Related papers (2023-02-24T08:46:01Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - Pruning Ternary Quantization [32.32812780843498]
- Pruning Ternary Quantization [32.32812780843498]
Inference time, model size, and accuracy are three key factors in deep model compression.
We propose pruning ternary quantization (PTQ): a simple, effective, symmetric ternary quantization method.
Our method is verified on image classification, object detection/segmentation tasks with different network structures.
arXiv Detail & Related papers (2021-07-23T02:18:00Z) - Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
- Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator.
We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
arXiv Detail & Related papers (2021-04-20T14:14:03Z) - Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech
Recognition [65.7040645560855]
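
The core trick in the entry above can be sketched as adding, at training time, uniform noise whose width matches one quantization step, so no non-differentiable rounding appears in the training graph; the uniform absmax step size below is an illustrative assumption rather than the paper's exact parameterization.

```python
import numpy as np

def pseudo_quant_noise(w, bits=4, training=True, rng=np.random.default_rng(0)):
    """Differentiable stand-in for a uniform quantizer.

    Training: add independent uniform noise with the width of one quantization
    step. Inference: apply the real round-to-nearest quantizer."""
    step = 2 * np.abs(w).max() / (2 ** bits - 1)     # uniform quantization step
    if training:
        return w + rng.uniform(-step / 2, step / 2, size=w.shape)
    return np.round(w / step) * step

w = np.random.randn(5)
print(pseudo_quant_noise(w, training=True))
print(pseudo_quant_noise(w, training=False))
```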
- Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z) - HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)