The case for 4-bit precision: k-bit Inference Scaling Laws
- URL: http://arxiv.org/abs/2212.09720v1
- Date: Mon, 19 Dec 2022 18:48:33 GMT
- Title: The case for 4-bit precision: k-bit Inference Scaling Laws
- Authors: Tim Dettmers, Luke Zettlemoyer
- Abstract summary: Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
- Score: 75.4335600212427
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantization methods reduce the number of bits required to represent each
parameter in a model, trading accuracy for smaller memory footprints and
inference latencies. However, the final model size depends on both the number
of parameters of the original model and the rate of compression. For example, a
30B 8-bit model and a 60B 4-bit model have the same number of bits but may have
very different zero-shot accuracies. In this work, we study this trade-off by
developing inference scaling laws of zero-shot performance in Large Language
Models (LLMs) to determine the bit-precision and model size that maximizes
zero-shot performance. We run more than 35,000 zero-shot experiments with
16-bit inputs and k-bit parameters to examine which quantization methods
improve scaling for 3 to 8-bit precision at scales of 19M to 66B parameters
across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. We find that it is
challenging to improve the bit-level scaling trade-off, with the only
improvements being the use of a small block size -- splitting the parameters
into small independently quantized blocks -- and the quantization data type
being used (e.g., Int vs Float). Overall, our findings show that 4-bit
precision is almost universally optimal for total model bits and zero-shot
accuracy.
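
To make the two knobs studied in the abstract concrete, here is a minimal sketch (an illustration, not the authors' released code) of block-wise absmax integer quantization with a configurable block size and bit width, plus the total-bits arithmetic behind the 30B 8-bit vs. 60B 4-bit comparison. The function name and the choice of absmax scaling are assumptions made for this example.

```python
import numpy as np

def blockwise_absmax_quantize(w, bits=4, block_size=64):
    """Quantize a 1D weight vector into independently scaled blocks.

    Each block stores `bits`-bit signed integer codes plus one scale value,
    so smaller blocks cost a little extra memory but bound how much a single
    outlier can distort its neighbours.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for int4
    n_blocks = int(np.ceil(len(w) / block_size))
    padded = np.zeros(n_blocks * block_size, dtype=np.float32)
    padded[: len(w)] = w
    blocks = padded.reshape(n_blocks, block_size)

    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12  # per-block absmax
    q = np.clip(np.round(blocks / scales * qmax), -qmax, qmax)  # integer codes
    dequant = (q / qmax) * scales                                # reconstruction
    return q.astype(np.int8), scales, dequant.reshape(-1)[: len(w)]

# Total-bits arithmetic behind the abstract's example: a 30B-parameter model
# at 8 bits and a 60B-parameter model at 4 bits use the same number of
# weight bits, yet can have very different zero-shot accuracy.
print(30e9 * 8 == 60e9 * 4)  # True

w = np.random.randn(1000).astype(np.float32)
q, scales, w_hat = blockwise_absmax_quantize(w, bits=4, block_size=64)
print("mean abs error:", np.abs(w - w_hat).mean())
```

In the paper's terms, shrinking `block_size` and changing the quantization data type (integer vs. float grids) are the only two changes found to improve the bit-level scaling trade-off.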
Related papers
- Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs [39.410068572891475]
Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead.
Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference.
We present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model.
arXiv Detail & Related papers (2023-11-21T05:27:16Z) - Memory Efficient Optimizers with 4-bit States [22.605392665667136]
- Memory Efficient Optimizers with 4-bit States [22.605392665667136]
We push optimizer state bitwidth down to 4-bit through a detailed empirical analysis of first and second moments.
We use a smaller block size and propose to utilize both row-wise and column-wise information for better quantization.
Our 4-bit optimizers are evaluated on a wide variety of benchmarks including natural language understanding, machine translation, image classification, and instruction tuning.
arXiv Detail & Related papers (2023-09-04T10:27:17Z) - QLoRA: Efficient Finetuning of Quantized LLMs [66.58009990713134]
- QLoRA: Efficient Finetuning of Quantized LLMs [66.58009990713134]
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU.
QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA).
Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark.
arXiv Detail & Related papers (2023-05-23T17:50:33Z) - DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural
Network Inference [28.912023025671868]
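
The QLoRA entry above can be sketched as a linear layer whose pretrained weight is stored 4-bit quantized and frozen, while only small low-rank adapter matrices receive gradients. The quantizer below is plain per-row absmax rather than QLoRA's NormalFloat with double quantization, and the class and argument names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def absmax_quantize_4bit(W):
    """Frozen base weight stored as int4 codes plus a per-row scale
    (simplified stand-in for QLoRA's NF4 + double quantization)."""
    s = np.abs(W).max(axis=1, keepdims=True) / 7 + 1e-12
    return np.clip(np.round(W / s), -7, 7).astype(np.int8), s

class LoRALinear:
    def __init__(self, W, rank=8, alpha=16):
        self.codes, self.scale = absmax_quantize_4bit(W)    # frozen, no gradients
        d_out, d_in = W.shape
        self.A = rng.normal(0, 0.01, (rank, d_in))           # trainable
        self.B = np.zeros((d_out, rank))                     # trainable, starts at 0
        self.scaling = alpha / rank

    def forward(self, x):
        W_hat = self.codes.astype(np.float32) * self.scale   # dequantize on the fly
        return x @ W_hat.T + self.scaling * (x @ self.A.T) @ self.B.T

layer = LoRALinear(rng.normal(size=(64, 32)))
print(layer.forward(rng.normal(size=(4, 32))).shape)  # (4, 64)
```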
- DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference [28.912023025671868]
This work targets an adaptive data representation with variable-length encoding called DyBit.
We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy and speedup.
Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization.
arXiv Detail & Related papers (2023-02-24T08:46:01Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - Pruning Ternary Quantization [32.32812780843498]
- Pruning Ternary Quantization [32.32812780843498]
Inference time, model size, and accuracy are three key factors in deep model compression.
We propose pruning ternary quantization (PTQ): a simple, effective, symmetric ternary quantization method.
Our method is verified on image classification, object detection/segmentation tasks with different network structures.
arXiv Detail & Related papers (2021-07-23T02:18:00Z) - Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
- Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator.
We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
arXiv Detail & Related papers (2021-04-20T14:14:03Z) - Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech
Recognition [65.7040645560855]
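
The core trick in the entry above can be sketched as adding, at training time, uniform noise whose width matches one quantization step, so no non-differentiable rounding appears in the training graph; the uniform absmax step size below is an illustrative assumption rather than the paper's exact parameterization.

```python
import numpy as np

def pseudo_quant_noise(w, bits=4, training=True, rng=np.random.default_rng(0)):
    """Differentiable stand-in for a uniform quantizer.

    Training: add independent uniform noise with the width of one quantization
    step. Inference: apply the real round-to-nearest quantizer."""
    step = 2 * np.abs(w).max() / (2 ** bits - 1)     # uniform quantization step
    if training:
        return w + rng.uniform(-step / 2, step / 2, size=w.shape)
    return np.round(w / step) * step

w = np.random.randn(5)
print(pseudo_quant_noise(w, training=True))
print(pseudo_quant_noise(w, training=False))
```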
- Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z) - HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)