Integer or Floating Point? New Outlooks for Low-Bit Quantization on
Large Language Models
- URL: http://arxiv.org/abs/2305.12356v1
- Date: Sun, 21 May 2023 05:28:37 GMT
- Title: Integer or Floating Point? New Outlooks for Low-Bit Quantization on
Large Language Models
- Authors: Yijia Zhang, Lingran Zhao, Shijie Cao, Wenqiang Wang, Ting Cao, Fan
Yang, Mao Yang, Shanghang Zhang, Ningyi Xu
- Abstract summary: Low-bit integer formats (e.g., INT8/INT4) have been the conventional choice for quantizing large language models (LLMs).
Low-bit floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU.
We propose the Mixture of Formats Quantization (MoFQ), which selects the optimal format on a layer-wise basis.
- Score: 17.055400141733124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient deployment of large language models (LLMs) necessitates low-bit
quantization to minimize model size and inference cost. While low-bit integer
formats (e.g., INT8/INT4) have been the conventional choice, emerging low-bit
floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are
gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU. However,
the superiority of low-bit INT versus FP formats for quantization on LLMs
remains unclear. In this study, we conduct a comparative analysis of INT and FP
quantization with the same bit-width, revealing that the optimal quantization
format varies across different layers due to the complexity and diversity of
tensor distribution. Consequently, we advocate the Mixture of Formats
Quantization (MoFQ), which selects the optimal format on a layer-wise basis.
This simple yet effective approach achieves state-of-the-art results in both
weight-only (W-only) and weight-activation (WA) post-training quantization
scenarios when tested on LLaMA across various tasks. In 4-bit W-only
quantization, MoFQ surpasses GPTQ without complex hyperparameter tuning and
with an order of magnitude faster quantization speed. In 8-bit WA quantization,
MoFQ significantly outperforms INT/FP-only methods, achieving
performance close to the full precision model. Notably, MoFQ incurs no hardware
overhead compared to INT/FP-only quantization, as the bit-width remains
unchanged.
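The layer-wise selection can be illustrated with a short sketch. This is not the authors' implementation: the FP4-style value grid, the per-tensor scaling, and the mean-squared-error criterion below are assumptions made for illustration. The point is simply that each layer's tensor is quantized under both candidate formats at the same bit-width and the lower-error format is kept.

    import numpy as np

    # Illustrative FP4-style value grid (E2M1-like magnitudes); the exact grid is
    # an assumption for this sketch, not a detail taken from the paper.
    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    FP4_GRID = np.concatenate([-FP4_GRID[:0:-1], FP4_GRID])  # symmetric +/- grid

    def quant_int4(w):
        """Symmetric per-tensor INT4 quantization (integer levels -8..7)."""
        scale = np.abs(w).max() / 7.0 + 1e-12
        return np.clip(np.round(w / scale), -8, 7) * scale

    def quant_fp4(w):
        """Round each value to the nearest entry of a scaled FP4-style grid."""
        scale = np.abs(w).max() / FP4_GRID.max() + 1e-12
        idx = np.abs(w[..., None] / scale - FP4_GRID).argmin(axis=-1)
        return FP4_GRID[idx] * scale

    def choose_format(w):
        """Keep whichever 4-bit format gives the lower reconstruction error."""
        err_int = np.mean((w - quant_int4(w)) ** 2)
        err_fp = np.mean((w - quant_fp4(w)) ** 2)
        return ("INT4", err_int) if err_int <= err_fp else ("FP4", err_fp)

    # Toy layers: one roughly Gaussian weight tensor and one heavy-tailed one.
    rng = np.random.default_rng(0)
    layers = {
        "mlp.fc1": rng.normal(size=(128, 128)).astype(np.float32),
        "attn.out": (0.1 * rng.standard_t(df=2, size=(128, 128))).astype(np.float32),
    }
    for name, w in layers.items():
        fmt, err = choose_format(w)
        print(f"{name}: use {fmt} (mse={err:.2e})")

Because the choice is made offline and the bit-width is identical for both formats, the selection itself adds no inference-time cost, which is the basis of the no-hardware-overhead claim in the abstract.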
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, as well as for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
- Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization [62.15918574997175]
It is known that language models contain outlier channels whose values on average are orders of magnitude higher than those of other channels.
We propose a strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization.
We show that regularizing both the inputs and outputs is crucial for preventing the model from "migrating" the difficulty of input quantization to the weights.
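As a rough illustration of the output-side regularizer, the sketch below adds a kurtosis penalty on a layer's activations to the training loss; the penalty weight and the Gaussian target of 3.0 are illustrative assumptions, not the paper's settings.

    import torch

    def kurtosis(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        """Sample kurtosis E[(x - mu)^4] / sigma^4 of a flattened activation tensor."""
        x = x.flatten().float()
        mu = x.mean()
        var = ((x - mu) ** 2).mean() + eps
        return ((x - mu) ** 4).mean() / var ** 2

    def loss_with_kurtosis_penalty(task_loss, activations, lam=1e-3, target=3.0):
        """Add a penalty pushing activation kurtosis toward a chosen target.

        lam and target (3.0, the kurtosis of a Gaussian) are illustrative
        assumptions, not values taken from the paper.
        """
        penalty = sum((kurtosis(a) - target) ** 2 for a in activations)
        return task_loss + lam * penalty

In an actual QAT run the activations would typically be collected with forward hooks on the layers of interest and the penalty backpropagated together with the task loss; which layers to regularize and how strongly are choices the paper tunes that this sketch does not reproduce.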
arXiv Detail & Related papers (2024-04-04T17:25:30Z)
- ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks [31.431016659268206]
This study examines 4-bit quantization methods such as GPTQ in large language models (LLMs).
We extend the task scope to more generative categories, such as code generation and abstractive summarization.
We propose a novel 4+2 design for FP6 that achieves latency similar to state-of-the-art INT4 fine-grained quantization.
arXiv Detail & Related papers (2023-12-14T01:06:37Z)
- AFPQ: Asymmetric Floating Point Quantization for LLMs [6.176074875528637]
We propose asymmetric FP quantization (AFPQ), which sets separate scales for positive and negative values.
Our method leads to large accuracy improvements and can be easily plugged into other quantization methods, including GPTQ and AWQ.
No additional storage is needed compared with asymmetric integer (INT) quantization.
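The central idea, giving the positive and negative halves of a weight tensor their own scales, can be sketched as follows; the FP4-style grid and the per-tensor granularity are assumptions for illustration rather than the paper's exact scheme.

    import numpy as np

    # Positive magnitudes of an illustrative FP4-style grid; the grid and the
    # per-tensor granularity are assumptions, not details from the paper.
    FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def round_to_grid(x, grid):
        """Round every entry of x to the nearest value in a 1-D grid."""
        return grid[np.abs(x[..., None] - grid).argmin(axis=-1)]

    def afpq_quantize(w):
        """Asymmetric FP quantization: one scale for w >= 0, another for w < 0."""
        scale_pos = max(float(w.max()), 0.0) / FP4_POS.max() + 1e-12
        scale_neg = max(float(-w.min()), 0.0) / FP4_POS.max() + 1e-12
        q_pos = round_to_grid(w / scale_pos, FP4_POS) * scale_pos
        q_neg = -round_to_grid(-w / scale_neg, FP4_POS) * scale_neg
        return np.where(w >= 0, q_pos, q_neg).astype(w.dtype)

    # Quantize a skewed weight tensor and report the reconstruction error.
    w = np.random.default_rng(1).normal(0.3, 1.0, size=(64, 64)).astype(np.float32)
    print(np.mean((w - afpq_quantize(w)) ** 2))

In this reading, the second scale takes the place of the zero point that asymmetric INT quantization stores, which matches the no-additional-storage claim above.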
arXiv Detail & Related papers (2023-11-03T09:07:09Z)
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers [38.23587031169402]
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values.
Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions.
Our method, for the first time, can quantize both weights and activations in LLaMA-13B to only 4 bits and achieves an average score of 63.1.
arXiv Detail & Related papers (2023-10-25T17:59:32Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing the memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats [25.543571445739936]
This study explores the viability of floating-point (FP) quantization for large language models (LLMs).
For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models with more than one billion parameters.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
arXiv Detail & Related papers (2023-07-19T06:58:03Z)
- Distribution-Flexible Subset Quantization for Post-Quantizing Super-Resolution Networks [68.83451203841624]
This paper introduces Distribution-Flexible Subset Quantization (DFSQ), a post-training quantization method for super-resolution networks.
DFSQ conducts channel-wise normalization of the activations and applies distribution-flexible subset quantization (SQ).
It achieves comparable performance to full-precision counterparts on 6- and 8-bit quantization, and incurs only a 0.1 dB PSNR drop on 4-bit quantization.
arXiv Detail & Related papers (2023-05-10T04:19:11Z)
- Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization (PTQ) of PLMs and propose module-wise quantization error minimization (MREM), an efficient solution that narrows the gap to quantization-aware training (QAT).
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
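The "dyadic" in the title refers to expressing rescaling factors as dyadic rationals of the form m / 2^k, so that requantization needs only integer multiplies and bit shifts instead of the float round trips mentioned above. A minimal sketch of that trick, with an assumed 16-bit shift width (not a setting from the paper), follows.

    def dyadic_approx(scale: float, shift_bits: int = 16):
        """Approximate a positive float scale as m / 2**shift_bits with integer m."""
        m = round(scale * (1 << shift_bits))
        return m, shift_bits

    def requantize(acc: int, m: int, k: int) -> int:
        """Integer-only rescaling: round(acc * scale) via multiply, add, and shift."""
        return (acc * m + (1 << (k - 1))) >> k

    # A float requantization scale of ~0.0173 becomes an integer multiply followed
    # by a 16-bit right shift; Python's big integers stand in for the fixed-width
    # arithmetic a real integer-only kernel would use.
    m, k = dyadic_approx(0.0173)
    print(requantize(12345, m, k), round(12345 * 0.0173))  # both print 214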
arXiv Detail & Related papers (2020-11-20T23:51:43Z)