Integer or Floating Point? New Outlooks for Low-Bit Quantization on
Large Language Models
- URL: http://arxiv.org/abs/2305.12356v1
- Date: Sun, 21 May 2023 05:28:37 GMT
- Title: Integer or Floating Point? New Outlooks for Low-Bit Quantization on
Large Language Models
- Authors: Yijia Zhang, Lingran Zhao, Shijie Cao, Wenqiang Wang, Ting Cao, Fan
Yang, Mao Yang, Shanghang Zhang, Ningyi Xu
- Abstract summary: Low-bit integer formats (e.g., INT8/INT4) have been the conventional choice for quantizing large language models (LLMs).
Low-bit floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU.
We propose the Mixture of Formats Quantization (MoFQ), which selects the optimal format on a layer-wise basis.
- Score: 17.055400141733124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient deployment of large language models (LLMs) necessitates low-bit
quantization to minimize model size and inference cost. While low-bit integer
formats (e.g., INT8/INT4) have been the conventional choice, emerging low-bit
floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are
gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU. However,
the superiority of low-bit INT versus FP formats for quantization on LLMs
remains unclear. In this study, we conduct a comparative analysis of INT and FP
quantization with the same bit-width, revealing that the optimal quantization
format varies across different layers due to the complexity and diversity of
tensor distribution. Consequently, we advocate the Mixture of Formats
Quantization (MoFQ), which selects the optimal format on a layer-wise basis.
This simple yet effective approach achieves state-of-the-art results in both
weight-only (W-only) and weight-activation (WA) post-training quantization
scenarios when tested on LLaMA across various tasks. In 4-bit W-only
quantization, MoFQ surpasses GPTQ without complex hyperparameter tuning and
with an order of magnitude faster quantization speed. In 8-bit WA quantization,
MoFQ significantly outperforms INT/FP-only methods, achieving
performance close to the full precision model. Notably, MoFQ incurs no hardware
overhead compared to INT/FP-only quantization, as the bit-width remains
unchanged.
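The layer-wise selection can be illustrated with a short sketch. This is not the authors' implementation: the FP4-style value grid, the per-tensor scaling, and the mean-squared-error criterion below are assumptions made for illustration. The point is simply that each layer's tensor is quantized under both candidate formats at the same bit-width and the lower-error format is kept.

    import numpy as np

    # Illustrative FP4-style value grid (E2M1-like magnitudes); the exact grid is
    # an assumption for this sketch, not a detail taken from the paper.
    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    FP4_GRID = np.concatenate([-FP4_GRID[:0:-1], FP4_GRID])  # symmetric +/- grid

    def quant_int4(w):
        """Symmetric per-tensor INT4 quantization (integer levels -8..7)."""
        scale = np.abs(w).max() / 7.0 + 1e-12
        return np.clip(np.round(w / scale), -8, 7) * scale

    def quant_fp4(w):
        """Round each value to the nearest entry of a scaled FP4-style grid."""
        scale = np.abs(w).max() / FP4_GRID.max() + 1e-12
        idx = np.abs(w[..., None] / scale - FP4_GRID).argmin(axis=-1)
        return FP4_GRID[idx] * scale

    def choose_format(w):
        """Keep whichever 4-bit format gives the lower reconstruction error."""
        err_int = np.mean((w - quant_int4(w)) ** 2)
        err_fp = np.mean((w - quant_fp4(w)) ** 2)
        return ("INT4", err_int) if err_int <= err_fp else ("FP4", err_fp)

    # Toy layers: one roughly Gaussian weight tensor and one heavy-tailed one.
    rng = np.random.default_rng(0)
    layers = {
        "mlp.fc1": rng.normal(size=(128, 128)).astype(np.float32),
        "attn.out": (0.1 * rng.standard_t(df=2, size=(128, 128))).astype(np.float32),
    }
    for name, w in layers.items():
        fmt, err = choose_format(w)
        print(f"{name}: use {fmt} (mse={err:.2e})")

Because the choice is made offline and the bit-width is identical for both formats, the selection itself adds no inference-time cost, which is the basis of the no-hardware-overhead claim in the abstract.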
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, as well as for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
- Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization [62.15918574997175]
It is known that language models contain outlier channels whose values on average are orders of magnitude higher than those of other channels.
We propose a strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization.
We show that regularizing both the inputs and outputs is crucial for preventing the model from "migrating" the difficulty of input quantization to the weights.
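As a rough illustration of the output-side regularizer, the sketch below adds a kurtosis penalty on a layer's activations to the training loss; the penalty weight and the Gaussian target of 3.0 are illustrative assumptions, not the paper's settings.

    import torch

    def kurtosis(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        """Sample kurtosis E[(x - mu)^4] / sigma^4 of a flattened activation tensor."""
        x = x.flatten().float()
        mu = x.mean()
        var = ((x - mu) ** 2).mean() + eps
        return ((x - mu) ** 4).mean() / var ** 2

    def loss_with_kurtosis_penalty(task_loss, activations, lam=1e-3, target=3.0):
        """Add a penalty pushing activation kurtosis toward a chosen target.

        lam and target (3.0, the kurtosis of a Gaussian) are illustrative
        assumptions, not values taken from the paper.
        """
        penalty = sum((kurtosis(a) - target) ** 2 for a in activations)
        return task_loss + lam * penalty

In an actual QAT run the activations would typically be collected with forward hooks on the layers of interest and the penalty backpropagated together with the task loss; which layers to regularize and how strongly are choices the paper tunes that this sketch does not reproduce.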
arXiv Detail & Related papers (2024-04-04T17:25:30Z)
- ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks [31.431016659268206]
This study examines 4-bit quantization methods such as GPTQ in large language models (LLMs).
We extend the task scope to more generative categories, such as code generation and abstractive summarization.
We propose a novel 4+2 design for FP6 that achieves latency similar to state-of-the-art INT4 fine-grained quantization.
arXiv Detail & Related papers (2023-12-14T01:06:37Z)
- AFPQ: Asymmetric Floating Point Quantization for LLMs [6.176074875528637]
We propose asymmetric FP quantization (AFPQ), which sets separate scales for positive and negative values.
Our method leads to large accuracy improvements and can be easily plugged into other quantization methods, including GPTQ and AWQ.
No additional storage is needed compared with asymmetric integer (INT) quantization.
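The central idea, giving the positive and negative halves of a weight tensor their own scales, can be sketched as follows; the FP4-style grid and the per-tensor granularity are assumptions for illustration rather than the paper's exact scheme.

    import numpy as np

    # Positive magnitudes of an illustrative FP4-style grid; the grid and the
    # per-tensor granularity are assumptions, not details from the paper.
    FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def round_to_grid(x, grid):
        """Round every entry of x to the nearest value in a 1-D grid."""
        return grid[np.abs(x[..., None] - grid).argmin(axis=-1)]

    def afpq_quantize(w):
        """Asymmetric FP quantization: one scale for w >= 0, another for w < 0."""
        scale_pos = max(float(w.max()), 0.0) / FP4_POS.max() + 1e-12
        scale_neg = max(float(-w.min()), 0.0) / FP4_POS.max() + 1e-12
        q_pos = round_to_grid(w / scale_pos, FP4_POS) * scale_pos
        q_neg = -round_to_grid(-w / scale_neg, FP4_POS) * scale_neg
        return np.where(w >= 0, q_pos, q_neg).astype(w.dtype)

    # Quantize a skewed weight tensor and report the reconstruction error.
    w = np.random.default_rng(1).normal(0.3, 1.0, size=(64, 64)).astype(np.float32)
    print(np.mean((w - afpq_quantize(w)) ** 2))

In this reading, the second scale takes the place of the zero point that asymmetric INT quantization stores, which matches the no-additional-storage claim above.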
arXiv Detail & Related papers (2023-11-03T09:07:09Z)
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers [38.23587031169402]
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values.
Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions.
Our method, for the first time, can quantize both weights and activations in LLaMA-13B to only 4 bits and achieves an average score of 63.1.
arXiv Detail & Related papers (2023-10-25T17:59:32Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing the memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats [25.543571445739936]
This study explores the viability of floating-point (FP) quantization for large language models (LLMs).
For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models with more than one billion parameters.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
arXiv Detail & Related papers (2023-07-19T06:58:03Z)
- Distribution-Flexible Subset Quantization for Post-Quantizing Super-Resolution Networks [68.83451203841624]
This paper introduces Distribution-Flexible Subset Quantization (DFSQ), a post-training quantization method for super-resolution networks.
DFSQ conducts channel-wise normalization of the activations and applies distribution-flexible subset quantization (SQ).
It achieves comparable performance to full-precision counterparts on 6- and 8-bit quantization, and incurs only a 0.1 dB PSNR drop on 4-bit quantization.
arXiv Detail & Related papers (2023-05-10T04:19:11Z)
- Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization (PTQ) of PLMs and propose module-wise quantization error minimization (MREM), an efficient solution that narrows the gap to quantization-aware training (QAT).
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
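The "dyadic" in the title refers to expressing rescaling factors as dyadic rationals of the form m / 2^k, so that requantization needs only integer multiplies and bit shifts instead of the float round trips mentioned above. A minimal sketch of that trick, with an assumed 16-bit shift width (not a setting from the paper), follows.

    def dyadic_approx(scale: float, shift_bits: int = 16):
        """Approximate a positive float scale as m / 2**shift_bits with integer m."""
        m = round(scale * (1 << shift_bits))
        return m, shift_bits

    def requantize(acc: int, m: int, k: int) -> int:
        """Integer-only rescaling: round(acc * scale) via multiply, add, and shift."""
        return (acc * m + (1 << (k - 1))) >> k

    # A float requantization scale of ~0.0173 becomes an integer multiply followed
    # by a 16-bit right shift; Python's big integers stand in for the fixed-width
    # arithmetic a real integer-only kernel would use.
    m, k = dyadic_approx(0.0173)
    print(requantize(12345, m, k), round(12345 * 0.0173))  # both print 214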
arXiv Detail & Related papers (2020-11-20T23:51:43Z)