ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
- URL: http://arxiv.org/abs/2307.09782v2
- Date: Thu, 20 Jul 2023 18:47:20 GMT
- Title: ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
- Authors: Xiaoxia Wu and Zhewei Yao and Yuxiong He
- Abstract summary: This study explores the viability of floating-point (FP) quantization for large language models (LLMs).
For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models possessing parameters beyond one billion.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
- Score: 25.543571445739936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the complex domain of large language models (LLMs), striking a balance
between computational efficiency and maintaining model quality is a formidable
challenge. Navigating the inherent limitations of uniform quantization,
particularly when dealing with outliers, and motivated by the launch of
NVIDIA's H100 hardware, this study delves into the viability of floating-point
(FP) quantization, particularly focusing on FP8 and FP4, as a potential
solution. Our comprehensive investigation reveals that for LLMs, FP8 activation
consistently outshines its integer (INT8) equivalent, with the performance edge
becoming more noticeable in models possessing parameters beyond one billion.
For weight quantization, our findings indicate that FP4 exhibits comparable, if
not superior, performance to INT4, simplifying deployment on FP-supported
hardware like H100. To mitigate the overhead from precision alignment caused by
the disparity between weights and activations, we propose two scaling
constraints for weight quantization that negligibly impact the performance
compared to the standard W4A8 model. We additionally enhance our quantization
methods by integrating the Low Rank Compensation (LoRC) strategy, yielding
improvements especially in smaller models. The results of our investigation
emphasize the immense potential of FP quantization for LLMs, paving the way for
high-efficiency deployment in resource-limited settings.
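The FP8-versus-INT8 claim is easiest to see on a heavy-tailed activation tensor: a uniform INT8 grid must be stretched to cover the outliers, while a floating-point grid keeps roughly constant relative precision for the bulk of the values. The sketch below is a minimal emulation under assumed settings (per-tensor symmetric INT8 and an E4M3-like grid with 3 mantissa bits and a 448 maximum); it illustrates the effect rather than reproducing the paper's measurement setup.

```python
# Minimal sketch: compare INT8 and FP8-style fake quantization of a
# heavy-tailed activation tensor. Formats and outlier pattern are assumptions.
import torch

def fake_quant_fp(x: torch.Tensor, mantissa_bits: int, max_value: float) -> torch.Tensor:
    # Round each value to `mantissa_bits` of mantissa at its own binary
    # exponent, emulating a low-bit floating-point grid (subnormals ignored).
    x = x.clamp(-max_value, max_value)
    exp = torch.floor(torch.log2(x.abs().clamp(min=1e-12)))
    step = torch.pow(2.0, exp - mantissa_bits)
    return torch.round(x / step) * step

def fake_quant_int8(x: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor INT8: a uniform grid sized by the largest outlier.
    scale = x.abs().max() / 127.0
    return torch.round(x / scale).clamp(-127, 127) * scale

# Mostly small activations with a few large outliers, as seen in LLMs.
act = torch.randn(4096) * 0.1
act[::512] *= 100.0

err_int8 = (act - fake_quant_int8(act)).abs().mean()
err_fp8 = (act - fake_quant_fp(act, mantissa_bits=3, max_value=448.0)).abs().mean()
print(f"INT8 mean abs error: {err_int8.item():.4f}  FP8-style mean abs error: {err_fp8.item():.4f}")
```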
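For the W4A8 path, the paper adds scaling constraints on the FP4 weights so that aligning them with FP8 activations is nearly free. Below is a minimal sketch of that idea, assuming an E2M1 (FP4) magnitude grid and a per-output-channel scale snapped to a power of two, so the rescale into the FP8 domain is only an exponent shift; the constraints actually used in the paper may differ in detail.

```python
# Minimal sketch: simulated FP4 (E2M1) weight quantization with the
# per-channel scale constrained to a power of two. Grid and constraint
# are illustrative assumptions, not the authors' exact recipe.
import torch

# Representable magnitudes of an E2M1 (FP4) format, up to sign.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_pow2_scale(w: torch.Tensor) -> torch.Tensor:
    """Fake-quantize an [out_features, in_features] weight matrix to FP4."""
    # Per-output-channel scale mapping the largest magnitude to 6.0 (FP4 max).
    amax = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = amax / FP4_GRID.max()
    # Constraint: snap the scale to the nearest power of two.
    scale = torch.pow(2.0, torch.round(torch.log2(scale)))
    # Snap each scaled weight to the nearest FP4 magnitude, keeping the sign.
    scaled = (w / scale).unsqueeze(-1)                    # [out, in, 1]
    idx = (scaled.abs() - FP4_GRID).abs().argmin(dim=-1)  # nearest grid point
    q = FP4_GRID[idx] * scaled.squeeze(-1).sign()
    # Dequantize back to the original range to simulate accuracy impact.
    return q * scale

w = torch.randn(16, 64)
w_q = quantize_fp4_pow2_scale(w)
print((w - w_q).abs().mean().item())  # average weight quantization error
```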
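The Low Rank Compensation (LoRC) step can likewise be sketched: approximate the quantization error E = W - Q(W) with a truncated SVD and add the rank-r product back at inference time. The rank and the reuse of the hypothetical quantize_fp4_pow2_scale from the previous sketch are illustrative choices, not the paper's tuned settings.

```python
# Minimal sketch of Low Rank Compensation (LoRC): compensate the weight
# quantization error with a rank-r factorization from a truncated SVD.
import torch

def lorc_factors(w: torch.Tensor, w_q: torch.Tensor, rank: int = 8):
    """Return low-rank factors (U_r, V_r) approximating W - Q(W)."""
    error = w - w_q                                    # E = W - Q(W)
    U, S, Vh = torch.linalg.svd(error, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                       # absorb singular values
    V_r = Vh[:rank, :]
    return U_r, V_r

w = torch.randn(16, 64)
w_q = quantize_fp4_pow2_scale(w)   # FP4 quantizer from the sketch above
U_r, V_r = lorc_factors(w, w_q, rank=8)
w_hat = w_q + U_r @ V_r            # compensated weight used at inference
print((w - w_q).abs().mean().item(), (w - w_hat).abs().mean().item())
```

For a rank-r correction on an m-by-n weight, the extra storage is r*(m+n) values, which stays small relative to the 4-bit weights for modest ranks.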
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [52.31791050376249]
Quantization can accelerate large language model (LLM) inference.
Existing INT4 quantization methods suffer from significant runtime overhead when dequantizing weights or partial sums.
We introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache.
QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100 and 1.4x on L40S, and of Qwen1.5-72B by 2.4x on A100 and 3.5x on L40S.
arXiv Detail & Related papers (2024-05-07T17:59:30Z)
- AffineQuant: Affine Transformation Quantization for Large Language Models [58.45460102764]
Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its compression efficiency and cost-effectiveness in the context of training.
Existing PTQ methods for Large-scale Language Models (LLMs) limit the optimization scope to scaling transformations between pre- and post-quantization weights.
In this paper, we advocate for direct optimization using equivalent Affine transformations in PTQ (AffineQuant).
arXiv Detail & Related papers (2024-03-19T08:40:21Z)
- ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks [31.431016659268206]
This study examines 4-bit quantization methods like GPTQ in large language models (LLMs).
We extend task scope to more generative categories such as code generation and abstractive summarization.
We propose a novel 4+2 design for FP6 to achieve similar latency to the state-of-the-art INT4 fine-grain quantization.
arXiv Detail & Related papers (2023-12-14T01:06:37Z)
- Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization [12.655230451207956]
This paper focuses on post-training quantization (PTQ) in Large Language Models (LLMs).
We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC).
We demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models.
arXiv Detail & Related papers (2023-11-09T06:19:51Z)
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers [38.23587031169402]
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values.
Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions.
Our method, for the first time, can quantize both weights and activations in the LLaMA-13B to only 4-bit and achieves an average score of 63.1.
arXiv Detail & Related papers (2023-10-25T17:59:32Z)
- FPTQ: Fine-grained Post-Training Quantization for Large Language Models [28.11564378745513]
We propose a novel W4A8 post-training quantization method for the available open-sourced LLMs.
We obtain the state-of-the-art W4A8 quantized performance on BLOOM, LLaMA, and LLaMA-2 on standard benchmarks.
arXiv Detail & Related papers (2023-08-30T12:18:18Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)