ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
- URL: http://arxiv.org/abs/2307.09782v2
- Date: Thu, 20 Jul 2023 18:47:20 GMT
- Title: ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
- Authors: Xiaoxia Wu and Zhewei Yao and Yuxiong He
- Abstract summary: This study explores the viability of floating-point (FP) quantization for large language models (LLMs).
For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models possessing parameters beyond one billion.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
- Score: 25.543571445739936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the complex domain of large language models (LLMs), striking a balance
between computational efficiency and maintaining model quality is a formidable
challenge. Navigating the inherent limitations of uniform quantization,
particularly when dealing with outliers, and motivated by the launch of
NVIDIA's H100 hardware, this study delves into the viability of floating-point
(FP) quantization, particularly focusing on FP8 and FP4, as a potential
solution. Our comprehensive investigation reveals that for LLMs, FP8 activation
consistently outshines its integer (INT8) equivalent, with the performance edge
becoming more noticeable in models possessing parameters beyond one billion.
For weight quantization, our findings indicate that FP4 exhibits comparable, if
not superior, performance to INT4, simplifying deployment on FP-supported
hardware like H100. To mitigate the overhead from precision alignment caused by
the disparity between weights and activations, we propose two scaling
constraints for weight quantization that negligibly impact the performance
compared to the standard W4A8 model. We additionally enhance our quantization
methods by integrating the Low Rank Compensation (LoRC) strategy, yielding
improvements especially in smaller models. The results of our investigation
emphasize the immense potential of FP quantization for LLMs, paving the way for
high-efficiency deployment in resource-limited settings.
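The FP8-versus-INT8 claim is easiest to see on a heavy-tailed activation tensor: a uniform INT8 grid must be stretched to cover the outliers, while a floating-point grid keeps roughly constant relative precision for the bulk of the values. The sketch below is a minimal emulation under assumed settings (per-tensor symmetric INT8 and an E4M3-like grid with 3 mantissa bits and a 448 maximum); it illustrates the effect rather than reproducing the paper's measurement setup.

```python
# Minimal sketch: compare INT8 and FP8-style fake quantization of a
# heavy-tailed activation tensor. Formats and outlier pattern are assumptions.
import torch

def fake_quant_fp(x: torch.Tensor, mantissa_bits: int, max_value: float) -> torch.Tensor:
    # Round each value to `mantissa_bits` of mantissa at its own binary
    # exponent, emulating a low-bit floating-point grid (subnormals ignored).
    x = x.clamp(-max_value, max_value)
    exp = torch.floor(torch.log2(x.abs().clamp(min=1e-12)))
    step = torch.pow(2.0, exp - mantissa_bits)
    return torch.round(x / step) * step

def fake_quant_int8(x: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor INT8: a uniform grid sized by the largest outlier.
    scale = x.abs().max() / 127.0
    return torch.round(x / scale).clamp(-127, 127) * scale

# Mostly small activations with a few large outliers, as seen in LLMs.
act = torch.randn(4096) * 0.1
act[::512] *= 100.0

err_int8 = (act - fake_quant_int8(act)).abs().mean()
err_fp8 = (act - fake_quant_fp(act, mantissa_bits=3, max_value=448.0)).abs().mean()
print(f"INT8 mean abs error: {err_int8.item():.4f}  FP8-style mean abs error: {err_fp8.item():.4f}")
```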
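For the W4A8 path, the paper adds scaling constraints on the FP4 weights so that aligning them with FP8 activations is nearly free. Below is a minimal sketch of that idea, assuming an E2M1 (FP4) magnitude grid and a per-output-channel scale snapped to a power of two, so the rescale into the FP8 domain is only an exponent shift; the constraints actually used in the paper may differ in detail.

```python
# Minimal sketch: simulated FP4 (E2M1) weight quantization with the
# per-channel scale constrained to a power of two. Grid and constraint
# are illustrative assumptions, not the authors' exact recipe.
import torch

# Representable magnitudes of an E2M1 (FP4) format, up to sign.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_pow2_scale(w: torch.Tensor) -> torch.Tensor:
    """Fake-quantize an [out_features, in_features] weight matrix to FP4."""
    # Per-output-channel scale mapping the largest magnitude to 6.0 (FP4 max).
    amax = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = amax / FP4_GRID.max()
    # Constraint: snap the scale to the nearest power of two.
    scale = torch.pow(2.0, torch.round(torch.log2(scale)))
    # Snap each scaled weight to the nearest FP4 magnitude, keeping the sign.
    scaled = (w / scale).unsqueeze(-1)                    # [out, in, 1]
    idx = (scaled.abs() - FP4_GRID).abs().argmin(dim=-1)  # nearest grid point
    q = FP4_GRID[idx] * scaled.squeeze(-1).sign()
    # Dequantize back to the original range to simulate accuracy impact.
    return q * scale

w = torch.randn(16, 64)
w_q = quantize_fp4_pow2_scale(w)
print((w - w_q).abs().mean().item())  # average weight quantization error
```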
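The Low Rank Compensation (LoRC) step can likewise be sketched: approximate the quantization error E = W - Q(W) with a truncated SVD and add the rank-r product back at inference time. The rank and the reuse of the hypothetical quantize_fp4_pow2_scale from the previous sketch are illustrative choices, not the paper's tuned settings.

```python
# Minimal sketch of Low Rank Compensation (LoRC): compensate the weight
# quantization error with a rank-r factorization from a truncated SVD.
import torch

def lorc_factors(w: torch.Tensor, w_q: torch.Tensor, rank: int = 8):
    """Return low-rank factors (U_r, V_r) approximating W - Q(W)."""
    error = w - w_q                                    # E = W - Q(W)
    U, S, Vh = torch.linalg.svd(error, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                       # absorb singular values
    V_r = Vh[:rank, :]
    return U_r, V_r

w = torch.randn(16, 64)
w_q = quantize_fp4_pow2_scale(w)   # FP4 quantizer from the sketch above
U_r, V_r = lorc_factors(w, w_q, rank=8)
w_hat = w_q + U_r @ V_r            # compensated weight used at inference
print((w - w_q).abs().mean().item(), (w - w_hat).abs().mean().item())
```

For a rank-r correction on an m-by-n weight, the extra storage is r*(m+n) values, which stays small relative to the 4-bit weights for modest ranks.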
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [52.31791050376249]
Quantization can accelerate large language model (LLM) inference.
Existing INT4 quantization methods suffer from significant runtime overhead when dequantizing weights or partial sums.
We introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache.
QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100 and 1.4x on L40S, and of Qwen1.5-72B by 2.4x on A100 and 3.5x on L40S.
arXiv Detail & Related papers (2024-05-07T17:59:30Z)
- AffineQuant: Affine Transformation Quantization for Large Language Models [58.45460102764]
Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its compression efficiency and cost-effectiveness in the context of training.
Existing PTQ methods for Large-scale Language Models (LLMs) limit the optimization scope to scaling transformations between pre- and post-quantization weights.
In this paper, we advocate for direct optimization using equivalent Affine transformations in PTQ (AffineQuant).
arXiv Detail & Related papers (2024-03-19T08:40:21Z)
- ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks [31.431016659268206]
This study examines 4-bit quantization methods like GPTQ in large language models (LLMs).
We extend task scope to more generative categories such as code generation and abstractive summarization.
We propose a novel 4+2 design for FP6 to achieve similar latency to the state-of-the-art INT4 fine-grain quantization.
arXiv Detail & Related papers (2023-12-14T01:06:37Z)
- Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization [12.655230451207956]
This paper focuses on post-training quantization (PTQ) in Large Language Models (LLMs).
We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC).
We demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models.
arXiv Detail & Related papers (2023-11-09T06:19:51Z)
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers [38.23587031169402]
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values.
Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions.
Our method, for the first time, can quantize both weights and activations in the LLaMA-13B to only 4-bit and achieves an average score of 63.1.
arXiv Detail & Related papers (2023-10-25T17:59:32Z)
- FPTQ: Fine-grained Post-Training Quantization for Large Language Models [28.11564378745513]
We propose a novel W4A8 post-training quantization method for the available open-sourced LLMs.
We obtain the state-of-the-art W4A8 quantized performance on BLOOM, LLaMA, and LLaMA-2 on standard benchmarks.
arXiv Detail & Related papers (2023-08-30T12:18:18Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)