LLM-FP4: 4-Bit Floating-Point Quantized Transformers
- URL: http://arxiv.org/abs/2310.16836v1
- Date: Wed, 25 Oct 2023 17:59:32 GMT
- Title: LLM-FP4: 4-Bit Floating-Point Quantized Transformers
- Authors: Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting
Cheng
- Abstract summary: We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values.
Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions.
Our method, for the first time, can quantize both weights and activations in the LLaMA-13B to only 4-bit and achieves an average score of 63.1.
- Score: 38.23587031169402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose LLM-FP4 for quantizing both weights and activations in large
language models (LLMs) down to 4-bit floating-point values, in a post-training
manner. Existing post-training quantization (PTQ) solutions are primarily
integer-based and struggle with bit widths below 8 bits. Compared to integer
quantization, floating-point (FP) quantization is more flexible and can better
handle long-tail or bell-shaped distributions, and it has emerged as a default
choice in many hardware platforms. One characteristic of FP quantization is
that its performance largely depends on the choice of exponent bits and
clipping range. In this regard, we construct a strong FP-PTQ baseline by
searching for the optimal quantization parameters. Furthermore, we observe a
high inter-channel variance and low intra-channel variance pattern in
activation distributions, which adds activation quantization difficulty. We
recognize this pattern to be consistent across a spectrum of transformer models
designed for diverse tasks, such as LLMs, BERT, and Vision Transformer models.
To tackle this, we propose per-channel activation quantization and show that
these additional scaling factors can be reparameterized as exponential biases
of weights, incurring a negligible cost. Our method, for the first time, can
quantize both weights and activations in the LLaMA-13B to only 4-bit and
achieves an average score of 63.1 on the common sense zero-shot reasoning
tasks, which is only 5.8 lower than the full-precision model, significantly
outperforming the previous state-of-the-art by 12.7 points. Code is available
at: https://github.com/nbasyl/LLM-FP4.
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best costefficiency for synchronous deployments, and for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z) - HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization [10.307268005739202]
Diffusion Transformers (DiTs) have recently gained substantial attention for their superior visual generation capabilities.
DiTs also come with high parameter counts and implementation costs, seriously restricting their use on resource-limited devices such as mobile phones.
We introduce the Hybrid Floating-point Quantization for DiT(HQ-DiT), an efficient post-training quantization method that utilizes 4-bit floating-point (FP) precision on both weights and activations for DiT inference.
arXiv Detail & Related papers (2024-05-30T06:56:11Z) - Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization [62.15918574997175]
It is known that language models contain outlier channels whose values on average are orders of magnitude higher than other channels.
We propose a strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization.
We show that regularizing both the inputs and outputs is crucial for preventing a model's "migrating" the difficulty in input quantization to the weights.
arXiv Detail & Related papers (2024-04-04T17:25:30Z) - AffineQuant: Affine Transformation Quantization for Large Language Models [58.45460102764]
Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its compression efficiency and cost-effectiveness in the context of training.
Existing PTQ methods for Large-scale Language Models (LLMs) limit the optimization scope to scaling transformations between pre- and post-quantization weights.
In this paper, we advocate for the direct optimization using equivalent Affine transformations in PTQ (AffineQuant)
arXiv Detail & Related papers (2024-03-19T08:40:21Z) - ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric
Strategy for Diverse Generative Tasks [31.431016659268206]
This study examines 4-bit quantization methods like GPTQ in large language models (LLMs)
We extend task scope to more generative categories such as code generation and abstractive summarization.
We propose a novel 4+2 design for FP6 to achieve similar latency to the state-of-the-art INT4 fine-grain quantization.
arXiv Detail & Related papers (2023-12-14T01:06:37Z) - FPTQ: Fine-grained Post-Training Quantization for Large Language Models [28.11564378745513]
We propose a novel W4A8 post-training quantization method for the available open-sourced LLMs.
We obtain the state-of-the-art W4A8 quantized performance on BLOOM, LLaMA, and LLaMA-2 on standard benchmarks.
arXiv Detail & Related papers (2023-08-30T12:18:18Z) - OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLM.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z) - ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization
Using Floating-Point Formats [25.543571445739936]
This study explores the viability of floating-point (FP) quantization for large language models (LLMs)
For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models possessing parameters beyond one billion.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
arXiv Detail & Related papers (2023-07-19T06:58:03Z) - Integer or Floating Point? New Outlooks for Low-Bit Quantization on
Large Language Models [17.055400141733124]
Low-bit integer formats (e.g., INT8/INT4) have been the conventional choice for large language models (LLMs)
Low-bit floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU.
We propose the Mixture of Formats Quantization (MoFQ), which selects the optimal format on a layer-wise basis.
arXiv Detail & Related papers (2023-05-21T05:28:37Z) - Distribution-Flexible Subset Quantization for Post-Quantizing
Super-Resolution Networks [68.83451203841624]
This paper introduces Distribution-Flexible Subset Quantization (DFSQ), a post-training quantization method for super-resolution networks.
DFSQ conducts channel-wise normalization of the activations and applies distribution-flexible subset quantization (SQ)
It achieves comparable performance to full-precision counterparts on 6- and 8-bit quantization, and incurs only a 0.1 dB PSNR drop on 4-bit quantization.
arXiv Detail & Related papers (2023-05-10T04:19:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.