ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric
Strategy for Diverse Generative Tasks
- URL: http://arxiv.org/abs/2312.08583v2
- Date: Mon, 18 Dec 2023 05:08:23 GMT
- Title: ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric
Strategy for Diverse Generative Tasks
- Authors: Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash
Bakhtiari, Michael Wyatt, Reza Yazdani Aminabadi, Yuxiong He, Olatunji
Ruwase, Leon Song, Zhewei Yao
- Abstract summary: This study examines 4-bit quantization methods like GPTQ in large language models (LLMs)
We extend task scope to more generative categories such as code generation and abstractive summarization.
We propose a novel 4+2 design for FP6 to achieve similar latency to the state-of-the-art INT4 fine-grain quantization.
- Score: 31.431016659268206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study examines 4-bit quantization methods like GPTQ in large language
models (LLMs), highlighting GPTQ's overfitting and limited enhancement in
Zero-Shot tasks. While prior works merely focusing on zero-shot measurement, we
extend task scope to more generative categories such as code generation and
abstractive summarization, in which we found that INT4 quantization can
significantly underperform. However, simply shifting to higher precision
formats like FP6 has been particularly challenging, thus overlooked, due to
poor performance caused by the lack of sophisticated integration and system
acceleration strategies on current AI hardware. Our results show that FP6, even
with a coarse-grain quantization scheme, performs robustly across various
algorithms and tasks, demonstrating its superiority in accuracy and
versatility. Notably, with the FP6 quantization, \codestar-15B model performs
comparably to its FP16 counterpart in code generation, and for smaller models
like the 406M it closely matches their baselines in summarization. Neither can
be achieved by INT4. To better accommodate various AI hardware and achieve the
best system performance, we propose a novel 4+2 design for FP6 to achieve
similar latency to the state-of-the-art INT4 fine-grain quantization. With our
design, FP6 can become a promising solution to the current 4-bit quantization
methods used in LLMs.
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best costefficiency for synchronous deployments, and for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z) - FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric
Algorithm-System Co-Design [30.594788583458893]
Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs)
Existing systems do not provide Core support for FP6 quantization.
We propose TCFPx, the first full-stack GPU kernel design scheme with unified Core support of float-point weights for various quantization bit-width.
arXiv Detail & Related papers (2024-01-25T11:46:38Z) - Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization [12.655230451207956]
This paper focuses on post-training quantization (PTQ) in Large Language Models (LLMs)
We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC)
We demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models.
arXiv Detail & Related papers (2023-11-09T06:19:51Z) - LLM-FP4: 4-Bit Floating-Point Quantized Transformers [38.23587031169402]
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values.
Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions.
Our method, for the first time, can quantize both weights and activations in the LLaMA-13B to only 4-bit and achieves an average score of 63.1.
arXiv Detail & Related papers (2023-10-25T17:59:32Z) - QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language
Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z) - OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLM.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z) - ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization
Using Floating-Point Formats [25.543571445739936]
This study explores the viability of floating-point (FP) quantization for large language models (LLMs)
For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models possessing parameters beyond one billion.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
arXiv Detail & Related papers (2023-07-19T06:58:03Z) - Integer or Floating Point? New Outlooks for Low-Bit Quantization on
Large Language Models [17.055400141733124]
Low-bit integer formats (e.g., INT8/INT4) have been the conventional choice for large language models (LLMs)
Low-bit floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU.
We propose the Mixture of Formats Quantization (MoFQ), which selects the optimal format on a layer-wise basis.
arXiv Detail & Related papers (2023-05-21T05:28:37Z) - VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit
Vision Transformer [121.85581713299918]
We propose VAQF, a framework that builds inference accelerators on FPGA platforms for quantized Vision Transformers (ViTs)
Given the model structure and the desired frame rate, VAQF will automatically output the required quantization precision for activations.
This is the first time quantization has been incorporated into ViT acceleration on FPGAs.
arXiv Detail & Related papers (2022-01-17T20:27:52Z) - HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z) - Once Quantization-Aware Training: High Performance Extremely Low-bit
Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of the two sides.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
Then a bit-inheritance scheme is introduced to transfer the quantized models to the lower bit, which further reduces the time cost and improves the quantization accuracy.
arXiv Detail & Related papers (2020-10-09T03:52:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.