QQQ: Quality Quattuor-Bit Quantization for Large Language Models
- URL: http://arxiv.org/abs/2406.09904v3
- Date: Wed, 31 Jul 2024 07:35:42 GMT
- Title: QQQ: Quality Quattuor-Bit Quantization for Large Language Models
- Authors: Ying Zhang, Peng Zhang, Mincong Huang, Jingyang Xiang, Yujie Wang, Chao Wang, Yineng Zhang, Lei Yu, Chuan Liu, Wei Lin,
- Abstract summary: We present QQQ, a Quality Quattuor-bit Quantization method with 4-bit weights and 8-bit activations.
QQQ employs adaptive smoothing and Hessian-based compensation, significantly enhancing the performance of quantized models without extensive training.
Our specialized per-channel W4A8 GEMM and per-group W4A8 GEMM achieve impressive speed increases of 3.67$times$ and 3.29 $times$ over FP16 GEMM.
- Score: 22.61858069040346
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantization is a proven effective method for compressing large language models. Although popular techniques like W8A8 and W4A16 effectively maintain model performance, they often fail to concurrently speed up the prefill and decoding stages of inference. W4A8 is a promising strategy to accelerate both of them while usually leads to a significant performance degradation. To address these issues, we present QQQ, a Quality Quattuor-bit Quantization method with 4-bit weights and 8-bit activations. QQQ employs adaptive smoothing and Hessian-based compensation, significantly enhancing the performance of quantized models without extensive training. Furthermore, we meticulously engineer W4A8 GEMM kernels to increase inference speed. Our specialized per-channel W4A8 GEMM and per-group W4A8 GEMM achieve impressive speed increases of 3.67$\times$ and 3.29 $\times$ over FP16 GEMM. Our extensive experiments show that QQQ achieves performance on par with existing state-of-the-art LLM quantization methods while significantly accelerating inference, achieving speed boosts up to 2.24 $\times$, 2.10$\times$, and 1.25$\times$ compared to FP16, W8A8, and W4A16, respectively.
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best costefficiency for synchronous deployments, and for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batchsizes up to 16-32 can be supported with close to maximum ($4times$) quantization speedup.
MarLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models [9.444063879246242]
We introduce a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM.
It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU.
arXiv Detail & Related papers (2024-08-16T06:39:08Z) - AffineQuant: Affine Transformation Quantization for Large Language Models [58.45460102764]
Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its compression efficiency and cost-effectiveness in the context of training.
Existing PTQ methods for Large-scale Language Models (LLMs) limit the optimization scope to scaling transformations between pre- and post-quantization weights.
In this paper, we advocate for the direct optimization using equivalent Affine transformations in PTQ (AffineQuant)
arXiv Detail & Related papers (2024-03-19T08:40:21Z) - A Speed Odyssey for Deployable Quantization of LLMs [19.12232212257625]
We introduce a hardware-centric approach in the construction of quantization algorithms.
Our method, OdysseyLLM, comes with a novel W4A8 kernel implementation called FastGEMM and a combined recipe of quantization strategies.
Experiments manifest the superiority of our W4A8 method which brings the actual speed boosting up to textbf4$times$ compared to Hugging Face FP16 and textbf2.23$times$ vs. the state-of-art inference engine.
arXiv Detail & Related papers (2023-11-16T04:11:19Z) - QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language
Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z) - Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM [6.85331857224501]
Large Language Models (LLMs) pose significant hardware challenges related to memory requirements and computational ability.
There are two mainstream quantization schemes for LLMs: coarse-grained ($textite.g.,$ channel-wise) quantization and fine-grained ($textite.g.,$ group-wise) quantization.
We introduce Dual Grained Quantization (DGQ), a novel A8W4 quantization for LLM that maintains superior performance while ensuring fast inference speed.
arXiv Detail & Related papers (2023-10-07T14:50:28Z) - FPTQ: Fine-grained Post-Training Quantization for Large Language Models [28.11564378745513]
We propose a novel W4A8 post-training quantization method for the available open-sourced LLMs.
We obtain the state-of-the-art W4A8 quantized performance on BLOOM, LLaMA, and LLaMA-2 on standard benchmarks.
arXiv Detail & Related papers (2023-08-30T12:18:18Z) - OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLM.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Statefuls maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop first gradients that use 8-bit statistics while maintaining the performance levels of using 32-bit gradient states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.