Understanding INT4 Quantization for Transformer Models: Latency Speedup,
Composability, and Failure Cases
- URL: http://arxiv.org/abs/2301.12017v2
- Date: Tue, 30 May 2023 21:32:11 GMT
- Title: Understanding INT4 Quantization for Transformer Models: Latency Speedup,
Composability, and Failure Cases
- Authors: Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He
- Abstract summary: We show that W4A4 quantization introduces no to negligible accuracy degradation for encoder-only and encoder-decoder models, but causes a significant accuracy drop for decoder-only models.
We develop a highly optimized end-to-end W4A4 encoder inference pipeline supporting different quantization strategies.
- Score: 24.34969722921442
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Improving the deployment efficiency of transformer-based language models has
been challenging given their high computation and memory cost. While INT8
quantization has recently been shown to be effective in reducing both the
memory cost and latency while preserving model accuracy, it remains unclear
whether we can leverage INT4 (which doubles peak hardware throughput) to
achieve further latency improvement. In this study, we explore the feasibility
of employing INT4 weight and activation (W4A4) quantization for language
models. Our findings indicate that W4A4 quantization introduces no to
negligible accuracy degradation for encoder-only and encoder-decoder models,
but causes a significant accuracy drop for decoder-only models. To materialize
the performance gain using W4A4, we develop a highly optimized end-to-end W4A4
encoder inference pipeline supporting different quantization strategies. Our
INT4 pipeline is $8.5\times$ faster for latency-oriented scenarios and up to
$3\times$ for throughput-oriented scenarios compared to the inference of FP16,
and improves the SOTA BERT INT8 performance from FasterTransformer by up to
$1.7\times$. We provide insights into the failure cases when applying W4A4 to
decoder-only models, and further explore the compatibility of INT4 quantization
with other compression methods, like pruning and layer reduction.
Related papers
- HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization [10.307268005739202]
Diffusion Transformers (DiTs) have recently gained substantial attention for their superior visual generation capabilities.
DiTs also come with high parameter counts and implementation costs, seriously restricting their use on resource-limited devices such as mobile phones.
We introduce the Hybrid Floating-point Quantization for DiT(HQ-DiT), an efficient post-training quantization method that utilizes 4-bit floating-point (FP) precision on both weights and activations for DiT inference.
arXiv Detail & Related papers (2024-05-30T06:56:11Z) - PTQ4DiT: Post-training Quantization for Diffusion Transformers [52.902071948957186]
Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint.
We propose PTQ4DiT, a specifically designed PTQ method for DiTs.
We demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision while preserving comparable generation ability.
arXiv Detail & Related papers (2024-05-25T02:02:08Z) - QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [52.31791050376249]
Quantization can accelerate large language model (LLM) inference.
Existing INT4 quantization methods suffer from significant runtime overhead when dequantizing weights or partial sums.
We introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache.
QServe improves the maximum achievable serving of Llama-3-8B by 1.2x on A100, 1.4x on L40S; and Qwen-721.5B by 2.4x on A100, 3.5x on L40S.
arXiv Detail & Related papers (2024-05-07T17:59:30Z) - ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric
Strategy for Diverse Generative Tasks [31.431016659268206]
This study examines 4-bit quantization methods like GPTQ in large language models (LLMs)
We extend task scope to more generative categories such as code generation and abstractive summarization.
We propose a novel 4+2 design for FP6 to achieve similar latency to the state-of-the-art INT4 fine-grain quantization.
arXiv Detail & Related papers (2023-12-14T01:06:37Z) - QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language
Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z) - Accelerating Inference and Language Model Fusion of Recurrent Neural
Network Transducers via End-to-End 4-bit Quantization [35.198615417316056]
We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T)
We use a 4 bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model.
We show that customized quantization schemes that are tailored to the local properties of the network are essential to achieve good performance.
arXiv Detail & Related papers (2022-06-16T02:17:49Z) - ZeroQuant: Efficient and Affordable Post-Training Quantization for
Large-Scale Transformers [29.566132632781848]
We present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed as ZeroQuant.
ZeroQuant is an end-to-end quantization and inference pipeline with three main components.
arXiv Detail & Related papers (2022-06-04T00:28:21Z) - Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech
Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z) - HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z) - Towards Unified INT8 Training for Convolutional Neural Network [83.15673050981624]
We build a unified 8-bit (INT8) training framework for common convolutional neural networks.
First, we empirically find the four distinctive characteristics of gradients, which provide us insightful clues for gradient quantization.
We propose two universal techniques, including Direction Sensitive Gradient Clipping that reduces the direction deviation of gradients.
arXiv Detail & Related papers (2019-12-29T08:37:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.