ZeroQuant: Efficient and Affordable Post-Training Quantization for
Large-Scale Transformers
- URL: http://arxiv.org/abs/2206.01861v1
- Date: Sat, 4 Jun 2022 00:28:21 GMT
- Title: ZeroQuant: Efficient and Affordable Post-Training Quantization for
Large-Scale Transformers
- Authors: Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong
Li, Yuxiong He
- Abstract summary: We present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed ZeroQuant.
ZeroQuant is an end-to-end quantization and inference pipeline with three main components.
- Score: 29.566132632781848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How to efficiently serve ever-larger trained natural language models in
practice has become exceptionally challenging even for powerful cloud servers
due to their prohibitive memory/computation requirements. In this work, we
present an efficient and affordable post-training quantization approach to
compress large Transformer-based models, termed ZeroQuant. ZeroQuant is an
end-to-end quantization and inference pipeline with three main components: (1)
a fine-grained hardware-friendly quantization scheme for both weights and
activations; (2) a novel affordable layer-by-layer knowledge distillation
algorithm (LKD) even without access to the original training data; (3) a
highly-optimized quantization system backend support to remove the
quantization/dequantization overhead. As such, we are able to show that: (1)
ZeroQuant can reduce the precision for weights and activations to INT8 in a
cost-free way for both BERT and GPT-3-style models with minimal accuracy impact,
which leads to up to 5.19x/4.16x speedup on those models compared to FP16
inference; (2) ZeroQuant plus LKD affordably quantize the weights in the
fully-connected module to INT4 along with INT8 weights in the attention module
and INT8 activations, resulting in 3x memory footprint reduction compared to
the FP16 model; (3) ZeroQuant can be directly applied to two of the largest
open-sourced language models, including GPT-J6B and GPT-NeoX20B, for which our
INT8 model achieves similar accuracy as the FP16 model but achieves up to 5.2x
better efficiency.
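In the paper, the fine-grained scheme of component (1) corresponds to group-wise quantization for weights and token-wise (per-row) dynamic quantization for activations. The PyTorch sketch below is a minimal illustration of both ideas under simple symmetric rounding; the group size, helper names, and clamping are illustrative assumptions, not the paper's fused INT8 kernels.

```python
import torch

def quantize_weight_groupwise(w: torch.Tensor, group_size: int = 128, n_bits: int = 8):
    """Symmetric group-wise weight quantization (illustrative sketch).

    Each contiguous group of `group_size` values along the input dimension
    gets its own scale; this is the "fine-grained" part of the scheme.
    """
    qmax = 2 ** (n_bits - 1) - 1                       # 127 for INT8
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, -1, group_size)
    scale = w_groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w_groups / scale), -qmax - 1, qmax).to(torch.int8)
    return q.reshape(out_features, in_features), scale

def quantize_activation_tokenwise(x: torch.Tensor, n_bits: int = 8):
    """Symmetric token-wise (per-row) dynamic activation quantization (sketch)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax  # one scale per token
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

# Toy usage: quantize a linear layer's weight and a batch of 8 token activations.
qw, w_scale = quantize_weight_groupwise(torch.randn(1024, 1024))
qx, x_scale = quantize_activation_tokenwise(torch.randn(8, 1024))
```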
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
- ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers [38.03919998600518]
Quantization techniques are pivotal in reducing the memory and computational demands of deep neural network inference.
Existing solutions, such as ZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook crucial memory-bounded operators and the complexities of per-token quantization.
We present a novel, fully hardware-enhanced robust optimized post-training W8A8 quantization framework, ZeroQuant-HERO.
arXiv Detail & Related papers (2023-10-26T18:34:41Z)
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers [38.23587031169402]
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values.
Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions.
Our method, for the first time, can quantize both weights and activations in the LLaMA-13B to only 4-bit and achieves an average score of 63.1.
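To illustrate how a floating-point grid differs from an integer one, the sketch below rounds per-channel scaled values to a small, non-uniformly spaced set of magnitudes resembling an E2M1-style FP4 format. The grid values, per-channel scaling, and function name are assumptions for illustration only, not LLM-FP4's actual format search or exponent-bias method.

```python
import torch

# One possible E2M1-style FP4 magnitude grid (sign handled separately):
# the non-uniform spacing is what lets FP formats track bell-shaped /
# long-tail weight distributions better than evenly spaced integers.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x: torch.Tensor, grid: torch.Tensor = FP4_GRID) -> torch.Tensor:
    """Round |x| / scale to the nearest FP4 grid point, per output channel (sketch)."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / grid.max()
    scaled = (x / scale).abs()
    # Nearest-neighbour lookup into the FP4 grid.
    idx = torch.argmin((scaled.unsqueeze(-1) - grid).abs(), dim=-1)
    return grid[idx] * torch.sign(x) * scale

w = torch.randn(16, 64)
w_fp4 = quantize_fp4(w)   # simulated ("fake") FP4 weights kept in FP32 storage
```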
arXiv Detail & Related papers (2023-10-25T17:59:32Z)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
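As a toy illustration of the hybrid idea, the sketch below keeps a handful of large-magnitude columns in FP16 and quantizes the remaining columns to INT4. The outlier-selection rule, the column split, and the symmetric quantizer are assumptions for illustration, not QUIK's actual algorithm or GPU kernels.

```python
import torch

def hybrid_int4_split(w: torch.Tensor, num_outlier_cols: int = 8, qmax: int = 7):
    """Toy hybrid scheme: INT4 for most columns, FP16 for a few outlier columns (sketch)."""
    # Pick the columns with the largest magnitude as "outliers" to keep in FP16.
    col_norm = w.abs().amax(dim=0)
    outlier_idx = torch.topk(col_norm, num_outlier_cols).indices
    keep_mask = torch.zeros(w.shape[1], dtype=torch.bool)
    keep_mask[outlier_idx] = True

    dense = w[:, ~keep_mask]                      # columns quantized to INT4
    scale = dense.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(dense / scale), -qmax - 1, qmax).to(torch.int8)

    outliers = w[:, keep_mask].to(torch.float16)  # small FP16 remainder
    return q, scale, outliers, keep_mask

q, scale, outliers, mask = hybrid_int4_split(torch.randn(256, 1024))
```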
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization (OmniQuant) technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats [25.543571445739936]
This study explores the viability of floating-point (FP) quantization for large language models (LLMs).
For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models possessing parameters beyond one billion.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
arXiv Detail & Related papers (2023-07-19T06:58:03Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
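A minimal sketch of the Dense-and-Sparse idea appears below: the largest-magnitude entries are pulled out into a sparse FP16 matrix and the well-behaved remainder is quantized to low precision. The percentile threshold and the plain uniform quantizer are simplifying assumptions; SqueezeLLM itself uses sensitivity-based non-uniform codebooks.

```python
import torch

def dense_and_sparse(w: torch.Tensor, outlier_frac: float = 0.005, n_bits: int = 3):
    """Split W into a low-bit dense part plus a sparse FP16 outlier part (sketch)."""
    thresh = torch.quantile(w.abs().flatten(), 1.0 - outlier_frac)
    outlier_mask = w.abs() > thresh

    sparse = (w * outlier_mask).to(torch.float16).to_sparse()  # few large entries
    dense = w * ~outlier_mask                                   # well-behaved remainder

    qmax = 2 ** (n_bits - 1) - 1
    scale = dense.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(dense / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale, sparse

q, scale, sparse = dense_and_sparse(torch.randn(512, 512))
# Sanity check: reconstruct as dequantized dense part plus the sparse outliers.
w_hat = q.float() * scale + sparse.to_dense().float()
```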
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
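The block-wise idea can be sketched as follows: the optimizer state is chunked into fixed-size blocks, each normalized by its own absolute maximum before being stored in 8 bits, so a single outlier only degrades its own block. The block size and the simple linear quantizer below are simplifying assumptions; the paper layers dynamic quantization on top of this.

```python
import torch

BLOCK = 2048  # one scale per block keeps outliers from distorting the whole tensor

def quantize_state_blockwise(state: torch.Tensor):
    """Quantize a flat optimizer-state tensor to int8, block by block (sketch)."""
    flat = state.flatten()
    pad = (-flat.numel()) % BLOCK
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.reshape(-1, BLOCK)
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127
    q = torch.clamp(torch.round(blocks / scale), -128, 127).to(torch.int8)
    return q, scale, state.shape, pad

def dequantize_state_blockwise(q, scale, shape, pad):
    flat = (q.float() * scale).flatten()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

state = torch.randn(1000, 300)          # e.g., Adam's first-moment estimate
q, scale, shape, pad = quantize_state_blockwise(state)
restored = dequantize_state_blockwise(q, scale, shape, pad)
```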
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
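A minimal sketch of the quantization-noise idea, assuming a simple INT8 fake-quantizer: on each forward pass only a random subset of weights is quantized, and the straight-through estimator lets gradients pass through unchanged. The 50% noise rate and the quantizer choice are illustrative; the paper applies the same trick to far more aggressive schemes such as product quantization.

```python
import torch

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Simulate int8 quantization in floating point (quantize then dequantize)."""
    scale = w.abs().amax().clamp(min=1e-8) / 127
    return torch.round(w / scale).clamp(-128, 127) * scale

def quant_noise(w: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Quantize a random fraction p of the weights; straight-through gradients."""
    mask = torch.rand_like(w) < p
    w_q = torch.where(mask, fake_quant_int8(w), w)
    # Straight-through estimator: forward uses w_q, backward sees the identity.
    return w + (w_q - w).detach()

# Toy training step with quantization noise applied to a linear layer's weight.
w = torch.randn(64, 64, requires_grad=True)
x = torch.randn(8, 64)
out = x @ quant_noise(w).t()
out.sum().backward()          # gradients flow to all of w via the STE
```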
arXiv Detail & Related papers (2020-04-15T20:10:53Z)