ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training
Quantization Framework for W8A8 Transformers
- URL: http://arxiv.org/abs/2310.17723v1
- Date: Thu, 26 Oct 2023 18:34:41 GMT
- Title: ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training
Quantization Framework for W8A8 Transformers
- Authors: Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton
Zheng, Yuxiong He
- Abstract summary: Quantization techniques are pivotal in reducing the memory and computational demands of deep neural network inference.
Existing solutions, such as ZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook crucial memory-bounded operators and the complexities of per-token quantization.
We present a novel, fully hardware-enhanced robust optimized post-training W8A8 quantization framework, ZeroQuant-HERO.
- Score: 38.03919998600518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization techniques are pivotal in reducing the memory and computational
demands of deep neural network inference. Existing solutions, such as
ZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook
crucial memory-bounded operators and the complexities of per-token
quantization. Addressing these gaps, we present a novel, fully
hardware-enhanced robust optimized post-training W8A8 quantization framework,
ZeroQuant-HERO. This framework uniquely integrates both memory bandwidth and
compute-intensive operators, aiming for optimal hardware performance.
Additionally, it offers flexibility by allowing specific INT8 modules to switch
to FP16/BF16 mode, enhancing accuracy.
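To make the W8A8 setting concrete, below is a minimal sketch of symmetric per-token INT8 activation quantization combined with per-channel INT8 weight quantization and an INT32-accumulated matmul. This is an illustration of the general technique only, written in NumPy with hypothetical helper names; it is not ZeroQuant-HERO's actual kernels or API.

```python
import numpy as np

def quantize_per_token(x, n_bits=8):
    """Symmetric per-token quantization: one scale per row (token)."""
    qmax = 2 ** (n_bits - 1) - 1                        # 127 for INT8
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-8)                     # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def quantize_per_channel(w, n_bits=8):
    """Symmetric per-output-channel quantization of a weight matrix."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-8)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def w8a8_matmul(x_fp, w_fp):
    """INT8 x INT8 matmul with INT32 accumulation, then dequantization."""
    xq, xs = quantize_per_token(x_fp)      # activations: per-token scales
    wq, ws = quantize_per_channel(w_fp)    # weights: per-column scales
    acc = xq.astype(np.int32) @ wq.astype(np.int32)
    return acc.astype(np.float32) * xs * ws  # rescale back to floating point

# Example: a (tokens x hidden) activation block and a (hidden x out) weight.
x = np.random.randn(4, 16).astype(np.float32)
w = np.random.randn(16, 8).astype(np.float32)
print(np.max(np.abs(w8a8_matmul(x, w) - x @ w)))  # small quantization error
```

Note that the per-token scales must be recomputed from the activations at run time, which is one reason per-token quantization interacts with the memory-bound stages of the pipeline.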
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, as well as for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
- ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models [9.444063879246242]
We introduce a novel arbitrary-bit quantization algorithm and inference framework, ABQ-LLM.
It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU.
arXiv Detail & Related papers (2024-08-16T06:39:08Z)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for executing ultra low-precision convolutional neural networks on SIMD hardware (a toy sketch of the lookup-table idea appears after this list).
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization [0.0]
Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) are deployed on a state-of-the-art MicroController Unit (MCU).
We propose an optimized software pipeline interleaving parallel computation of LSTM or GRU recurrent blocks with manually-managed memory transfers.
Experiments are conducted on multiple LSTM and GRU based SE models trained on the Valentini dataset, featuring up to 1.24M parameters.
arXiv Detail & Related papers (2022-10-14T10:32:05Z)
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers [29.566132632781848]
We present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed ZeroQuant.
ZeroQuant is an end-to-end quantization and inference pipeline with three main components.
arXiv Detail & Related papers (2022-06-04T00:28:21Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states (a block-wise quantization sketch appears after this list).
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- Leveraging Automated Mixed-Low-Precision Quantization for tiny edge microcontrollers [76.30674794049293]
This paper presents an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices.
Specifically, a Reinforcement Learning agent searches for the best uniform quantization levels, among 2, 4, 8 bits, of individual weight and activation tensors.
Given an MCU-class memory bound of 2MB for weight-only quantization, the compressed models produced by the mixed-precision engine are as accurate as state-of-the-art solutions.
arXiv Detail & Related papers (2020-08-12T06:09:58Z)
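As referenced in the DeepGEMM entry above, lookup tables can replace arithmetic in ultra low-precision kernels. The toy sketch below shows the general idea for a dot product with 2-bit weight codes: partial sums for every combination of four weight codes are tabulated and then fetched with a single lookup per group. The codebook, group size, and function names are illustrative assumptions, not the paper's actual SIMD implementation.

```python
import numpy as np

# Codebook for 2-bit weights: code 0..3 -> representative value (assumed here).
CODEBOOK = np.array([-1.0, -0.33, 0.33, 1.0], dtype=np.float32)

def lut_dot(weight_codes, activations):
    """Dot product of 2-bit coded weights with activations via lookup tables.

    weight_codes: int array of codes in {0,1,2,3}, shape (n,)
    activations:  float array, shape (n,), with n divisible by 4
    """
    n = len(weight_codes)
    assert n % 4 == 0 and len(activations) == n
    total = 0.0
    for g in range(n // 4):                  # process groups of four weights
        a = activations[4 * g: 4 * g + 4]
        # Tabulate partial sums for all 4^4 = 256 possible weight-code
        # combinations in this group; a real kernel reuses this table across
        # many weight rows that share the same activations.
        table = np.zeros(256, dtype=np.float32)
        for idx in range(256):
            codes = [(idx >> (2 * k)) & 0b11 for k in range(4)]
            table[idx] = float(np.dot(CODEBOOK[codes], a))
        w = weight_codes[4 * g: 4 * g + 4]
        packed = sum(int(w[k]) << (2 * k) for k in range(4))  # 4 codes in one byte
        total += table[packed]               # one lookup replaces 4 multiply-adds
    return total

codes = np.random.randint(0, 4, size=16)
acts = np.random.randn(16).astype(np.float32)
print(lut_dot(codes, acts), float(np.dot(CODEBOOK[codes], acts)))  # should closely match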
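The 8-bit optimizer entry above relies on block-wise quantization of optimizer state. Below is a minimal sketch, assuming simple absmax scaling per fixed-size block and NumPy throughout; it illustrates block-wise quantization in general, not the paper's dynamic quantization scheme or its CUDA kernels.

```python
import numpy as np

def blockwise_quantize(state, block_size=256):
    """Quantize a 1-D optimizer-state tensor to int8, one absmax scale per block."""
    pad = (-len(state)) % block_size                    # pad to a whole number of blocks
    padded = np.concatenate([state, np.zeros(pad, dtype=state.dtype)])
    blocks = padded.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-12)                  # avoid division by zero
    q = np.clip(np.round(blocks / scales), -128, 127).astype(np.int8)
    return q, scales, len(state)

def blockwise_dequantize(q, scales, length):
    """Recover an approximate float32 state from int8 blocks and their scales."""
    return (q.astype(np.float32) * scales).reshape(-1)[:length]

# Example: quantize an Adam-style second-moment estimate between update steps.
state = np.abs(np.random.randn(1000).astype(np.float32)) * 1e-3
q, scales, n = blockwise_quantize(state)
restored = blockwise_dequantize(q, scales, n)
print(np.max(np.abs(restored - state)))  # error bounded by the per-block scale
```

Keeping one scale per small block limits how far a single outlier value can degrade the precision of the rest of the tensor, while the state itself is held in int8 between updates.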
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all information) and is not responsible for any consequences of its use.