Efficient Post-training Quantization with FP8 Formats
- URL: http://arxiv.org/abs/2309.14592v2
- Date: Sun, 31 Mar 2024 23:05:53 GMT
- Title: Efficient Post-training Quantization with FP8 Formats
- Authors: Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang
- Abstract summary: We study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures.
E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks.
- Score: 14.543387418837154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available on Intel Neural Compressor: https://github.com/intel/neural-compressor.
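To make the range-versus-precision trade-off concrete, here is a minimal NumPy sketch that simulates round-to-nearest quantization onto a generic FP8 grid. It is an illustration only, not the paper's Intel Neural Compressor workflow; the IEEE-style exponent biases and the omission of Inf/NaN encodings are simplifying assumptions.

```python
import numpy as np

def fp8_grid(exp_bits, man_bits, bias):
    """Enumerate the values of a generic 1-sign/E-exponent/M-mantissa 8-bit
    format. Simplified: Inf/NaN encodings are ignored, so the top of the
    range differs slightly from the OCP FP8 specification."""
    vals = []
    for m in range(1, 2 ** man_bits):                      # subnormals
        vals.append(m / 2 ** man_bits * 2.0 ** (1 - bias))
    for e in range(1, 2 ** exp_bits):                      # normals
        for m in range(2 ** man_bits):
            vals.append((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
    vals = np.array(vals)
    return np.concatenate([-vals[::-1], [0.0], vals])      # mirror for sign

def quantize(x, grid):
    """Round-to-nearest onto the grid, clamping to its range."""
    idx = np.clip(np.searchsorted(grid, x), 1, len(grid) - 1)
    lo, hi = grid[idx - 1], grid[idx]
    return np.where(np.abs(x - lo) < np.abs(hi - x), lo, hi)

x = np.random.randn(10_000)
for name, (e, m) in {"E5M2": (5, 2), "E4M3": (4, 3), "E3M4": (3, 4)}.items():
    g = fp8_grid(e, m, bias=2 ** (e - 1) - 1)   # assumed IEEE-style bias
    xq = quantize(x, g)
    print(f"{name}: max representable = {g[-1]:.1f}, "
          f"mean abs error = {np.abs(x - xq).mean():.5f}")
```

Running it shows E5M2 covering the widest range with the coarsest spacing and E3M4 the finest spacing near zero, mirroring the trade-off the paper studies.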
Related papers
- Mixture of Experts with Mixture of Precisions for Tuning Quality of Service [0.0]
This paper presents an adaptive serving approach for the efficient deployment of MoE models.
By dynamically determining the number of quantized experts, we offer a fine-grained range of configurations for tuning throughput and model quality.
Results highlight the practical applicability of our approach in dynamic and accuracy-sensitive applications.
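As one possible reading of "dynamically determining the number of quantized experts" (a sketch, not the paper's algorithm; all sizes below are hypothetical):

```python
def experts_to_quantize(num_experts: int, full_mb: float, quant_mb: float,
                        budget_mb: float) -> int:
    """Smallest number of experts to quantize so total expert memory fits
    the budget. Illustrative only, with hypothetical per-expert sizes."""
    for k in range(num_experts + 1):
        total = k * quant_mb + (num_experts - k) * full_mb
        if total <= budget_mb:
            return k  # quantizing k experts suffices
    raise ValueError("budget infeasible even with all experts quantized")

# e.g. 8 experts, 120 MB each in FP16, 30 MB each quantized, 600 MB budget
print(experts_to_quantize(8, 120.0, 30.0, 600.0))  # -> 4
```

Each value of k trades memory (and thus throughput) against model quality, giving the fine-grained configuration range the summary describes.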
arXiv Detail & Related papers (2024-07-19T15:42:49Z)
- FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experimental results show that, during training of the GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z)
- ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats [25.543571445739936]
This study explores the viability of floating-point (FP) quantization for large language models (LLMs).
For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models with more than one billion parameters.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
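A quick way to see why FP4 can rival INT4 for weights is to compare how the two 4-bit grids distribute their levels. A small sketch (the common E2M1 layout with bias 1 and no Inf/NaN is assumed; this is not the paper's code):

```python
# Positive levels of FP4 (E2M1: 1 sign, 2 exponent, 1 mantissa bit) versus
# a symmetric uniform INT4 quantizer.
fp4 = [0.5]                                  # subnormal: (1/2) * 2^(1 - bias)
for e in range(1, 4):                        # normal exponents 1..3
    for m in range(2):                       # 1-bit mantissa
        fp4.append((1 + m / 2) * 2.0 ** (e - 1))
print("FP4 :", fp4)                 # [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print("INT4:", list(range(1, 8)))   # uniform 1..7
```

The FP4 grid concentrates levels near zero, which tends to suit the roughly bell-shaped weight distributions of LLMs.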
arXiv Detail & Related papers (2023-07-19T06:58:03Z)
- An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We study a novel goal: merging the vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
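The two metrics are not spelled out in this summary; as a hedged sketch, weight distance is commonly measured with cosine similarity and normalized L2 distance, e.g.:

```python
import numpy as np

def weight_distance(a: dict, b: dict) -> tuple:
    """Cosine similarity and normalized L2 distance between two state dicts
    with identical keys/shapes. Illustrative stand-ins, not the paper's metrics."""
    va = np.concatenate([w.ravel() for w in a.values()])
    vb = np.concatenate([w.ravel() for w in b.values()])
    cos = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    l2 = float(np.linalg.norm(va - vb) / np.linalg.norm(va))
    return cos, l2

# toy usage with two random "models"
rng = np.random.default_rng(0)
m1 = {"w": rng.normal(size=(4, 4))}
m2 = {"w": m1["w"] + 0.1 * rng.normal(size=(4, 4))}
print(weight_distance(m1, m2))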
arXiv Detail & Related papers (2023-04-28T15:43:21Z)
- Modular Quantization-Aware Training: Increasing Accuracy by Decreasing Precision in 6D Object Pose Estimation [56.80039657816035]
Edge applications demand efficient 6D object pose estimation on resource-constrained embedded platforms.
We introduce Modular Quantization-Aware Training (MQAT), an adaptive and mixed-precision quantization-aware training strategy.
MQAT guides a systematic gradated modular quantization sequence and determines module-specific bit precisions, leading to quantized models that outperform those produced by state-of-the-art uniform and mixed-precision quantization techniques.
arXiv Detail & Related papers (2023-03-12T21:01:54Z)
- FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
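To see what that extension buys: under IEEE-style rules the all-ones exponent field is reserved for Inf/NaN, capping E4M3 (bias 7) at 1.875 x 2^7 = 240; reclaiming those codes for normal numbers, with only a single NaN mantissa pattern, raises the maximum to 1.75 x 2^8 = 448. A small check:

```python
# Max finite E4M3 value (4 exponent bits, 3 mantissa bits, bias 7).
bias, M = 7, 3

# IEEE-style: exponent field 0b1111 reserved for Inf/NaN -> top exponent is 14.
ieee_max = (1 + (2**M - 1) / 2**M) * 2.0 ** (14 - bias)   # 1.875 * 2^7

# E4M3 per this paper: no infinities, only mantissa 0b111 at exponent 0b1111
# is NaN, so exponent 0b1111 with mantissa 0b110 is the largest normal.
e4m3_max = (1 + (2**M - 2) / 2**M) * 2.0 ** (15 - bias)   # 1.75 * 2^8

print(ieee_max, e4m3_max)   # 240.0 448.0
```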
arXiv Detail & Related papers (2022-09-12T17:39:55Z)
- FxP-QNet: A Post-Training Quantizer for the Design of Mixed Low-Precision DNNs with Dynamic Fixed-Point Representation [2.4149105714758545]
We propose a novel framework referred to as the Fixed-Point Quantizer of deep neural Networks (FxP-QNet).
FxP-QNet adapts the quantization level for each data-structure of each layer based on the trade-off between the network accuracy and the low-precision requirements.
Results show that FxP-QNet-quantized AlexNet, VGG-16, and ResNet-18 reduce the overall memory requirements of their full-precision counterparts by 7.16x, 10.36x, and 6.44x with less than 0.95%, 0.95%, and 1.99% accuracy drop, respectively.
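As an illustration of dynamic fixed-point quantization (a sketch only, not FxP-QNet itself), each tensor can be given a fractional length chosen to minimize quantization error at a fixed word length:

```python
import numpy as np

def fixed_point_quantize(x: np.ndarray, word_bits: int = 8):
    """Quantize x to signed fixed-point, searching the fractional length
    that minimizes mean squared error. Sketch only, not FxP-QNet."""
    best = None
    for frac_bits in range(word_bits):
        scale = 2.0 ** frac_bits
        lo, hi = -2 ** (word_bits - 1), 2 ** (word_bits - 1) - 1
        q = np.clip(np.round(x * scale), lo, hi) / scale
        mse = float(np.mean((x - q) ** 2))
        if best is None or mse < best[0]:
            best = (mse, frac_bits, q)
    return best  # (mse, chosen fractional length, quantized tensor)

x = np.random.randn(1000) * 0.5
mse, fl, xq = fixed_point_quantize(x)
print(f"fractional bits = {fl}, mse = {mse:.2e}")
```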
arXiv Detail & Related papers (2022-03-22T23:01:43Z)
- Greedy Network Enlarging [53.319011626986004]
We propose a greedy network enlarging method based on the reallocation of computations.
By modifying the computations at different stages step by step, the enlarged network is equipped with an optimal allocation and utilization of MACs.
Applying our method to GhostNet, we achieve state-of-the-art 80.9% and 84.3% ImageNet top-1 accuracies.
arXiv Detail & Related papers (2021-07-31T08:36:30Z)
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models [56.21470608621633]
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
For evaluation, we compare the estimation accuracy and fidelity of the generated mixed models against statistical models, the roofline model, and a refined roofline model.
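For context, the roofline baseline being refined predicts a layer's execution time from whichever of compute or memory traffic dominates; a minimal version (the hardware numbers below are hypothetical):

```python
def roofline_time_ms(flops: float, bytes_moved: float,
                     peak_flops: float = 100e12,       # hypothetical 100 TFLOP/s
                     bandwidth: float = 900e9) -> float:  # hypothetical 900 GB/s
    """Roofline estimate: a layer takes as long as its slower requirement."""
    return max(flops / peak_flops, bytes_moved / bandwidth) * 1e3

# e.g. a 1 GFLOP layer moving 50 MB is memory-bound on this hardware
print(roofline_time_ms(1e9, 50e6))  # ~0.056 ms
```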
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
- Degree-Quant: Quantization-Aware Training for Graph Neural Networks [10.330195866109312]
Graph neural networks (GNNs) have demonstrated strong performance on a wide variety of tasks.
Despite their promise, there exists little research exploring methods to make them more efficient at inference time.
We propose an architecturally-agnostic method, Degree-Quant, to improve performance over existing quantization-aware training baselines.
arXiv Detail & Related papers (2020-08-11T20:53:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and accepts no responsibility for any consequences of its use.