FP8-BERT: Post-Training Quantization for Transformer
- URL: http://arxiv.org/abs/2312.05725v2
- Date: Tue, 12 Dec 2023 05:21:40 GMT
- Title: FP8-BERT: Post-Training Quantization for Transformer
- Authors: Jianwei Li, Tianchi Zhang, Ian En-Hsu Yen, Dongkuan Xu
- Abstract summary: Transformer-based models, such as BERT, require massive memory storage and incur high inference cost when deployed in production.
A new numeric format, FP8, has been proposed and is supported on commercial AI computing platforms such as the H100.
We empirically validate the effectiveness of FP8 as a way to do Post-Training Quantization without significant loss of accuracy.
- Score: 20.51143486483669
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformer-based models, such as BERT, have been widely applied to a broad
range of natural language processing tasks. However, an inevitable side effect
is that they incur massive memory storage and inference costs when deployed in
production. Quantization is one of the most popular ways to alleviate this cost.
However, previous 8-bit quantization strategies based on the INT8 data format
either suffer from accuracy degradation when applied in a Post-Training Quantization
(PTQ) fashion or require an expensive Quantization-Aware Training (QAT) process.
Recently, a new numeric format, FP8 (i.e., 8-bit floating point), has been proposed
and is supported on commercial AI computing platforms such as the H100.
In this paper, we empirically validate the effectiveness of FP8 as a way to do
Post-Training Quantization without significant loss of accuracy, with a simple
calibration and format conversion process. We adopt the FP8 standard proposed
by NVIDIA Corp. (2022) in our extensive experiments with BERT variants on the GLUE
and SQuAD v1.1 datasets, and show that PTQ with FP8 significantly improves accuracy
over INT8, to the point of matching the full-precision model.
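As a rough illustration of the calibration-and-conversion workflow described above, the following sketch simulates per-tensor FP8 (E4M3) post-training quantization in NumPy. The function names and the use of a single calibrated absolute maximum per tensor are illustrative assumptions, not the paper's released code.
```python
import numpy as np

E4M3_MAX = 448.0  # largest normal value in the NVIDIA FP8 E4M3 format

def fake_quant_e4m3(x: np.ndarray) -> np.ndarray:
    """Round a float tensor onto the E4M3 grid (illustrative 'fake quantization')."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)      # saturate to the representable range
    _, exp = np.frexp(x)                     # x = mant * 2**exp with 0.5 <= |mant| < 1
    exp = np.maximum(exp, -5)                # below 2**-6 the grid is subnormal (step 2**-9)
    step = 2.0 ** (exp - 4)                  # 3 mantissa bits -> 8 steps per binade
    return np.round(x / step) * step

def ptq_fp8(tensor: np.ndarray, calib_amax: float) -> np.ndarray:
    """Per-tensor FP8 PTQ: map the calibrated amax onto E4M3_MAX, quantize, rescale."""
    scale = E4M3_MAX / max(calib_amax, 1e-12)
    return fake_quant_e4m3(tensor * scale) / scale
```
Here `calib_amax` would come from running a small calibration set through the FP32 model and recording each tensor's absolute maximum, matching the simple calibration step mentioned in the abstract.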
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, as well as for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
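For context on the W4A16 format family mentioned above, here is a minimal NumPy sketch of group-wise 4-bit weight-only quantization with 16-bit activations; the group size, symmetric rounding, and function names are illustrative assumptions, not the evaluated implementations.
```python
import numpy as np

def quant_w4_groupwise(w: np.ndarray, group: int = 128):
    """Quantize a (out_features, in_features) weight matrix to 4-bit ints, one scale per group.
    Assumes in_features is a multiple of `group`."""
    rows, cols = w.shape
    wg = w.reshape(rows, cols // group, group)
    scale = np.abs(wg).max(axis=-1, keepdims=True) / 7.0 + 1e-12  # symmetric int4 range [-7, 7]
    q = np.clip(np.round(wg / scale), -7, 7).astype(np.int8)
    return q, scale

def w4a16_matmul(x16: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Activations stay in 16-bit; 4-bit weights are dequantized (here eagerly) for the matmul."""
    w_hat = (q * scale).astype(np.float16).reshape(q.shape[0], -1)
    return x16.astype(np.float16) @ w_hat.T
```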
- Test-Time Model Adaptation with Only Forward Passes [68.11784295706995]
Test-time adaptation has proven effective in adapting a given trained model to unseen test samples with potential distribution shifts.
We propose a test-time Forward-Optimization Adaptation (FOA) method.
FOA runs on quantized 8-bit ViT, outperforms gradient-based TENT on full-precision 32-bit ViT, and achieves an up to 24-fold memory reduction on ImageNet-C.
arXiv Detail & Related papers (2024-04-02T05:34:33Z)
- FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models.
Experimental results show that, when training a GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z)
- Training and inference of large language models using 8-bit floating point [3.689110902209004]
This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations.
We apply this methodology to train and validate large language models of the type of GPT and Llama 2 using FP8, for model sizes ranging from 111M to 70B.
arXiv Detail & Related papers (2023-09-29T13:24:33Z)
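A minimal sketch of the dynamic per-tensor scaling idea summarised above, assuming delayed scaling derived from a short history of absolute maxima and the E4M3 maximum of 448; the class and variable names are hypothetical, not the paper's code.
```python
import numpy as np

E4M3_MAX = 448.0

class DynamicScale:
    """Tracks a short history of a tensor's absolute maximum and derives its FP8 scale."""
    def __init__(self, history_len: int = 16):
        self.history: list[float] = []
        self.history_len = history_len

    def update(self, tensor: np.ndarray) -> float:
        self.history.append(float(np.abs(tensor).max()) + 1e-12)
        self.history = self.history[-self.history_len:]
        # Map the largest recently observed magnitude onto the FP8 maximum.
        return E4M3_MAX / max(self.history)

# One scale tracker per tensor role of each linear layer (weights, activations, gradients).
w_scale, x_scale, g_scale = DynamicScale(), DynamicScale(), DynamicScale()

x = np.random.randn(8, 1024)
x_scaled = np.clip(x * x_scale.update(x), -E4M3_MAX, E4M3_MAX)  # would then be cast to FP8
```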
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for executing ultra-low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
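A generic sketch of the lookup-table idea, not DeepGEMM's SIMD kernels: with 2-bit operands there are only 16 possible products, so a dot product reduces to table lookups and additions. The codebooks below are made up for illustration.
```python
import numpy as np

W_LEVELS = np.array([-1.5, -0.5, 0.5, 1.5])   # hypothetical 2-bit weight codebook
A_LEVELS = np.array([0.0, 1.0, 2.0, 3.0])     # hypothetical 2-bit activation codebook
LUT = np.outer(W_LEVELS, A_LEVELS)            # all 4 x 4 = 16 products, precomputed once

def lut_dot(w_codes: np.ndarray, a_codes: np.ndarray) -> float:
    """Dot product of 2-bit-coded vectors: table lookups plus a sum, no multiplies."""
    return float(LUT[w_codes, a_codes].sum())

print(lut_dot(np.array([0, 3, 2, 1]), np.array([1, 2, 0, 3])))  # example 4-element dot product
```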
- Unit Scaling: Out-of-the-Box Low-Precision Training [1.7188280334580197]
Unit scaling is a paradigm for designing deep learning models that simplifies the use of low-precision number formats.
Training in FP16 or the recently proposed FP8 formats offers substantial efficiency gains, but can lack sufficient range for out-of-the-box training.
Unit scaling addresses this by introducing a principled approach to model numerics: seeking unit variance of all weights, activations and gradients at initialisation.
arXiv Detail & Related papers (2023-03-20T16:42:25Z)
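A simplified sketch of the unit-scaling principle described above, showing only the forward pass: weights are drawn with unit variance and an explicit 1/sqrt(fan_in) factor is attached to the matmul so activations keep roughly unit variance at initialisation. The separate backward-pass scaling used by the full method is omitted, and the dimensions are arbitrary.
```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out, batch = 1024, 4096, 32

x = rng.standard_normal((batch, fan_in))      # unit-variance activations
W = rng.standard_normal((fan_in, fan_out))    # unit-variance weights (no fan-in shrinkage)
y = (x @ W) / np.sqrt(fan_in)                 # explicit scale keeps Var[y] close to 1

print(round(x.var(), 2), round(W.var(), 2), round(y.var(), 2))  # all ~1.0: a safe range for FP8/FP16
```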
- FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
arXiv Detail & Related papers (2022-09-12T17:39:55Z)
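To make the E4M3 trade-off above concrete, the sketch below decodes an E4M3 byte under the proposed convention (exponent bias 7, no infinities, a single NaN mantissa pattern), which is why the maximum representable magnitude reaches 448.
```python
def decode_e4m3(byte: int) -> float:
    """Decode one E4M3 byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:            # the only NaN mantissa pattern; no infinities exist
        return float("nan")
    if exp == 0:                              # subnormals: mant/8 * 2**-6
        return sign * (mant / 8.0) * 2.0 ** -6
    return sign * (1 + mant / 8.0) * 2.0 ** (exp - 7)

print(decode_e4m3(0x7E))  # 448.0 -- the E4M3 maximum, possible because 1111.111 is reserved for NaN
print(decode_e4m3(0x7F))  # nan
```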
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
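A generic sketch of 8-bit integer matrix multiplication with absmax scaling, in the spirit of the entry above; the paper's mixed-precision decomposition that keeps outlier feature dimensions in 16-bit is omitted for brevity, and the function names are illustrative.
```python
import numpy as np

def absmax_quant(x: np.ndarray, axis: int):
    """Symmetric int8 quantization with one absmax-derived scale along the given axis."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def int8_matmul(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """x: (batch, in), w: (out, in). Quantize, accumulate in int32, then dequantize."""
    xq, xs = absmax_quant(x, axis=1)                      # one scale per input row
    wq, ws = absmax_quant(w, axis=1)                      # one scale per output channel
    acc = xq.astype(np.int32) @ wq.T.astype(np.int32)     # integer accumulation
    return acc * (xs * ws.T)                              # rescale with the outer product of scales
```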
- Learning Accurate Integer Transformer Machine-Translation Models [0.05184427980355132]
We describe a method for training accurate Transformer machine-translation models to run inference using 8-bit integer (INT8) hardware matrix multipliers.
Our approach converts all matrix-multiplication tensors from an existing FP32 model into INT8 tensors by automatically making range-precision trade-offs during training.
arXiv Detail & Related papers (2020-01-03T18:40:35Z)