Towards Fully 8-bit Integer Inference for the Transformer Model
- URL: http://arxiv.org/abs/2009.08034v2
- Date: Fri, 18 Sep 2020 06:12:27 GMT
- Title: Towards Fully 8-bit Integer Inference for the Transformer Model
- Authors: Ye Lin, Yanyang Li, Tengbo Liu, Tong Xiao, Tongran Liu and Jingbo Zhu
- Abstract summary: We show that after a principled modification of the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm can be derived.
Our experiments on the WMT16 En<->Ro, WMT14 En<->De and En->Fr translation tasks, as well as the WikiText-103 language modelling task, show that the fully 8-bit Transformer system achieves performance comparable to the floating-point baseline while requiring a nearly 4x smaller memory footprint.
- Score: 39.22272841663168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 8-bit integer inference, as a promising direction in reducing both the
latency and storage of deep neural networks, has made great progress recently.
On the other hand, previous systems still rely on 32-bit floating point for
certain functions in complex models (e.g., Softmax in Transformer), and make
heavy use of quantization and de-quantization. In this work, we show that after
a principled modification of the Transformer architecture, dubbed Integer
Transformer, an (almost) fully 8-bit integer inference algorithm, Scale
Propagation, can be derived. De-quantization is adopted only when necessary,
which makes the network more efficient. Our experiments on the WMT16 En<->Ro,
WMT14 En<->De and En->Fr translation tasks, as well as the WikiText-103
language modelling task, show that the fully 8-bit Transformer system achieves
performance comparable to the floating-point baseline while requiring a nearly
4x smaller memory footprint.
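To make the abstract above concrete: symmetric 8-bit quantization maps a floating-point tensor onto signed integers with a single scale, and the product of two quantized tensors carries a scale equal to the product of the input scales, so the scale can be propagated through a matrix multiplication instead of de-quantizing after every operation. The sketch below illustrates only this general idea with assumed helper names (quantize, int8_matmul); it is not the paper's Scale Propagation algorithm.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Symmetric per-tensor quantization: map float x onto signed integers."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for 8-bit
    amax = np.abs(x).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def int8_matmul(qa, sa, qb, sb):
    """Integer matmul with int32 accumulation; the output scale is simply the
    product of the input scales, so no de-quantization is needed until a
    floating-point value is actually required."""
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc, sa * sb

# Toy check: compare against the float matmul.
a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 3).astype(np.float32)
qa, sa = quantize(a)
qb, sb = quantize(b)
acc, s_out = int8_matmul(qa, sa, qb, sb)
print(np.abs(dequantize(acc, s_out) - a @ b).max())     # only quantization noise
```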
Related papers
- Integer Fine-tuning of Transformer-based Models [13.383066080742699]
We study the effect of various integer bit-widths to find the minimum required bit-width for integer fine-tuning of transformer-based models.
We show that 16-bit integer models match the floating-point baseline performance.
Further reduction of the bit-width to 8 results in an average score drop of 1.7 points.
arXiv Detail & Related papers (2022-09-20T16:02:28Z)
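As a rough illustration of why bit-width matters in the entry above, the sketch below round-trips a tensor through signed integer grids of different widths; the helper name fake_quantize and the per-tensor symmetric scheme are assumptions, not the paper's fine-tuning procedure.

```python
import numpy as np

def fake_quantize(x, num_bits):
    """Round-trip x through a signed integer grid of the given width.
    A generic illustration of how bit-width sets the grid resolution."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

x = np.random.randn(1024).astype(np.float32)
for bits in (16, 8, 4):
    err = np.abs(fake_quantize(x, bits) - x).max()
    print(f"{bits}-bit grid, max round-trip error: {err:.6f}")
```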
- FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings, E4M3 and E5M2.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
arXiv Detail & Related papers (2022-09-12T17:39:55Z)
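A minimal decoder for the E4M3 encoding discussed above might look as follows, assuming the usual layout of 1 sign bit, a 4-bit exponent with bias 7 and a 3-bit mantissa; this is an illustration of the bit-level idea, not a reference implementation of the proposed interchange format.

```python
def decode_e4m3(byte: int) -> float:
    """Decode one E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
    There are no infinities, and only exponent=0b1111 with mantissa=0b111 is NaN,
    which frees those codes for extra finite values (max normal = 448)."""
    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:
        return float("nan")
    if exp == 0:                                    # subnormals (and signed zero)
        return sign * (mant / 8.0) * 2.0 ** (1 - 7)
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)

print(decode_e4m3(0b0_1111_110))   # 448.0, the largest finite E4M3 value
print(decode_e4m3(0b0_1111_111))   # nan
```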
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
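The entry above can be sketched as a vector-wise Int8 matrix multiplication: per-row scales for one operand, per-column scales for the other, int32 accumulation, and a final de-quantization by the outer product of the scale vectors. The code below is a simplified sketch; the actual LLM.int8() method additionally keeps outlier feature dimensions in 16-bit, which is omitted here.

```python
import numpy as np

def vectorwise_int8_matmul(a, b):
    """Quantize A row-wise and B column-wise to int8, multiply with int32
    accumulation, then de-quantize with the outer product of the scale vectors.
    (The outlier decomposition used by LLM.int8() is omitted.)"""
    sa = np.abs(a).max(axis=1, keepdims=True) / 127.0    # one scale per row of A
    sb = np.abs(b).max(axis=0, keepdims=True) / 127.0    # one scale per column of B
    qa = np.round(a / sa).astype(np.int8)
    qb = np.round(b / sb).astype(np.int8)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa @ sb)            # outer product of scales

a = np.random.randn(4, 16).astype(np.float32)
b = np.random.randn(16, 8).astype(np.float32)
print(np.abs(vectorwise_int8_matmul(a, b) - a @ b).max())
```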
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
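The block-wise idea above can be sketched as follows: chunk the optimizer state into fixed-size blocks, normalize each block by its own absolute maximum, and store 8-bit codes plus one scale per block, so an outlier only affects its own block. The uniform code and helper names below are assumptions; the paper uses a learned, non-linear dynamic quantization map.

```python
import numpy as np

def blockwise_quantize(state, block_size=2048):
    """Quantize a flat optimizer-state tensor in independent blocks.
    Each block stores int8 codes plus one float scale (its absolute max)."""
    flat = state.ravel()
    pad = (-len(flat)) % block_size
    flat = np.concatenate([flat, np.zeros(pad, dtype=flat.dtype)])
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0
    codes = np.round(blocks / scales * 127).astype(np.int8)
    return codes, scales, state.shape, pad

def blockwise_dequantize(codes, scales, shape, pad):
    flat = ((codes.astype(np.float32) / 127) * scales).ravel()
    return flat[: flat.size - pad].reshape(shape) if pad else flat.reshape(shape)

m = np.random.randn(3, 1000).astype(np.float32)          # e.g. Adam's first moment
codes, scales, shape, pad = blockwise_quantize(m, block_size=256)
print(np.abs(blockwise_dequantize(codes, scales, shape, pad) - m).max())
```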
- I-BERT: Integer-only BERT Quantization [78.43819756382103]
We propose I-BERT, a novel quantization scheme for Transformer-based models.
I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation.
We show that, in both cases, I-BERT achieves accuracy similar to (and slightly higher than) the full-precision baseline.
arXiv Detail & Related papers (2021-01-05T02:42:58Z)
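Integer-only inference of the kind described above replaces floating-point nonlinearities with polynomials that can be evaluated in integer arithmetic once their coefficients are folded into pre-computed integer constants and an output scale. The sketch below shows that building block with made-up coefficients and helper names; it is not I-BERT's published i-GELU or i-Softmax.

```python
def int_poly2(q, scale, a, b, c):
    """Evaluate a*(x + b)**2 + c for x = q*scale using integer arithmetic only:
    the floating-point constants are folded offline into integer offsets and an
    output scale. Coefficients here are generic, not I-BERT's published ones."""
    qb = int(round(b / scale))             # b expressed on the input integer grid
    out_scale = a * scale * scale          # scale of the squared term
    qc = int(round(c / out_scale))         # c expressed on the output integer grid
    q_out = (q + qb) * (q + qb) + qc       # integer ops only at inference time
    return q_out, out_scale                # real value ~= q_out * out_scale

# Compare against the float polynomial for a few quantized inputs.
scale = 0.05
for q in (-40, -3, 0, 7, 25):
    x = q * scale
    q_out, s_out = int_poly2(q, scale, a=0.5, b=1.25, c=-0.3)
    print(x, q_out * s_out, 0.5 * (x + 1.25) ** 2 - 0.3)
```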
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
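The "dyadic" idea above can be illustrated with requantization: if every rescaling factor is approximated by a dyadic number b / 2**c, then moving an integer accumulator from one scale to another needs only an integer multiply and a bit shift, with no floating-point conversion at inference time. The helper names and 16-bit shift below are assumptions, not HAWQV3's exact formulation.

```python
def to_dyadic(scale_ratio, shift_bits=16):
    """Approximate a real rescaling factor by a dyadic number b / 2**c,
    so that requantization needs only an integer multiply and a right shift."""
    c = shift_bits
    b = int(round(scale_ratio * (1 << c)))
    return b, c

def requantize(acc, s_in, s_out):
    """Map an integer accumulator from scale s_in to scale s_out without
    floating point at inference time (the dyadic constants are pre-computed)."""
    b, c = to_dyadic(s_in / s_out)
    return (acc * b) >> c                 # integer multiply + bit shift

acc = 12345                               # int32 accumulator from an int8 matmul
s_in, s_out = 3.2e-4, 2.1e-2              # accumulator scale and output scale
print(requantize(acc, s_in, s_out), acc * s_in / s_out)   # approximately equal
```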
- Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks [13.929168096016957]
We introduce a novel methodology for training deep neural networks using 8-bit floating point (FP8) numbers.
Reduced bit precision allows for a larger effective memory and increased computational speed.
We show that, unlike previous 8-bit precision training methods, the proposed method works out-of-the-box for representative models.
arXiv Detail & Related papers (2020-01-16T06:38:27Z)
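FP8 training of the kind described above relies on round-tripping tensors through an 8-bit floating-point grid during training. The sketch below simulates rounding to an IEEE-like E5M2 layout (5-bit exponent, 2-bit mantissa, bias 15); the per-tensor shift and squeeze statistics that give the paper its name are deliberately omitted, so this is only a generic low-precision illustration.

```python
import math

def round_to_fp8_e5m2(x, m_bits=2, e_bits=5, bias=15):
    """Round a float to the nearest value representable with a 5-bit exponent
    and 2-bit mantissa (IEEE-like E5M2), saturating at the largest normal value."""
    if x == 0.0 or math.isnan(x) or math.isinf(x):
        return x
    sign = math.copysign(1.0, x)
    mag = abs(x)
    m, e = math.frexp(mag)                     # mag = m * 2**e with 0.5 <= m < 1
    e -= 1                                     # so that 1.0 <= significand < 2
    e = max(e, 1 - bias)                       # below this, values are subnormal
    step = 2.0 ** (e - m_bits)                 # spacing of the representable grid
    q = round(mag / step) * step
    max_normal = (2.0 - 2.0 ** -m_bits) * 2.0 ** (2 ** e_bits - 2 - bias)
    return sign * min(q, max_normal)

for v in (0.1, 3.14159, 1e-7, 70000.0):
    print(v, "->", round_to_fp8_e5m2(v))
```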
- Learning Accurate Integer Transformer Machine-Translation Models [0.05184427980355132]
We describe a method for training accurate Transformer machine-translation models to run inference using 8-bit integer (INT8) hardware matrix multipliers.
Our approach converts all matrix-multiplication tensors from an existing FP32 model into INT8 tensors by automatically making range-precision trade-offs during training.
arXiv Detail & Related papers (2020-01-03T18:40:35Z)
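The range-precision trade-off mentioned above is commonly realized with fake quantization during training: a clipping range is tracked from observed tensors and values are round-tripped through the int8 grid in the forward pass. The class below is a generic quantization-aware-training sketch with assumed names (FakeQuantizer, momentum), not the paper's exact training procedure.

```python
import numpy as np

class FakeQuantizer:
    """Simulate INT8 matrix-multiply inputs during training: track a clipping
    range from observed tensors and round-trip values through the int8 grid.
    Narrowing the range trades dynamic range for finer precision."""
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.running_max = None

    def __call__(self, x):
        observed = np.abs(x).max()
        if self.running_max is None:
            self.running_max = observed
        else:                                  # smooth the range over batches
            self.running_max = (self.momentum * self.running_max
                                + (1 - self.momentum) * observed)
        scale = self.running_max / 127.0
        return np.clip(np.round(x / scale), -127, 127) * scale

fq = FakeQuantizer()
for _ in range(5):                             # pretend these are training batches
    w = np.random.randn(64, 64).astype(np.float32)
    w_q = fq(w)
print("max error on last batch:", np.abs(w_q - w).max())
```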