FP8 Formats for Deep Learning
- URL: http://arxiv.org/abs/2209.05433v1
- Date: Mon, 12 Sep 2022 17:39:55 GMT
- Title: FP8 Formats for Deep Learning
- Authors: Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea,
Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick
Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi,
Michael Siu, Hao Wu
- Abstract summary: We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
- Score: 49.54015320992368
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: FP8 is a natural progression for accelerating deep learning training and
inference beyond the 16-bit formats common in modern processors. In this paper
we propose an 8-bit floating point (FP8) binary interchange format consisting
of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit
exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for
representation of special values, E4M3's dynamic range is extended by not
representing infinities and having only one mantissa bit-pattern for NaNs. We
demonstrate the efficacy of the FP8 format on a variety of image and language
tasks, effectively matching the result quality achieved by 16-bit training
sessions. Our study covers the main modern neural network architectures - CNNs,
RNNs, and Transformer-based models, leaving all the hyperparameters unchanged
from the 16-bit baseline training sessions. Our training experiments include
large, up to 175B parameter, language models. We also examine FP8
post-training-quantization of language models trained using 16-bit formats that
resisted fixed point int8 quantization.
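For reference, a minimal Python decoder consistent with the two proposed encodings is sketched below; decode_fp8 is an illustrative helper, not code from the paper. It assumes the paper's exponent biases of 7 for E4M3 and 15 for E5M2, treats S.1111.111 as E4M3's only NaN pattern, and reserves the all-ones exponent for infinities and NaNs in E5M2 as IEEE 754 does.

```python
# Illustrative decoder for the two FP8 encodings: E4M3 has no infinities and a
# single NaN mantissa pattern (S.1111.111); E5M2 follows IEEE 754 conventions.
def decode_fp8(byte: int, fmt: str = "E4M3") -> float:
    """Decode an 8-bit pattern (0..255) to a Python float."""
    assert 0 <= byte <= 255
    if fmt == "E4M3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "E5M2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError(fmt)

    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    max_exp = (1 << exp_bits) - 1

    if fmt == "E5M2" and exp == max_exp:            # IEEE-style infinities and NaNs
        return sign * float("inf") if man == 0 else float("nan")
    if fmt == "E4M3" and exp == max_exp and man == (1 << man_bits) - 1:
        return float("nan")                          # the single NaN mantissa pattern
    if exp == 0:                                     # subnormals (and signed zero)
        return sign * 2.0 ** (1 - bias) * (man / (1 << man_bits))
    return sign * 2.0 ** (exp - bias) * (1.0 + man / (1 << man_bits))

# Largest finite magnitudes implied by the encodings:
print(decode_fp8(0b0_1111_110, "E4M3"))   # 448.0
print(decode_fp8(0b0_11110_11, "E5M2"))   # 57344.0
```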
Related papers
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier architectures.
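Reading W4A16 as 4-bit weights with 16-bit activations, the sketch below illustrates the general idea with generic group-wise round-to-nearest quantization; quantize_w4 and the group size of 128 are illustrative choices, not the specific algorithms or settings evaluated in the paper.

```python
# Weight-only W4A16 sketch: weights stored as int4 with per-group scales,
# dequantized to fp16 for the matmul against 16-bit activations.
import numpy as np

def quantize_w4(w: np.ndarray, group: int = 128):
    """Round-to-nearest 4-bit quantization with one scale per group."""
    blocks = w.reshape(-1, group)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0   # symmetric int4 range [-7, 7]
    q = np.clip(np.round(blocks / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_w4(q: np.ndarray, scale: np.ndarray, shape):
    return (q.astype(np.float16) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float16)
x = rng.standard_normal((8, 256)).astype(np.float16)          # 16-bit activations
q, s = quantize_w4(w)
w_hat = dequantize_w4(q, s, w.shape)
print(np.abs(x @ w.T - x @ w_hat.T).mean())                    # small quantization error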
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
- Efficient Post-training Quantization with FP8 Formats [14.543387418837154]
We study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures.
E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks.
arXiv Detail & Related papers (2023-09-26T00:58:36Z)
- FP8 versus INT8 for efficient deep learning inference [14.98281493168929]
We compare the performance of the FP8 and INT8 formats for efficient on-device inference.
We show that the FP formats are between 50% and 180% less efficient in terms of compute in dedicated hardware than the INT format.
We conclude that although the proposed FP8 format could be good for training, the results for inference do not warrant a dedicated implementation of FP8.
arXiv Detail & Related papers (2023-03-31T10:29:17Z)
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
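As a rough worked example of that dependence (ignoring the small overhead of quantization constants such as block-wise scales), model size scales linearly with bits per parameter:

```python
# Approximate model size in GiB: parameter_count * bits_per_parameter / 8 bytes.
def model_size_gib(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 2**30

for bits in (16, 8, 4, 3):
    print(f"175B params at {bits}-bit: {model_size_gib(175e9, bits):.1f} GiB")
# 16-bit: 326.0 GiB, 8-bit: 163.0 GiB, 4-bit: 81.5 GiB, 3-bit: 61.1 GiB
```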
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
- FP8 Quantization: The Power of the Exponent [19.179749424362686]
This paper investigates the benefit of the floating point format for neural network inference.
We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent.
We show how these findings translate to real networks, provide an efficient implementation for FP8 simulation, and present a new algorithm.
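The trade-off can be made concrete with a small sketch: for a generic 1/e/m format with IEEE-style bias 2^(e-1)-1, more exponent bits push the smallest normal value further down, while more mantissa bits shrink the relative rounding step. format_stats below is an illustrative helper; the largest finite value is omitted because it also depends on how each format encodes infinities and NaNs (compare E4M3's 448 with E5M2's 57344).

```python
# Range vs. precision for a generic 1/e/m floating-point format with bias 2**(e-1) - 1.
def format_stats(e: int, m: int):
    bias = 2 ** (e - 1) - 1
    min_normal = 2.0 ** (1 - bias)            # smallest positive normal value
    min_subnormal = min_normal * 2.0 ** -m    # smallest positive subnormal value
    eps = 2.0 ** -m                           # spacing between 1.0 and the next value
    return {"bias": bias, "min_normal": min_normal,
            "min_subnormal": min_subnormal, "eps": eps}

for name, e, m in [("E5M2", 5, 2), ("E4M3", 4, 3), ("E3M4", 3, 4)]:
    print(name, format_stats(e, m))
```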
arXiv Detail & Related papers (2022-08-19T09:03:00Z)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
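The core ingredient of such a procedure can be sketched as vector-wise absmax quantization of both operands with int32 accumulation; the mixed-precision handling of outlier feature dimensions that the paper adds on top is omitted here, and int8_matmul is an illustrative helper rather than the library's API.

```python
# Vector-wise int8 matmul sketch: per-row scales for X, per-column scales for W,
# exact int32 accumulation, then dequantization of the result.
import numpy as np

def int8_matmul(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    sx = np.abs(x).max(axis=1, keepdims=True) / 127.0      # one scale per row of X
    sw = np.abs(w).max(axis=0, keepdims=True) / 127.0      # one scale per column of W
    xq = np.round(x / sx).astype(np.int8)
    wq = np.round(w / sw).astype(np.int8)
    acc = xq.astype(np.int32) @ wq.astype(np.int32)        # int32 accumulator
    return acc * (sx * sw)                                  # back to floating point

rng = np.random.default_rng(0)
x, w = rng.standard_normal((4, 64)), rng.standard_normal((64, 16))
print(np.abs(int8_matmul(x, w) - x @ w).max())              # small quantization error
```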
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
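A hedged sketch of the block-wise idea: the optimizer state is chunked into fixed-size blocks, each with its own absmax scale, and stored as int8 between steps. The linear quantizer below stands in for the paper's dynamic quantization map; quantize_state and the block size are illustrative choices.

```python
# Block-wise 8-bit quantization of an optimizer state tensor (e.g. a momentum buffer).
import numpy as np

BLOCK = 2048

def quantize_state(state: np.ndarray):
    blocks = state.reshape(-1, BLOCK)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 127.0   # one scale per block
    return np.round(blocks / scale).astype(np.int8), scale

def dequantize_state(q: np.ndarray, scale: np.ndarray, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

m = np.random.default_rng(0).standard_normal(2048 * 8).astype(np.float32)
q, s = quantize_state(m)              # int8 state plus one fp32 scale per block
m_hat = dequantize_state(q, s, m.shape)
print(np.abs(m - m_hat).max())        # error bounded by half a quantization step per block
```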
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- Representation range needs for 16-bit neural network training [2.2657486535885094]
In floating-point arithmetic there is a tradeoff between precision and representation range as the number of exponent bits changes.
We propose a 1/6/9 format, i.e., 6-bit exponent and 9-bit explicit mantissa, that offers a better range-precision tradeoff.
We show that 1/6/9 mixed-precision training is able to speed up training on hardware that incurs a performance slowdown on denormal operations.
arXiv Detail & Related papers (2021-03-29T20:30:02Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
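The "dyadic" in the title refers to rescaling with numbers of the form b/2^c, so an int32 accumulator can be brought back to int8 using only an integer multiply and a bit shift. The sketch below illustrates that step with made-up values; to_dyadic and requantize are hypothetical helpers, not the paper's implementation.

```python
# Dyadic requantization sketch: approximate a real-valued rescaling factor by
# b / 2**shift, so no floating point is needed at inference time.
import numpy as np

def to_dyadic(scale: float, shift: int = 24):
    """Approximate `scale` as b / 2**shift with integer b."""
    return int(round(scale * (1 << shift))), shift

def requantize(acc_int32: np.ndarray, scale: float) -> np.ndarray:
    b, c = to_dyadic(scale)
    out = (acc_int32.astype(np.int64) * b) >> c           # integer multiply + bit shift
    return np.clip(out, -128, 127).astype(np.int8)

acc = np.array([12345, -6789, 250000], dtype=np.int32)    # hypothetical int32 accumulators
print(requantize(acc, 0.00042))                            # roughly acc * 0.00042, clipped to int8
```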
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
- Towards Fully 8-bit Integer Inference for the Transformer Model [39.22272841663168]
We show that after a principled modification of the Transformer architecture, dubbed the Integer Transformer, an (almost) fully 8-bit integer inference algorithm can be derived.
Our experiments on the WMT16 En->Ro, WMT14 En->De and En->Fr translation tasks as well as the WikiText-103 language modelling task show that the fully 8-bit Transformer system achieves comparable performance with the floating point baseline while requiring a nearly 4x smaller memory footprint.
arXiv Detail & Related papers (2020-09-17T03:09:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.