EFloat: Entropy-coded Floating Point Format for Deep Learning
- URL: http://arxiv.org/abs/2102.02705v1
- Date: Thu, 4 Feb 2021 15:58:01 GMT
- Title: EFloat: Entropy-coded Floating Point Format for Deep Learning
- Authors: Rajesh Bordawekar and Bulent Abali and Ming-Hung Chen
- Abstract summary: EFloat format encodes frequent exponent values with Huffman codes to minimize the average exponent field width.
The proposed encoding concept may be beneficial to low-precision formats including 8-bit floats.
- Score: 2.3204178451683264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe the EFloat floating-point number format, which provides 4 to 6 additional bits of precision and a wider exponent range than existing floating-point (FP) formats of any width, including FP32, BFloat16, IEEE half precision, DLFloat, TensorFloat, and 8-bit floats. In a large class of deep learning models we observe that FP exponent values tend to cluster around a few unique values, which presents entropy-encoding opportunities. The EFloat format encodes frequent exponent values and signs with Huffman codes to minimize the average exponent field width. The saved bits then become available to the mantissa, increasing EFloat numeric precision on average by 4 to 6 bits compared to other FP formats of equal width. The proposed encoding concept may also benefit low-precision formats, including 8-bit floats. Training deep learning models with low-precision arithmetic is challenging; EFloat, with its increased precision, may provide an opportunity for those tasks as well. We currently use the EFloat format for compressing and saving memory used in large NLP deep learning models. A potential hardware implementation for mitigating the PCIe and memory bandwidth limitations of AI accelerators is also discussed.
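As a rough illustration of the encoding idea in the abstract, the sketch below measures how many sign-plus-exponent bits a Huffman code would save on a toy FP32 weight tensor. It is a hypothetical standalone example, not the authors' implementation; the synthetic tensor, the joint (sign, exponent) symbol choice, and the 9-bit fixed baseline (1 sign + 8 exponent bits in FP32) are assumptions made for illustration.

```python
# Sketch: how many exponent/sign bits would Huffman coding save on a tensor,
# freeing bits for the mantissa? Illustrative only, not the EFloat codec.
import heapq
from collections import Counter
import numpy as np

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a frequency table."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = Counter()
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1          # each merge adds one bit to these symbols
        heapq.heappush(heap, (f1 + f2, next_id, syms1 + syms2))
        next_id += 1
    return dict(lengths)

# Toy "weight tensor": FP32 values whose exponents cluster, as observed in the paper.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=100_000).astype(np.float32)

bits = weights.view(np.uint32)
signs = bits >> 31                    # 1 sign bit
exponents = (bits >> 23) & 0xFF       # 8 exponent bits in FP32

# Symbol = (sign, exponent) pair, Huffman-coded jointly as in the abstract.
freqs = Counter(zip(signs.tolist(), exponents.tolist()))
lengths = huffman_code_lengths(freqs)

n = weights.size
avg_bits = sum(freqs[s] * lengths[s] for s in freqs) / n
print("fixed sign+exponent width : 9.00 bits")
print(f"avg Huffman width         : {avg_bits:.2f} bits")
print(f"bits freed for mantissa   : {9.0 - avg_bits:.2f}")
```

On weights whose exponents cluster tightly, the average Huffman width drops well below the fixed 9 bits; that difference is the precision the abstract describes reallocating to the mantissa.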
Related papers
- Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs [39.410068572891475]
Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead.
Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference.
We present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model.
arXiv Detail & Related papers (2023-11-21T05:27:16Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for the execution of ultra-low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
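As a hedged sketch of the lookup-table idea (not DeepGEMM's SIMD kernels), the snippet below replaces inner-loop multiplications with gathers from a precomputed table of codebook-activation products; the 2-bit codebook and vector sizes are illustrative assumptions.

```python
# Lookup-table dot product for 2-bit-coded weights (illustrative only).
import numpy as np

# Hypothetical 2-bit weight codebook (4 representable values).
codebook = np.array([-1.0, -0.5, 0.5, 1.0], dtype=np.float32)

def lut_dot(w_codes, x):
    """Dot product of 2-bit-coded weights with a float vector via table lookup."""
    # Precompute products of every codebook entry with every activation once,
    # then gather instead of multiplying inside the inner loop.
    table = np.outer(codebook, x)                     # shape (4, len(x))
    return table[w_codes, np.arange(len(x))].sum()

rng = np.random.default_rng(1)
x = rng.standard_normal(64).astype(np.float32)
w_codes = rng.integers(0, 4, size=64)
print(lut_dot(w_codes, x), (codebook[w_codes] * x).sum())  # both should match
```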
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
- Accuracy Booster: Enabling 4-bit Fixed-point Arithmetic for DNN Training [31.515532976570643]
We show that single-level scaling is sufficient to maintain training accuracy while maximizing arithmetic density.
We propose Accuracy Booster, a mixed-mantissa HBFP technique that uses 4-bit mantissas for over 99% of all arithmetic operations in training.
arXiv Detail & Related papers (2022-11-19T16:17:11Z)
- FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
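A minimal decoder for the E4M3 behavior described above, assuming the commonly published layout (1 sign, 4 exponent, and 3 mantissa bits with bias 7); this is a sketch for illustration, not the paper's reference code.

```python
def decode_e4m3(byte: int) -> float:
    """Decode one E4M3 byte: no infinities, a single NaN mantissa pattern."""
    s = -1.0 if (byte >> 7) & 1 else 1.0
    e = (byte >> 3) & 0xF                      # 4 exponent bits, bias 7
    m = byte & 0x7                             # 3 mantissa bits
    if e == 0xF and m == 0x7:
        return float("nan")                    # only S.1111.111 is NaN
    if e == 0:
        return s * (m / 8.0) * 2.0 ** -6       # subnormals
    return s * (1.0 + m / 8.0) * 2.0 ** (e - 7)

print(decode_e4m3(0x7E))   # 0.1111.110 -> 448.0, the largest finite magnitude
```

Because the all-ones exponent is not reserved for infinities, the top binade stays usable for finite values, which is how the dynamic range is extended.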
arXiv Detail & Related papers (2022-09-12T17:39:55Z)
- FP8 Quantization: The Power of the Exponent [19.179749424362686]
This paper investigates the benefit of the floating point format for neural network inference.
We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent.
We show how these findings translate to real networks, provide an efficient implementation for FP8 simulation, and present a new algorithm.
arXiv Detail & Related papers (2022-08-19T09:03:00Z)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
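A simplified sketch of vector-wise absmax Int8 matrix multiplication follows; it omits LLM.int8()'s mixed-precision decomposition for outlier features, and the shapes and helper names are illustrative assumptions.

```python
# Vector-wise absmax Int8 matmul sketch (no outlier decomposition).
import numpy as np

def absmax_quantize(x, axis):
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(A, B):
    qA, sA = absmax_quantize(A, axis=1)                # one scale per row of A
    qB, sB = absmax_quantize(B, axis=0)                # one scale per column of B
    acc = qA.astype(np.int32) @ qB.astype(np.int32)    # integer accumulation
    return acc * (sA * sB)                             # rescale back to float

rng = np.random.default_rng(2)
A, B = rng.standard_normal((4, 64)), rng.standard_normal((64, 8))
print(np.max(np.abs(int8_matmul(A, B) - A @ B)))       # small quantization error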
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
- All-You-Can-Fit 8-Bit Flexible Floating-Point Format for Accurate and Memory-Efficient Inference of Deep Neural Networks [2.294014185517203]
This paper introduces an extremely flexible 8-bit floating-point (FFP8) format.
It achieves an extremely low accuracy loss of $0.1\%\sim 0.3\%$ for several representative image classification models.
It is easy to turn a classical floating-point processing unit into an FFP8-compliant one, and the extra hardware cost is minor.
arXiv Detail & Related papers (2021-04-15T09:37:23Z)
- Representation range needs for 16-bit neural network training [2.2657486535885094]
In floating-point arithmetic there is a tradeoff between precision and representation range as the number of exponent bits changes.
We propose a 1/6/9 format, i.e., 1 sign bit, a 6-bit exponent, and a 9-bit explicit mantissa, that offers a better range-precision tradeoff.
We show that 1/6/9 mixed-precision training is able to speed up training on hardware that incurs a performance slowdown on denormal operations.
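A small back-of-the-envelope comparison of the range-precision tradeoff, assuming an IEEE-style bias of 2^(e-1) - 1 and a reserved top exponent code; the exact conventions of the 1/6/9 format in the paper may differ.

```python
# Rough range/precision comparison for generic (1, e, m) floating-point formats.
def fmt_stats(e, m):
    bias = 2 ** (e - 1) - 1
    max_normal = (2 - 2 ** -m) * 2.0 ** bias   # top exponent code reserved
    min_normal = 2.0 ** (1 - bias)
    ulp_at_1 = 2.0 ** -m                       # relative precision near 1.0
    return max_normal, min_normal, ulp_at_1

for name, e, m in [("fp16 (1/5/10)", 5, 10),
                   ("bf16 (1/8/7) ", 8, 7),
                   ("1/6/9        ", 6, 9)]:
    mx, mn, ulp = fmt_stats(e, m)
    print(f"{name}: max~{mx:.3e}  min_normal~{mn:.3e}  ulp@1~{ulp:.1e}")
```

More exponent bits widen the representable range but shrink the mantissa and thus the relative precision, which is the tradeoff the entry refers to.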
arXiv Detail & Related papers (2021-03-29T20:30:02Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
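A minimal sketch of dyadic requantization in that spirit: a floating-point rescale factor is approximated as b / 2^c so it can be applied with an integer multiply and a right shift only. The fixed c = 16, the toy accumulators, and the helper names are illustrative assumptions, not the HAWQV3 code.

```python
# Dyadic requantization sketch: integer multiply + shift instead of float math.
import numpy as np

def to_dyadic(scale, c=16):
    """Approximate `scale` as (b, c) with scale ~= b / 2**c."""
    b = int(round(scale * (1 << c)))
    return b, c

def requantize(acc_int32, scale):
    """Apply a rescale factor to int32 accumulators using integer ops only."""
    b, c = to_dyadic(scale)
    return (acc_int32.astype(np.int64) * b) >> c       # multiply then shift

acc = np.array([12345, -6789, 40000], dtype=np.int32)
scale = 0.0123                                          # e.g. s_w * s_x / s_out
print(requantize(acc, scale))
print(np.round(acc * scale).astype(np.int64))           # float reference (agrees up to rounding)
```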
arXiv Detail & Related papers (2020-11-20T23:51:43Z)