EFloat: Entropy-coded Floating Point Format for Deep Learning
- URL: http://arxiv.org/abs/2102.02705v1
- Date: Thu, 4 Feb 2021 15:58:01 GMT
- Title: EFloat: Entropy-coded Floating Point Format for Deep Learning
- Authors: Rajesh Bordawekar and Bulent Abali and Ming-Hung Chen
- Abstract summary: EFloat format encodes frequent exponent values with Huffman codes to minimize the average exponent field width.
The proposed encoding concept may be beneficial to low-precision formats including 8-bit floats.
- Score: 2.3204178451683264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe the EFloat floating-point number format, which provides 4 to 6 additional bits of precision and a wider exponent range than existing floating-point (FP) formats of any width, including FP32, BFloat16, IEEE half precision, DLFloat, TensorFloat, and 8-bit floats. In a large class of deep learning models we observe that FP exponent values tend to cluster around a few unique values, which presents entropy-encoding opportunities. The EFloat format encodes frequent exponent values and signs with Huffman codes to minimize the average exponent field width. The saved bits then become available to the mantissa, increasing EFloat numeric precision on average by 4 to 6 bits compared to other FP formats of equal width. The proposed encoding concept may also benefit low-precision formats, including 8-bit floats. Training deep learning models with low-precision arithmetic is challenging; EFloat, with its increased precision, may provide an opportunity for those tasks as well. We currently use the EFloat format for compressing and saving memory used in large NLP deep learning models. A potential hardware implementation for mitigating the PCIe and memory bandwidth limitations of AI accelerators is also discussed.
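As a rough illustration of the encoding idea in the abstract, the sketch below measures how many sign-plus-exponent bits a Huffman code would save on a toy FP32 weight tensor. It is a hypothetical standalone example, not the authors' implementation; the synthetic tensor, the joint (sign, exponent) symbol choice, and the 9-bit fixed baseline (1 sign + 8 exponent bits in FP32) are assumptions made for illustration.

```python
# Sketch: how many exponent/sign bits would Huffman coding save on a tensor,
# freeing bits for the mantissa? Illustrative only, not the EFloat codec.
import heapq
from collections import Counter
import numpy as np

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a frequency table."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = Counter()
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1          # each merge adds one bit to these symbols
        heapq.heappush(heap, (f1 + f2, next_id, syms1 + syms2))
        next_id += 1
    return dict(lengths)

# Toy "weight tensor": FP32 values whose exponents cluster, as observed in the paper.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=100_000).astype(np.float32)

bits = weights.view(np.uint32)
signs = bits >> 31                    # 1 sign bit
exponents = (bits >> 23) & 0xFF       # 8 exponent bits in FP32

# Symbol = (sign, exponent) pair, Huffman-coded jointly as in the abstract.
freqs = Counter(zip(signs.tolist(), exponents.tolist()))
lengths = huffman_code_lengths(freqs)

n = weights.size
avg_bits = sum(freqs[s] * lengths[s] for s in freqs) / n
print("fixed sign+exponent width : 9.00 bits")
print(f"avg Huffman width         : {avg_bits:.2f} bits")
print(f"bits freed for mantissa   : {9.0 - avg_bits:.2f}")
```

On weights whose exponents cluster tightly, the average Huffman width drops well below the fixed 9 bits; that difference is the precision the abstract describes reallocating to the mantissa.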
Related papers
- Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs [39.410068572891475]
Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead.
Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference.
We present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model.
arXiv Detail & Related papers (2023-11-21T05:27:16Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup-table-based approach for the execution of ultra-low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
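As a hedged sketch of the lookup-table idea (not DeepGEMM's SIMD kernels), the snippet below replaces inner-loop multiplications with gathers from a precomputed table of codebook-activation products; the 2-bit codebook and vector sizes are illustrative assumptions.

```python
# Lookup-table dot product for 2-bit-coded weights (illustrative only).
import numpy as np

# Hypothetical 2-bit weight codebook (4 representable values).
codebook = np.array([-1.0, -0.5, 0.5, 1.0], dtype=np.float32)

def lut_dot(w_codes, x):
    """Dot product of 2-bit-coded weights with a float vector via table lookup."""
    # Precompute products of every codebook entry with every activation once,
    # then gather instead of multiplying inside the inner loop.
    table = np.outer(codebook, x)                     # shape (4, len(x))
    return table[w_codes, np.arange(len(x))].sum()

rng = np.random.default_rng(1)
x = rng.standard_normal(64).astype(np.float32)
w_codes = rng.integers(0, 4, size=64)
print(lut_dot(w_codes, x), (codebook[w_codes] * x).sum())  # both should match
```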
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
- Accuracy Booster: Enabling 4-bit Fixed-point Arithmetic for DNN Training [31.515532976570643]
We show that single-level scaling is sufficient to maintain training accuracy while maximizing arithmetic density.
We propose Accuracy Booster, a mixed-mantissa HBFP technique that uses 4-bit mantissas for over 99% of all arithmetic operations in training.
arXiv Detail & Related papers (2022-11-19T16:17:11Z)
- FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
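A minimal decoder for the E4M3 behavior described above, assuming the commonly published layout (1 sign, 4 exponent, and 3 mantissa bits with bias 7); this is a sketch for illustration, not the paper's reference code.

```python
def decode_e4m3(byte: int) -> float:
    """Decode one E4M3 byte: no infinities, a single NaN mantissa pattern."""
    s = -1.0 if (byte >> 7) & 1 else 1.0
    e = (byte >> 3) & 0xF                      # 4 exponent bits, bias 7
    m = byte & 0x7                             # 3 mantissa bits
    if e == 0xF and m == 0x7:
        return float("nan")                    # only S.1111.111 is NaN
    if e == 0:
        return s * (m / 8.0) * 2.0 ** -6       # subnormals
    return s * (1.0 + m / 8.0) * 2.0 ** (e - 7)

print(decode_e4m3(0x7E))   # 0.1111.110 -> 448.0, the largest finite magnitude
```

Because the all-ones exponent is not reserved for infinities, the top binade stays usable for finite values, which is how the dynamic range is extended.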
arXiv Detail & Related papers (2022-09-12T17:39:55Z)
- FP8 Quantization: The Power of the Exponent [19.179749424362686]
This paper investigates the benefit of the floating point format for neural network inference.
We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent.
We show how these findings translate to real networks, provide an efficient implementation for FP8 simulation, and present a new algorithm.
arXiv Detail & Related papers (2022-08-19T09:03:00Z)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [80.86029795281922]
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers.
A 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation.
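A simplified sketch of vector-wise absmax Int8 matrix multiplication follows; it omits LLM.int8()'s mixed-precision decomposition for outlier features, and the shapes and helper names are illustrative assumptions.

```python
# Vector-wise absmax Int8 matmul sketch (no outlier decomposition).
import numpy as np

def absmax_quantize(x, axis):
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(A, B):
    qA, sA = absmax_quantize(A, axis=1)                # one scale per row of A
    qB, sB = absmax_quantize(B, axis=0)                # one scale per column of B
    acc = qA.astype(np.int32) @ qB.astype(np.int32)    # integer accumulation
    return acc * (sA * sB)                             # rescale back to float

rng = np.random.default_rng(2)
A, B = rng.standard_normal((4, 64)), rng.standard_normal((64, 8))
print(np.max(np.abs(int8_matmul(A, B) - A @ B)))       # small quantization error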
arXiv Detail & Related papers (2022-08-15T17:08:50Z)
- All-You-Can-Fit 8-Bit Flexible Floating-Point Format for Accurate and Memory-Efficient Inference of Deep Neural Networks [2.294014185517203]
This paper introduces an extremely flexible 8-bit floating-point (FFP8) format.
It achieves an extremely low accuracy loss of $0.1\%\sim 0.3\%$ for several representative image classification models.
It is easy to turn a classical floating-point processing unit into an FFP8-compliant one, and the extra hardware cost is minor.
arXiv Detail & Related papers (2021-04-15T09:37:23Z)
- Representation range needs for 16-bit neural network training [2.2657486535885094]
In floating-point arithmetic there is a tradeoff between precision and representation range as the number of exponent bits changes.
We propose a 1/6/9 format, i.e., 1 sign bit, a 6-bit exponent, and a 9-bit explicit mantissa, that offers a better range-precision tradeoff.
We show that 1/6/9 mixed-precision training is able to speed up training on hardware that incurs a performance slowdown on denormal operations.
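A small back-of-the-envelope comparison of the range-precision tradeoff, assuming an IEEE-style bias of 2^(e-1) - 1 and a reserved top exponent code; the exact conventions of the 1/6/9 format in the paper may differ.

```python
# Rough range/precision comparison for generic (1, e, m) floating-point formats.
def fmt_stats(e, m):
    bias = 2 ** (e - 1) - 1
    max_normal = (2 - 2 ** -m) * 2.0 ** bias   # top exponent code reserved
    min_normal = 2.0 ** (1 - bias)
    ulp_at_1 = 2.0 ** -m                       # relative precision near 1.0
    return max_normal, min_normal, ulp_at_1

for name, e, m in [("fp16 (1/5/10)", 5, 10),
                   ("bf16 (1/8/7) ", 8, 7),
                   ("1/6/9        ", 6, 9)]:
    mx, mn, ulp = fmt_stats(e, m)
    print(f"{name}: max~{mx:.3e}  min_normal~{mn:.3e}  ulp@1~{ulp:.1e}")
```

More exponent bits widen the representable range but shrink the mantissa and thus the relative precision, which is the tradeoff the entry refers to.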
arXiv Detail & Related papers (2021-03-29T20:30:02Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
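A minimal sketch of dyadic requantization in that spirit: a floating-point rescale factor is approximated as b / 2^c so it can be applied with an integer multiply and a right shift only. The fixed c = 16, the toy accumulators, and the helper names are illustrative assumptions, not the HAWQV3 code.

```python
# Dyadic requantization sketch: integer multiply + shift instead of float math.
import numpy as np

def to_dyadic(scale, c=16):
    """Approximate `scale` as (b, c) with scale ~= b / 2**c."""
    b = int(round(scale * (1 << c)))
    return b, c

def requantize(acc_int32, scale):
    """Apply a rescale factor to int32 accumulators using integer ops only."""
    b, c = to_dyadic(scale)
    return (acc_int32.astype(np.int64) * b) >> c       # multiply then shift

acc = np.array([12345, -6789, 40000], dtype=np.int32)
scale = 0.0123                                          # e.g. s_w * s_x / s_out
print(requantize(acc, scale))
print(np.round(acc * scale).astype(np.int64))           # float reference (agrees up to rounding)
```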
arXiv Detail & Related papers (2020-11-20T23:51:43Z)