Comparative Study: Standalone IEEE 16-bit Floating-Point for Image
Classification
- URL: http://arxiv.org/abs/2305.10947v2
- Date: Fri, 25 Aug 2023 05:57:08 GMT
- Title: Comparative Study: Standalone IEEE 16-bit Floating-Point for Image
Classification
- Authors: Juyoung Yun, Byungkon Kang, Francois Rameau, Zhoulai Fu
- Abstract summary: This study focuses on the widely accessible IEEE 16-bit format for comparative analysis.
Our study, supported by a series of rigorous experiments, provides a quantitative explanation of why standalone IEEE 16-bit floating-point neural networks can perform on par with 32-bit and mixed-precision networks in various image classification tasks.
- Score: 2.4321382081341962
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reducing the number of bits needed to encode the weights and activations of
neural networks is highly desirable as it speeds up their training and
inference time while reducing memory consumption. It is unsurprising that
considerable attention has been drawn to developing neural networks that employ
lower-precision computation. This includes IEEE 16-bit, Google bfloat16, 8-bit,
4-bit floating-point or fixed-point, 2-bit, and various mixed-precision
algorithms. Out of these low-precision formats, IEEE 16-bit stands out due to
its universal compatibility with contemporary GPUs. This accessibility
contrasts with bfloat16, which needs high-end GPUs, or other non-standard
fewer-bit designs, which typically require software simulation. This study
focuses on the widely accessible IEEE 16-bit format for comparative analysis.
This analysis involves an in-depth theoretical investigation of the factors
that lead to discrepancies between 16-bit and 32-bit models, including a
formalization of the concepts of floating-point error and tolerance to
understand the conditions under which a 16-bit model can approximate 32-bit
results. Contrary to literature that credits the success of noise-tolerant
neural networks to regularization effects, our study, supported by a series of
rigorous experiments, provides a quantitative explanation of why standalone IEEE
16-bit floating-point neural networks can perform on par with 32-bit and
mixed-precision networks in various image classification tasks. Because no
prior research has studied IEEE 16-bit as a standalone floating-point precision
in neural networks, we believe our findings will have significant impacts,
encouraging the adoption of standalone IEEE 16-bit networks in future neural
network applications.
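The "standalone" setting the abstract describes, keeping weights, activations, and gradients all in IEEE binary16 with no FP32 master copy and no loss scaling, is straightforward to reproduce in a modern framework. The sketch below is a minimal, hypothetical illustration in PyTorch, not the authors' code; the model, data, and hyperparameters are placeholders, and the comments note the machine-epsilon gap that the paper's floating-point error and tolerance analysis concerns.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed setup, not the paper's code): a small classifier kept
# entirely in IEEE binary16 -- weights, activations, and gradients -- with no
# FP32 master weights and no loss scaling, i.e. "standalone" fp16.
#
# Background constants for the error/tolerance discussion:
#   machine epsilon is 2**-10 (~9.8e-4) for float16 vs 2**-23 (~1.2e-7) for float32.
device = "cuda" if torch.cuda.is_available() else "cpu"  # fp16 kernels are most complete on GPUs

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
).to(device=device, dtype=torch.float16)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batch standing in for an image-classification dataset.
x = torch.randn(64, 1, 28, 28, device=device, dtype=torch.float16)
y = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
logits = model(x)          # forward pass runs in fp16 end to end
loss = loss_fn(logits, y)
loss.backward()            # gradients are fp16 as well
optimizer.step()
print(loss.item(), next(model.parameters()).dtype)  # ... torch.float16
```

A mixed-precision setup would instead keep FP32 master weights and apply loss scaling; the paper's claim is that, in the image-classification settings it studies, the all-fp16 configuration above already tracks 32-bit results within the formalized tolerance.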
Related papers
- Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning [54.970571745690634]
This work presents the first systematic investigation into how numerical precision affects Large Language Model inference. We develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32 (the sketch after this list illustrates the general store-in-16-bit, compute-in-FP32 pattern).
arXiv Detail & Related papers (2025-06-11T08:23:53Z)
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks.
We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
- Continuous 16-bit Training: Accelerating 32-bit Pre-Trained Neural Networks [0.0]
This study introduces a novel approach where we continue the training of pre-existing 32-bit models using 16-bit precision.
By adopting 16-bit precision for ongoing training, we are able to substantially decrease memory requirements and computational burden.
Our experiments show that this method maintains the high standards of accuracy set by the original 32-bit training while providing a much-needed boost in training speed.
arXiv Detail & Related papers (2023-11-30T14:28:25Z)
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- The Hidden Power of Pure 16-bit Floating-Point Neural Networks [1.9594704501292781]
Lowering the precision of neural networks from the prevalent 32-bit precision has long been considered harmful to performance.
This paper investigates the unexpected performance gain of pure 16-bit neural networks over the 32-bit networks in classification tasks.
arXiv Detail & Related papers (2023-01-30T12:01:45Z)
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
- FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
arXiv Detail & Related papers (2022-09-12T17:39:55Z)
- MAPLE: Microprocessor A Priori for Latency Estimation [81.91509153539566]
Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption.
Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process.
We propose MAPLE, Microprocessor A Priori for Latency Estimation, which does not rely on transfer learning or domain adaptation.
arXiv Detail & Related papers (2021-11-30T03:52:15Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- PositNN: Training Deep Neural Networks with Mixed Low-Precision Posit [5.534626267734822]
The presented research aims to evaluate the feasibility of training deep convolutional neural networks using posits.
A software framework was developed to use simulated posits and quires in end-to-end training and inference.
Results suggest that 8-bit posits can substitute for 32-bit floats during training with no negative impact on the resulting loss and accuracy.
arXiv Detail & Related papers (2021-04-30T19:30:37Z) - FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference [1.1292678337479967]
FBGEMM is a high-performance kernel library for quantized inference on current-generation CPUs.
It achieves efficiency by fusing common quantization operations with a high-performance GEMM implementation and by shape- and size-specific kernel code generation at runtime.
The library has been deployed at Facebook, where it delivers greater than 2x performance gains with respect to our current production baseline.
arXiv Detail & Related papers (2021-01-13T00:34:04Z) - Revisiting BFloat16 Training [30.99618783594963]
State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision.
Deep learning accelerators are forced to support both 16-bit and 32-bit floating-point units.
arXiv Detail & Related papers (2020-10-13T05:38:07Z)
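Several entries above share the pattern of storing tensors at reduced precision and widening them only at compute time; the LayerCast entry is the clearest example (weights in 16-bit, all computation in FP32). The sketch below is a hypothetical illustration of that general pattern, not the paper's implementation; the class name, layer shape, and initialization are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of the generic "store in 16-bit, compute in FP32" idea:
# parameters live in IEEE fp16 to halve weight memory, but the matmul and its
# accumulation run in fp32 after a just-in-time upcast.
class HalfStoredLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Weights and bias are kept in float16 (half the storage of float32).
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features, dtype=torch.float16) * 0.02
        )
        self.bias = nn.Parameter(torch.zeros(out_features, dtype=torch.float16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Upcast at compute time so the linear algebra happens in float32.
        return F.linear(x.float(), self.weight.float(), self.bias.float())

layer = HalfStoredLinear(512, 256)
y = layer(torch.randn(8, 512))
print(y.dtype, layer.weight.dtype)  # torch.float32 torch.float16
```

An inference-only pipeline in this style would skip gradients entirely; the point of the sketch is only the split between storage dtype and compute dtype.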