Comparative Study: Standalone IEEE 16-bit Floating-Point for Image Classification
- URL: http://arxiv.org/abs/2305.10947v2
- Date: Fri, 25 Aug 2023 05:57:08 GMT
- Title: Comparative Study: Standalone IEEE 16-bit Floating-Point for Image Classification
- Authors: Juyoung Yun, Byungkon Kang, Francois Rameau, Zhoulai Fu
- Abstract summary: This study focuses on the widely accessible IEEE 16-bit format for comparative analysis.
Our study, supported by a series of rigorous experiments, provides a quantitative explanation of why standalone IEEE 16-bit floating-point neural networks can perform on par with 32-bit and mixed-precision networks in various image classification tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reducing the number of bits needed to encode the weights and activations of
neural networks is highly desirable as it speeds up their training and
inference time while reducing memory consumption. It is unsurprising that
considerable attention has been drawn to developing neural networks that employ
lower-precision computation. This includes IEEE 16-bit, Google bfloat16, 8-bit,
4-bit floating-point or fixed-point, 2-bit, and various mixed-precision
algorithms. Out of these low-precision formats, IEEE 16-bit stands out due to
its universal compatibility with contemporary GPUs. This accessibility
contrasts with bfloat16, which requires high-end GPUs, and with other
non-standard formats that use fewer bits and typically must be simulated in
software. This study
focuses on the widely accessible IEEE 16-bit format for comparative analysis.
This analysis involves an in-depth theoretical investigation of the factors
that lead to discrepancies between 16-bit and 32-bit models, including a
formalization of the concepts of floating-point error and tolerance to
understand the conditions under which a 16-bit model can approximate 32-bit
results. Contrary to literature that credits the success of noise-tolerant
neural networks to regularization effects, our study, supported by a series of
rigorous experiments, provides a quantitative explanation of why standalone IEEE
16-bit floating-point neural networks can perform on par with 32-bit and
mixed-precision networks in various image classification tasks. Because no
prior research has studied IEEE 16-bit as a standalone floating-point precision
in neural networks, we believe our findings will have significant impacts,
encouraging the adoption of standalone IEEE 16-bit networks in future neural
network applications.
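
To make the setting concrete, the sketch below (not the authors' code; the architecture, optimizer, and data are placeholders) trains a small classifier entirely in IEEE float16, with no float32 master weights, alongside a float32 twin, and then measures the floating-point error between their outputs:

```python
# Minimal sketch (not the authors' code) of standalone IEEE 16-bit training:
# weights, activations, and gradients are all torch.float16, with no float32
# master copy as in mixed-precision schemes.
import torch
import torch.nn as nn

# fp16 kernel coverage is most complete on GPU; some CPU builds lack half ops
device = "cuda" if torch.cuda.is_available() else "cpu"

def make_model(dtype):
    torch.manual_seed(0)  # identical initialization for both precisions
    return nn.Sequential(nn.Flatten(),
                         nn.Linear(784, 256), nn.ReLU(),
                         nn.Linear(256, 10)).to(device, dtype)

model16, model32 = make_model(torch.float16), make_model(torch.float32)
opt16 = torch.optim.SGD(model16.parameters(), lr=1e-2)
opt32 = torch.optim.SGD(model32.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 784, device=device)      # stand-in for an image batch
y = torch.randint(0, 10, (64,), device=device)

for _ in range(10):
    for model, opt, dtype in ((model16, opt16, torch.float16),
                              (model32, opt32, torch.float32)):
        opt.zero_grad()
        loss_fn(model(x.to(dtype)), y).backward()
        opt.step()

# "Floating-point error" in the paper's sense: the gap between the 16-bit
# model and its 32-bit counterpart on the same input.
gap = (model16(x.half()).float() - model32(x)).abs().max()
print(f"max |fp16 - fp32| logit gap: {gap.item():.4f}")
```

The paper's claim is that, across image classification tasks, a gap measured this way stays within a tolerance that does not change the model's predictions.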
Related papers
- Compressed Real Numbers for AI: a case-study using a RISC-V CPU (arXiv 2023-09-11)
  We focus on two families of formats that have achieved interesting results in compressing binary32 numbers in machine learning applications.
  We propose a way to decompress a tensor of bfloats/posits just before computation.
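
A minimal NumPy sketch of the bfloat16 side of that idea (posits would need a dedicated library; truncation here stands in for whatever rounding the compressed format actually uses): a binary32 tensor is stored as its upper 16 bits and expanded back just before computation.

```python
import numpy as np

def compress_bf16(x: np.ndarray) -> np.ndarray:
    """Keep the upper 16 bits (sign, 8-bit exponent, 7-bit mantissa)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)      # truncation, not rounding

def decompress_bf16(stored: np.ndarray) -> np.ndarray:
    """Re-attach 16 zero mantissa bits to recover a valid float32."""
    return (stored.astype(np.uint32) << 16).view(np.float32)

x = np.random.randn(4).astype(np.float32)
x16 = compress_bf16(x)                         # half the memory traffic
print(x, decompress_bf16(x16), sep="\n")       # agree to ~2-3 decimal digits
```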
arXiv Detail & Related papers (2023-09-11T07:54:28Z) - The Hidden Power of Pure 16-bit Floating-Point Neural Networks [1.9594704501292781]
Lowering the precision of neural networks from the prevalent 32-bit precision has long been considered harmful to performance.
This paper investigates the unexpected performance gain of pure 16-bit neural networks over the 32-bit networks in classification tasks.
arXiv Detail & Related papers (2023-01-30T12:01:45Z) - FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs.
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
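
  The dynamic-range trick is easy to see in a decoder. The sketch below is an illustration based on the encoding rules quoted above (not the paper's reference code): with 1 sign, 4 exponent, and 3 mantissa bits and bias 7, the all-ones exponent still encodes ordinary values, a single mantissa pattern is reserved for NaN, and the maximum finite value becomes 448.

```python
def decode_e4m3(byte: int) -> float:
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")                # the single NaN mantissa pattern
    if exp == 0:                           # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** (1 - 7)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

assert decode_e4m3(0x7E) == 448.0          # S=0, E=1111, M=110 -> max finite
```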
arXiv Detail & Related papers (2022-09-12T17:39:55Z) - ZippyPoint: Fast Interest Point Detection, Description, and Matching
through Mixed Precision Discretization [71.91942002659795]
We investigate and adapt network quantization techniques to accelerate inference and enable its use on compute limited platforms.
ZippyPoint, our efficient quantized network with binary descriptors, improves the network runtime speed, the descriptor matching speed, and the 3D model size.
These improvements come at a minor performance degradation as evaluated on the tasks of homography estimation, visual localization, and map-free visual relocalization.
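
  Binary descriptors are what make the matching fast: distance computation collapses to XOR plus popcount. A small NumPy sketch with hypothetical 256-bit descriptors (sizes and data are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
desc_a = rng.integers(0, 256, size=(500, 32), dtype=np.uint8)  # 500 keypoints
desc_b = rng.integers(0, 256, size=(600, 32), dtype=np.uint8)

# Hamming distance matrix: popcount of the XOR over the byte axis.
xor = desc_a[:, None, :] ^ desc_b[None, :, :]
dist = np.unpackbits(xor, axis=2).sum(axis=2)   # (500, 600) distances

matches = dist.argmin(axis=1)  # nearest neighbour in B for each point in A
```

  On hardware, the same computation maps to vectorized XOR and POPCNT instructions, which is where the matching speedup comes from.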
arXiv Detail & Related papers (2022-03-07T18:59:03Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and
Acceleration [83.84684675841167]
We propose a novel encoding scheme using -1, +1 to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
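
  The decomposition can be sketched in a few lines: writing each bit b of an M-bit weight as (s + 1)/2 with s in {-1, +1} turns one quantized matmul into M binary matmuls plus a constant correction. A toy NumPy version (dimensions and bit-width are illustrative, not the paper's exact formulation):

```python
import numpy as np

M = 4
rng = np.random.default_rng(0)
W = rng.integers(0, 2 ** M, size=(8, 8))     # quantized weights in [0, 15]
x = rng.standard_normal((1, 8))

# bit i of W, remapped from {0, 1} to {-1, +1}
S = [2 * ((W >> i) & 1) - 1 for i in range(M)]

# W == sum_i 2^(i-1) * (S_i + 1), so x @ W splits into binary matmuls
# plus a constant term from the all-ones matrix.
ones = np.ones_like(W)
recon = sum(2.0 ** (i - 1) * (x @ S[i] + x @ ones) for i in range(M))
assert np.allclose(recon, x @ W)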
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - PositNN: Training Deep Neural Networks with Mixed Low-Precision Posit [5.534626267734822]
The presented research aims to evaluate the feasibility to train deep convolutional neural networks using posits.
A software framework was developed to use simulated posits and quires in end-to-end training and inference.
Results suggest that 8-bit posits can substitute 32-bit floats during training with no negative impact on the resulting loss and accuracy.
arXiv Detail & Related papers (2021-04-30T19:30:37Z) - Representation range needs for 16-bit neural network training [2.2657486535885094]
In floating-point arithmetic there is a tradeoff between precision and representation range as the number of exponent bits changes.
We propose a 1/6/9 format, i.e., 6-bit exponent and 9-bit explicit mantissa, that offers a better range-precision tradeoff.
We show that 1/6/9 mixed-precision training is able to speed up training on hardware that incurs a performance slowdown on denormal operations.
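
  A rough way to simulate 1/6/9 in software (an approximation, not the paper's hardware behavior): truncate a float32 mantissa to 9 explicit bits, flush values below the 6-bit exponent range (bias 31) to zero, and clip overflow to infinity.

```python
import numpy as np

def to_1_6_9(x: np.ndarray) -> np.ndarray:
    bits = x.astype(np.float32).view(np.uint32)
    exp = ((bits >> 23) & 0xFF).astype(np.int32) - 127  # unbiased exponent
    bits &= 0xFFFFC000                    # drop 14 low mantissa bits, keep 9
    out = bits.view(np.float32).copy()
    out[(exp < -30) & (x != 0)] = 0.0     # flush below 6-bit range (no denormals)
    out[exp > 31] = np.inf * np.sign(x[exp > 31])  # overflow past the range
    return out

x = np.array([1e-12, 0.1, 3.14159, 1e10], dtype=np.float32)
print(to_1_6_9(x))   # 0.1 and pi keep ~3 decimal digits; extremes clip
```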
arXiv Detail & Related papers (2021-03-29T20:30:02Z) - Revisiting BFloat16 Training [30.99618783594963]
State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision.
Deep learning accelerators are forced to support both 16-bit and 32-bit floating-point units.
arXiv Detail & Related papers (2020-10-13T05:38:07Z) - Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We present to regard the discrete weights in an arbitrary quantized neural network as searchable variables, and utilize a differential method to search them accurately.
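
  The search idea can be sketched as a continuous relaxation: each discrete weight carries logits over a candidate set, the forward pass uses the softmax-weighted value, and gradients update the logits. The two-level candidate set and tiny regression task below are illustrative, not the paper's exact formulation.

```python
import torch

levels = torch.tensor([-1.0, 1.0])                 # candidate discrete values
logits = torch.zeros(10, 4, 2, requires_grad=True)  # one distribution per weight
x = torch.randn(32, 10)
target = torch.randn(32, 4)

opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(200):
    # relaxed weight: expectation over the candidate levels
    soft_w = (logits.softmax(dim=-1) * levels).sum(-1)
    loss = ((x @ soft_w - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

hard_w = levels[logits.argmax(dim=-1)]  # snap to the searched discrete value
```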
arXiv Detail & Related papers (2020-09-18T09:13:26Z) - Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We replace conventional ReLU with Bounded ReLU and find that the decline is due to activation quantization.
Our integer networks achieve equivalent performance as the corresponding FPN networks, but have only 1/4 memory cost and run 2x faster on modern GPU.
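
  The reason a bound helps is that it fixes the activation range, so a single static scale maps activations onto the integer grid; with an unbounded ReLU the range is data-dependent. A minimal PyTorch sketch (the bound of 6 is a common but arbitrary choice):

```python
import torch

def bounded_relu(x: torch.Tensor, bound: float = 6.0) -> torch.Tensor:
    return x.clamp(min=0.0, max=bound)   # activations confined to [0, bound]

def quantize_uint8(x: torch.Tensor, bound: float = 6.0) -> torch.Tensor:
    scale = bound / 255.0                # static scale, known ahead of time
    return (x / scale).round().clamp(0, 255).to(torch.uint8)

x = torch.randn(4) * 4
q = quantize_uint8(bounded_relu(x))      # integer activations, fixed scale
```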
arXiv Detail & Related papers (2020-06-21T08:23:03Z) - Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantization neural networks (QNNs) are very attractive to the industry because their extremely cheap calculation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most of existing methods aim to enhance performance of QNNs especially binary neural networks by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)