Representation range needs for 16-bit neural network training
- URL: http://arxiv.org/abs/2103.15940v1
- Date: Mon, 29 Mar 2021 20:30:02 GMT
- Title: Representation range needs for 16-bit neural network training
- Authors: Valentina Popescu and Abhinav Venigalla and Di Wu and Robert Schreiber
- Abstract summary: In floating-point arithmetic there is a tradeoff between precision and representation range as the number of exponent bits changes.
We propose a 1/6/9 format, i.e., 6-bit exponent and 9-bit explicit mantissa, that offers a better range-precision tradeoff.
We show that 1/6/9 mixed-precision training is able to speed up training on hardware that incurs a performance slowdown on denormal operations.
- Score: 2.2657486535885094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning has grown rapidly thanks to its state-of-the-art performance
across a wide range of real-world applications. While neural networks have been
trained using IEEE-754 binary32 arithmetic, the rapid growth of computational
demands in deep learning has boosted interest in faster, low precision
training. Mixed-precision training that combines IEEE-754 binary16 with
IEEE-754 binary32 has been tried, and other 16-bit formats, for example
Google's bfloat16, have become popular. In floating-point arithmetic there is a
tradeoff between precision and representation range as the number of exponent
bits changes; denormal numbers extend the representation range. This raises
questions of how much exponent range is needed, of whether there is a format
between binary16 (5 exponent bits) and bfloat16 (8 exponent bits) that works
better than either of them, and whether or not denormals are necessary.
In the current paper we study the need for denormal numbers for
mixed-precision training, and we propose a 1/6/9 format, i.e., 6-bit exponent
and 9-bit explicit mantissa, that offers a better range-precision tradeoff. We
show that 1/6/9 mixed-precision training is able to speed up training on
hardware that incurs a performance slowdown on denormal operations, or to
eliminate the need for denormal numbers altogether. And, for a number of fully
connected and convolutional neural networks in computer vision and natural
language processing, 1/6/9 achieves numerical parity with standard
mixed-precision.
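As a rough illustration of the range-precision tradeoff described in the abstract, the short sketch below (not from the paper) computes the key statistics of each 16-bit layout, assuming standard IEEE-754-style conventions: bias 2^(e-1)-1, the top exponent code reserved for infinities/NaNs, and denormals filling in below the smallest normal. Whether the proposed 1/6/9 format reserves its top exponent code in the same way is an assumption here.
```python
# Sketch: range/precision statistics for 16-bit floating-point layouts,
# assuming IEEE-754-style conventions (bias = 2^(e-1) - 1, top exponent
# code reserved for inf/NaN, denormals below the smallest normal).

def format_stats(exp_bits: int, man_bits: int):
    bias = 2 ** (exp_bits - 1) - 1
    max_normal = (2 - 2.0 ** -man_bits) * 2.0 ** bias   # largest finite value
    min_normal = 2.0 ** (1 - bias)                      # smallest normal value
    min_denormal = 2.0 ** (1 - bias - man_bits)         # smallest denormal value
    epsilon = 2.0 ** -man_bits                          # spacing just above 1.0
    return max_normal, min_normal, min_denormal, epsilon

layouts = {
    "binary16 (1/5/10)": (5, 10),
    "bfloat16 (1/8/7)":  (8, 7),
    "1/6/9 (proposed)":  (6, 9),
}

for name, (e, m) in layouts.items():
    mx, mn, dn, eps = format_stats(e, m)
    print(f"{name}: max {mx:.3g}, min normal {mn:.3g}, "
          f"min denormal {dn:.3g}, epsilon {eps:.3g}")
```
Under these assumptions, binary16 tops out near 6.6e4 with about 3 decimal digits of precision, bfloat16 reaches about 3.4e38 with just over 2 digits, and 1/6/9 reaches about 4.3e9 (down to about 1.8e-12 at the denormal end) while staying close to binary16's precision, which is the intermediate tradeoff the paper argues for.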
Related papers
- Expressive Power of ReLU and Step Networks under Floating-Point Operations [11.29958155597398]
We show that neural networks using a binary threshold unit or ReLU can memorize any finite set of input/output pairs.
We also show similar results on memorization and universal approximation when floating-point operations use finite bits for both significand and exponent.
arXiv Detail & Related papers (2024-01-26T05:59:40Z)
- Positional Description Matters for Transformers Arithmetic [58.4739272381373]
Transformers often falter on arithmetic tasks despite their vast capabilities.
We propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently.
arXiv Detail & Related papers (2023-11-22T00:31:01Z)
- Guaranteed Approximation Bounds for Mixed-Precision Neural Operators [83.64404557466528]
We build on the intuition that neural operator learning inherently induces an approximation error.
We show that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.
arXiv Detail & Related papers (2023-07-27T17:42:06Z)
- FP8 Formats for Deep Learning [49.54015320992368]
We propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings.
E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs (see the sketch after this list).
We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions.
arXiv Detail & Related papers (2022-09-12T17:39:55Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using -1, +1 to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- PositNN: Training Deep Neural Networks with Mixed Low-Precision Posit [5.534626267734822]
The presented research aims to evaluate the feasibility of training deep convolutional neural networks using posits.
A software framework was developed to use simulated posits and quires in end-to-end training and inference.
Results suggest that 8-bit posits can substitute for 32-bit floats during training with no negative impact on the resulting loss and accuracy.
arXiv Detail & Related papers (2021-04-30T19:30:37Z)
- Deep Neural Network Training without Multiplications [0.0]
We show that ResNet can be trained using this operation with competitive classification accuracy.
This method will enable the elimination of multiplications in deep neural-network training and inference.
arXiv Detail & Related papers (2020-12-07T05:40:50Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
- NITI: Training Integer Neural Networks Using Integer-only Arithmetic [4.361357921751159]
We present NITI, an efficient deep neural network training framework that computes exclusively with integer arithmetic.
A proof-of-concept open-source software implementation of NITI that utilizes native 8-bit integer operations is presented.
NITI achieves negligible accuracy degradation on the MNIST and CIFAR10 datasets using 8-bit integer storage and computation.
arXiv Detail & Related papers (2020-09-28T07:41:36Z)
- Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We replace conventional ReLU with Bounded ReLU and find that the accuracy decline is due to activation quantization.
Our integer networks achieve performance equivalent to the corresponding FPN networks, but have only 1/4 the memory cost and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
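The dynamic-range claim in the FP8 entry above can be checked with a short worked example. The sketch below is not taken from that paper's text; it assumes the E4M3 encoding it describes (4 exponent bits with bias 7, 3 mantissa bits, no infinities, and only the all-ones exponent with all-ones mantissa reserved for NaN), and compares the largest finite value against what an IEEE-754-style reservation of the whole top exponent code would allow.
```python
# Sketch: how E4M3's special-value choices extend its dynamic range.
EXP_BITS, MAN_BITS, BIAS = 4, 3, 7

# IEEE-754-style convention: the all-ones exponent code (1111) is reserved
# for inf/NaN, so the largest finite value is 1.111_2 * 2^(14 - 7).
ieee_style_max = (2 - 2 ** -MAN_BITS) * 2 ** ((2 ** EXP_BITS - 2) - BIAS)

# E4M3 convention (assumed here): only exponent 1111 with mantissa 111 is NaN,
# so the largest finite value is 1.110_2 * 2^(15 - 7).
e4m3_max = (2 - 2 * 2 ** -MAN_BITS) * 2 ** ((2 ** EXP_BITS - 1) - BIAS)

print(ieee_style_max)  # 240.0
print(e4m3_max)        # 448.0
```
Under these assumptions the largest finite magnitude grows from 240 to 448, which is what the entry means by the dynamic range being extended.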
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.