Quantized Neural Network Inference with Precision Batching
- URL: http://arxiv.org/abs/2003.00822v1
- Date: Wed, 26 Feb 2020 19:34:11 GMT
- Title: Quantized Neural Network Inference with Precision Batching
- Authors: Maximilian Lam, Zachary Yedidia, Colby Banbury, Vijay Janapa Reddi
- Abstract summary: PrecisionBatching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations while maintaining activations in full precision.
Across a variety of applications, PrecisionBatching yields end-to-end speedups of over 8x on a GPU within a 1% error margin of the full precision baseline.
- Score: 4.519884877213097
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present PrecisionBatching, a quantized inference algorithm for speeding up
neural network execution on traditional hardware platforms at low bitwidths
without the need for retraining or recalibration. PrecisionBatching decomposes
a neural network into individual bitlayers and accumulates them using fast
1-bit operations while maintaining activations in full precision.
PrecisionBatching not only facilitates quantized inference at low bitwidths (<
8 bits) without the need for retraining/recalibration, but also 1) enables
traditional hardware platforms the ability to realize inference speedups at a
finer granularity of quantization (e.g., 1-16 bit execution) and 2) allows
accuracy and speedup tradeoffs at runtime by exposing the number of bitlayers
to accumulate as a tunable parameter. Across a variety of applications (MNIST,
language modeling, natural language inference) and neural network architectures
(fully connected, RNN, LSTM), PrecisionBatching yields end-to-end speedups of
over 8x on a GPU within a < 1% error margin of the full precision baseline,
outperforming traditional 8-bit quantized inference by over 1.5x-2x at the same
error tolerance.
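To make the decomposition concrete, here is a minimal NumPy sketch of the idea (illustrative only: the function names are hypothetical, and the paper's speedups come from GPU kernels that evaluate each bitlayer with fast 1-bit operations, which this floating-point emulation merely mimics). A weight matrix quantized to B bits is split into B binary bitlayers; each bitlayer is multiplied against the full-precision activations and accumulated with its power-of-two weight.

```python
import numpy as np

def decompose_bitlayers(W, bits=8):
    """Symmetric per-tensor quantization of W to `bits` bits, returned as bitplanes.

    bitplanes[k] is a {0, 1} matrix holding bit k of the unsigned (offset)
    representation of the quantized weights; scale maps integers back to floats.
    """
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(W / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    u = (q + 2 ** (bits - 1)).astype(np.uint32)           # shift into [0, 2^bits)
    planes = [((u >> k) & 1).astype(np.float32) for k in range(bits)]
    return planes, scale

def precision_batched_matvec(planes, scale, x, num_bitlayers=None):
    """Accumulate bitlayer matvecs; `num_bitlayers` is the runtime accuracy/speed knob."""
    bits = len(planes)
    keep = bits if num_bitlayers is None else num_bitlayers
    acc = np.zeros(planes[0].shape[0], dtype=np.float32)
    for k in range(bits - 1, bits - 1 - keep, -1):        # most significant bitlayers first
        acc += (2.0 ** k) * (planes[k] @ x)               # 1-bit weights, fp activations
    offset = (2.0 ** (bits - 1)) * x.sum()                # undo the unsigned shift
    return scale * (acc - offset)

W = np.random.randn(64, 128).astype(np.float32)
x = np.random.randn(128).astype(np.float32)
planes, s = decompose_bitlayers(W, bits=8)
full = precision_batched_matvec(planes, s, x)                      # all 8 bitlayers
coarse = precision_batched_matvec(planes, s, x, num_bitlayers=4)   # top 4 bitlayers only
print(np.abs(W @ x - full).max(), np.abs(W @ x - coarse).max())
```

Accumulating all bitlayers reproduces the full 8-bit quantized result; passing a smaller `num_bitlayers` drops the low-order bitlayers, which is the runtime accuracy/speedup tradeoff the abstract describes.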
Related papers
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
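A minimal sketch of the lookup-table idea summarized in the DeepGEMM entry above, assuming 2-bit weights and activations (hypothetical names; on real SIMD hardware the gather would map to byte-shuffle instructions rather than NumPy indexing):

```python
import numpy as np

BITS = 2
LEVELS = 2 ** BITS

def build_product_lut(w_scale, a_scale):
    # Precompute every possible product of a 2-bit weight code and a
    # 2-bit activation code (dequantized value = code * scale).
    codes = np.arange(LEVELS, dtype=np.float32)
    return np.outer(codes * w_scale, codes * a_scale)      # shape (4, 4)

def lut_dot(w_codes, a_codes, lut):
    # Gather precomputed products instead of multiplying; only adds remain.
    return lut[w_codes, a_codes].sum()

rng = np.random.default_rng(0)
w_codes = rng.integers(0, LEVELS, size=256)                # quantized weight codes
a_codes = rng.integers(0, LEVELS, size=256)                # quantized activation codes
lut = build_product_lut(w_scale=0.05, a_scale=0.1)
print(lut_dot(w_codes, a_codes, lut))
```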
- DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference [28.912023025671868]
This work targets an adaptive data representation with variable-length encoding called DyBit.
We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade-off the inference accuracy and speedup.
Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization.
arXiv Detail & Related papers (2023-02-24T08:46:01Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
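For context on why accumulator width matters in the entry above, a data-independent worst-case bound on the signed accumulator width for a quantized dot product can be computed as below. This is only a conservative back-of-the-envelope calculation, not the paper's method, which instead trains the network so that a chosen, smaller accumulator width is provably overflow-free.

```python
import math

def min_accumulator_bits(dot_length, act_bits, weight_bits):
    """Smallest signed accumulator width that can never overflow for a dot
    product of `dot_length` terms of unsigned `act_bits` activations times
    signed `weight_bits` weights (conservative, data-independent bound)."""
    max_act = 2 ** act_bits - 1                  # largest unsigned activation
    max_w = 2 ** (weight_bits - 1)               # largest weight magnitude
    worst_case = dot_length * max_act * max_w    # largest possible |sum|
    return math.ceil(math.log2(worst_case + 1)) + 1   # +1 for the sign bit

# e.g. a 512-long dot product of 4-bit activations and 4-bit weights
print(min_accumulator_bits(512, act_bits=4, weight_bits=4))   # -> 17
```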
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
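A simplified sketch of block-wise quantization of an optimizer-state tensor as described above (plain linear int8 with one scale per block is used here for simplicity, and the block size is an arbitrary illustrative choice; the paper pairs block-wise scaling with a dynamic 8-bit data type):

```python
import numpy as np

def blockwise_quantize(state, block_size=256):
    """Quantize a flat optimizer-state tensor to int8, one scale per block."""
    pad = (-state.size) % block_size
    flat = np.concatenate([state.ravel(), np.zeros(pad, dtype=state.dtype)])
    blocks = flat.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                      # avoid division by zero
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales, state.shape, pad

def blockwise_dequantize(q, scales, shape, pad):
    flat = (q.astype(np.float32) * scales).ravel()
    return flat[:flat.size - pad].reshape(shape) if pad else flat.reshape(shape)

m = np.random.randn(1000).astype(np.float32) * 0.01   # e.g. Adam first moment
q, s, shape, pad = blockwise_quantize(m)
m_hat = blockwise_dequantize(q, s, shape, pad)
print(np.abs(m - m_hat).max())                         # small per-block error
```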
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming problem but is much easier to optimize.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [7.886868529510128]
Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors.
Excessive quantization, reducing precision too aggressively, results in accuracy degradation.
Per-vector scale factors can be implemented with low-bitwidth integers when using a two-level quantization scheme.
arXiv Detail & Related papers (2021-02-08T19:56:04Z)
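A simplified sketch of the two-level, per-vector scaling idea in the VS-Quant entry above (the function name, vector size, bit-widths, and unsigned handling of the second-level scales are illustrative assumptions):

```python
import numpy as np

def vs_quantize(W, vec_size=16, w_bits=4, scale_bits=4):
    """Per-vector scaled quantization with a two-level scale scheme:
    each vector of `vec_size` weights gets a low-bitwidth integer scale,
    and one floating-point scale per tensor calibrates those integer scales."""
    vecs = W.reshape(-1, vec_size)
    # First level: ideal per-vector floating-point scales.
    fp_scales = np.abs(vecs).max(axis=1, keepdims=True) / (2 ** (w_bits - 1) - 1)
    fp_scales[fp_scales == 0] = 1e-12
    # Second level: quantize the per-vector scales to unsigned integers
    # against a single per-tensor floating-point scale.
    tensor_scale = fp_scales.max() / (2 ** scale_bits - 1)
    int_scales = np.clip(np.round(fp_scales / tensor_scale), 1, 2 ** scale_bits - 1)
    eff_scales = int_scales * tensor_scale
    q = np.clip(np.round(vecs / eff_scales),
                -(2 ** (w_bits - 1)), 2 ** (w_bits - 1) - 1).astype(np.int8)
    return q, int_scales.astype(np.uint8), tensor_scale

W = np.random.randn(64, 64).astype(np.float32)
q, s_int, s_fp = vs_quantize(W)
W_hat = (q * (s_int * s_fp)).reshape(W.shape)
print(np.abs(W - W_hat).max())
```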
- On the quantization of recurrent neural networks [9.549757800469196]
Quantization of neural networks can be defined as the approximation of the high-precision computation of the canonical neural network formulation.
We present an integer-only quantization strategy for Long Short-Term Memory (LSTM) neural network topologies.
arXiv Detail & Related papers (2021-01-14T04:25:08Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
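The "dyadic" in the title above refers to rescaling with an integer multiply plus a bit shift instead of floating point. A generic illustration of such a dyadic requantization step (constants and helper names are assumptions, not the paper's code):

```python
import numpy as np

def to_dyadic(real_scale, mult_bits=15):
    """Approximate a real rescaling factor as a dyadic number m / 2**shift,
    so requantization needs only an integer multiply and a right shift."""
    shift = mult_bits - int(np.floor(np.log2(real_scale))) - 1
    m = int(round(real_scale * (1 << shift)))
    return m, shift

def requantize(acc_int32, m, shift):
    # Integer-only rescaling with rounding: (acc * m + 2**(shift-1)) >> shift
    return (acc_int32.astype(np.int64) * m + (1 << (shift - 1))) >> shift

# e.g. the combined scale s_act * s_w / s_out of a quantized layer
real_scale = 0.00123
m, shift = to_dyadic(real_scale)
acc = np.array([12345, -6789, 250000], dtype=np.int32)     # int32 accumulators
print(requantize(acc, m, shift))                           # integer-only result
print(np.round(acc * real_scale).astype(int))              # float reference, nearly identical
```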
- Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices [0.8362190332905524]
We show an efficient implementation of 4-bit matrix multiplication for quantized neural networks.
We also demonstrate a 4-bit quantized neural network for OCR recognition on the MIDV-500 dataset.
The results show that 4-bit quantization is well suited to mobile devices, yielding adequate accuracy and low inference time.
arXiv Detail & Related papers (2020-09-14T14:48:40Z)
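The storage side of the 4-bit inference described above can be shown with a small packing routine (a generic sketch, not the paper's SIMD kernels): two 4-bit weight codes share one byte, halving weight memory relative to int8 storage.

```python
import numpy as np

def pack_int4(codes):
    """Pack unsigned 4-bit codes (values 0..15) two per byte."""
    codes = codes.astype(np.uint8)
    if codes.size % 2:
        codes = np.append(codes, 0)          # pad to an even count
    return (codes[0::2] << 4) | codes[1::2]

def unpack_int4(packed, n):
    """Recover the first n 4-bit codes from packed bytes."""
    high = packed >> 4
    low = packed & 0x0F
    return np.stack([high, low], axis=1).ravel()[:n]

codes = np.random.randint(0, 16, size=1001)
packed = pack_int4(codes)                    # ~2x smaller than one byte per weight
print(packed.nbytes, np.array_equal(unpack_int4(packed, codes.size), codes))
```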
- Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We replace the conventional ReLU with a Bounded ReLU after finding that the accuracy decline is due to activation quantization.
Our integer networks achieve performance equivalent to the corresponding full-precision (FPN) networks, but have only 1/4 the memory cost and run 2x faster on a modern GPU.
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
- Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks [13.929168096016957]
We introduce a novel methodology for training deep neural networks using 8-bit floating point (FP8) numbers.
Reduced bit precision allows for a larger effective memory and increased computational speed.
We show that, unlike previous 8-bit precision training methods, the proposed method works out-of-the-box for representative models.
arXiv Detail & Related papers (2020-01-16T06:38:27Z)