VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision
Neural Network Inference
- URL: http://arxiv.org/abs/2102.04503v1
- Date: Mon, 8 Feb 2021 19:56:04 GMT
- Title: VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision
Neural Network Inference
- Authors: Steve Dai, Rangharajan Venkatesan, Haoxing Ren, Brian Zimmer, William
J. Dally, Brucek Khailany
- Abstract summary: Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors.
Excessive quantization, reducing precision too aggressively, results in accuracy degradation.
Per-vector scale factors can be implemented with low-bitwidth integers when using a two-level quantization scheme.
- Score: 7.886868529510128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization enables efficient acceleration of deep neural networks by
reducing model memory footprint and exploiting low-cost integer math hardware
units. Quantization maps floating-point weights and activations in a trained
model to low-bitwidth integer values using scale factors. Excessive
quantization, reducing precision too aggressively, results in accuracy
degradation. When scale factors are shared at a coarse granularity across many
dimensions of each tensor, the effective precision of individual elements within
the tensor is limited. To reduce quantization-related accuracy loss, we
propose using a separate scale factor for each small vector of ($\approx$16-64)
elements within a single dimension of a tensor. To achieve an efficient
hardware implementation, the per-vector scale factors can be implemented with
low-bitwidth integers when calibrated using a two-level quantization scheme. We
find that per-vector scaling consistently achieves better inference accuracy at
low precision compared to conventional scaling techniques for popular neural
networks without requiring retraining. We also modify a deep learning
accelerator hardware design to study the area and energy overheads of
per-vector scaling support. Our evaluation demonstrates that per-vector scaled
quantization with 4-bit weights and activations achieves 37% area saving and
24% energy saving while maintaining over 75% accuracy for ResNet50 on ImageNet.
4-bit weights and 8-bit activations achieve near-full-precision accuracy for
both BERT-base and BERT-large on SQuAD while reducing area by 26% compared to
an 8-bit baseline.
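As a rough illustration of the scheme described in the abstract, the NumPy sketch below assigns one scale factor to each vector of 16 elements along the last dimension of a weight matrix and stores the per-vector scales as low-bitwidth integers against a single coarse floating-point scale (the two-level scheme). The vector size, bitwidths, max-abs calibration, and the function name `vs_quant` are illustrative assumptions, not the paper's exact calibration procedure or hardware mapping.

```python
# Hedged sketch of per-vector scaled quantization with a two-level scale
# hierarchy. Parameter choices (vec_size=16, 4-bit elements, 4-bit integer
# scales, max-abs calibration) are assumptions for illustration only.
import numpy as np

def vs_quant(weights, vec_size=16, elem_bits=4, scale_bits=4):
    """Quantize a 2-D matrix with one scale per vector of `vec_size` elements."""
    qmax_elem = 2 ** (elem_bits - 1) - 1        # e.g. 7 for signed 4-bit values
    qmax_scale = 2 ** scale_bits - 1            # e.g. 15 for unsigned 4-bit scales

    rows, cols = weights.shape
    assert cols % vec_size == 0
    vecs = weights.reshape(rows, cols // vec_size, vec_size)

    # Level 1: ideal floating-point scale per vector (max-abs calibration).
    fp_scales = np.abs(vecs).max(axis=-1) / qmax_elem      # shape (rows, n_vecs)

    # Level 2: quantize the per-vector scales themselves with one coarse
    # floating-point scale shared across the whole tensor.
    coarse_scale = fp_scales.max() / qmax_scale
    int_scales = np.clip(np.round(fp_scales / coarse_scale), 1, qmax_scale)

    # Effective per-vector scale = low-bit integer scale * coarse fp scale.
    eff_scales = int_scales * coarse_scale
    q = np.clip(np.round(vecs / eff_scales[..., None]), -qmax_elem - 1, qmax_elem)
    dequant = (q * eff_scales[..., None]).reshape(rows, cols)
    return q.astype(np.int8), int_scales.astype(np.uint8), coarse_scale, dequant

# Usage: compare reconstruction error against a single per-tensor 4-bit scale.
w = np.random.randn(64, 64).astype(np.float32)
_, _, _, w_hat = vs_quant(w)
s = np.abs(w).max() / 7
w_base = np.clip(np.round(w / s), -8, 7) * s
print("per-vector MSE:", np.mean((w - w_hat) ** 2))
print("per-tensor MSE:", np.mean((w - w_base) ** 2))
```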
Related papers
- Neural Precision Polarization: Simplifying Neural Network Inference with Dual-Level Precision [0.4124847249415279]
A floating-point model can be trained in the cloud and then downloaded to an edge device.
Network weights and activations are directly quantized to meet the edge devices' desired level, such as NF4 or INT8.
We show that neural precision polarization enables MAC efficiency of approximately 464 TOPS per Watt while maintaining reliability.
arXiv Detail & Related papers (2024-11-06T16:02:55Z)
- Low-Precision Floating-Point for Efficient On-Board Deep Neural Network Processing [0.9374652839580183]
We study how to combine low precision (mini) floating-point arithmetic with a Quantization-Aware Training methodology.
Our results show that 6-bit floating-point quantization for both weights and activations can compete with single-precision.
An initial hardware study also confirms the potential impact of such low-precision floating-point designs.
arXiv Detail & Related papers (2023-11-18T21:36:52Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
- Convolutional Neural Networks Quantization with Attention [1.0312968200748118]
We propose a double-stage Squeeze-and-Threshold (double-stage ST) method.
It uses the attention mechanism to quantize networks and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-09-30T08:48:31Z)
- n-hot: Efficient bit-level sparsity for powers-of-two neural network quantization [0.0]
Powers-of-two (PoT) quantization reduces the number of bit operations of deep neural networks on resource-constrained hardware.
However, PoT quantization suffers a severe accuracy drop because of its limited representation ability.
We propose an efficient PoT quantization scheme that balances accuracy and cost in a memory-efficient way. (A generic PoT sketch appears after this list.)
arXiv Detail & Related papers (2021-03-22T10:13:12Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose to treat the discrete weights in an arbitrary quantized neural network as searchable variables and use a differentiable method to search for them accurately.
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in the original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
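The n-hot entry above refers to powers-of-two (PoT) quantization. As a generic, hedged illustration of that idea (not the n-hot scheme itself), the sketch below rounds each weight to a signed power of two with a clamped exponent, so that a hardware multiply reduces to a shift; the exponent range is an assumed example.

```python
# Generic powers-of-two (PoT) weight quantization sketch in NumPy.
# Each weight maps to sign * 2^e with e clamped to [e_min, e_max]; this is an
# illustrative baseline, not the n-hot paper's specific scheme.
import numpy as np

def pot_quantize(w, e_min=-6, e_max=0):
    """Round each weight to the nearest signed power of two (in the log domain)."""
    sign = np.sign(w)
    mag = np.abs(w)
    # Nearest exponent in the log domain; guard against log2(0).
    e = np.round(np.log2(np.where(mag > 0, mag, 2.0 ** e_min)))
    e = np.clip(e, e_min, e_max)
    q = sign * (2.0 ** e)
    # Map magnitudes far below 2^e_min to exact zero.
    q[mag < 2.0 ** (e_min - 1)] = 0.0
    return q

w = np.random.randn(4, 4).astype(np.float32) * 0.5
print(pot_quantize(w))
```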