SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation
- URL: http://arxiv.org/abs/2202.07471v1
- Date: Mon, 14 Feb 2022 01:57:33 GMT
- Title: SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation
- Authors: Cong Guo, Yuxian Qiu, Jingwen Leng, Xiaotian Gao, Chen Zhang, Yunxin Liu, Fan Yang, Yuhao Zhu, Minyi Guo
- Abstract summary: Quantization of deep neural networks (DNNs) has proven effective for compressing and accelerating models.
Data-free quantization (DFQ) is a promising approach that needs no original dataset, making it suitable for privacy-sensitive and confidential scenarios.
This paper proposes an on-the-fly DFQ framework with sub-second quantization time, called SQuant, which can quantize networks on inference-only devices.
- Score: 22.782678826199206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantization of deep neural networks (DNNs) has proven effective for
compressing and accelerating DNN models. Data-free quantization (DFQ) is a
promising approach that needs no original dataset, making it suitable for
privacy-sensitive and confidential scenarios. However, current DFQ solutions
degrade accuracy, need synthetic data to calibrate networks, and are
time-consuming and costly. This paper proposes an on-the-fly DFQ framework with
sub-second quantization time, called SQuant, which can quantize networks on
inference-only devices with low computation and memory requirements. Based on a
theoretical analysis of the second-order information of the DNN task loss, we
decompose and approximate the Hessian-based optimization objective into three
diagonal sub-items, each acting on a different granularity of the weight
tensor: element-wise, kernel-wise, and output channel-wise. We then
progressively compose these sub-items and propose a novel data-free
optimization objective in the discrete domain, minimizing the Constrained
Absolute Sum of Error (CASE for short), which surprisingly requires no dataset
and is not even aware of the network architecture. We also design an efficient
algorithm without back-propagation to further reduce the computational
complexity of the objective solver. Finally, without fine-tuning or synthetic
datasets, SQuant accelerates the data-free quantization process to a sub-second
level with a more than 30% accuracy improvement over existing data-free
post-training quantization works on the evaluated models under 4-bit
quantization. We have open-sourced the SQuant framework at
https://github.com/clevercool/SQuant.
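The decomposition above reduces the data-free objective to keeping the accumulated rounding error small at the element, kernel, and output-channel granularities, which can be pursued by flipping individual roundings rather than back-propagating. The snippet below is a minimal, illustrative Python sketch of that flipping idea, not the authors' released implementation (see the linked repository for that): the per-output-channel symmetric scale, the 0.5 error bound, the function name `squant_like_quantize`, and the restriction to the element-wise and kernel-wise terms are simplifying assumptions.

```python
# Illustrative sketch only: data-free, backprop-free weight rounding in the
# spirit of SQuant's CASE objective. The layout (out_ch, in_ch, kh, kw), the
# per-channel scale, and the 0.5 error bound are assumptions for illustration.
import numpy as np


def squant_like_quantize(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Quantize a conv weight tensor to signed n_bits codes without any data."""
    qmax = 2 ** (n_bits - 1) - 1
    w_q = np.empty(w.shape, dtype=np.int32)

    for oc in range(w.shape[0]):
        scale = np.abs(w[oc]).max() / qmax + 1e-12   # per-output-channel scale
        x = w[oc] / scale                            # continuous codes
        q = np.rint(x)                               # element-wise: round to nearest
        e = q - x                                    # per-element rounding error

        # Kernel-wise pass: flip roundings so each kernel's summed error stays
        # within +/-0.5, a stand-in for the Constrained Absolute Sum of Error.
        for ic in range(w.shape[1]):
            err_sum = e[ic].sum()
            order = np.argsort(-np.abs(e[ic]), axis=None)   # largest errors first
            for idx in order:
                if abs(err_sum) <= 0.5:
                    break
                pos = np.unravel_index(idx, e[ic].shape)
                if e[ic][pos] * err_sum > 0:         # flipping this element shrinks |sum|
                    delta = -np.sign(e[ic][pos])     # move the rounding one step
                    q[ic][pos] += delta
                    e[ic][pos] += delta
                    err_sum += delta

        w_q[oc] = np.clip(q, -qmax - 1, qmax).astype(np.int32)  # clipping is a simplification
    return w_q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(8, 16, 3, 3)).astype(np.float32)
    codes = squant_like_quantize(w, n_bits=4)
    print(codes.shape, codes.min(), codes.max())
```

The full SQuant objective also composes an output channel-wise term and weights the sub-items with Hessian-derived coefficients; this sketch omits both for brevity.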
Related papers
- NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
arXiv Detail & Related papers (2022-09-29T04:06:00Z)
- SVNet: Where SO(3) Equivariance Meets Binarization on Point Cloud Representation [65.4396959244269]
The paper tackles the challenge by designing a general framework to construct 3D learning architectures that combine SO(3) equivariance with network binarization.
The proposed approach can be applied to general backbones like PointNet and DGCNN.
Experiments on ModelNet40, ShapeNet, and the real-world dataset ScanObjectNN demonstrate that the method achieves a good trade-off between efficiency, rotation robustness, and accuracy.
arXiv Detail & Related papers (2022-09-13T12:12:19Z)
- Quantune: Post-training Quantization of Convolutional Neural Networks using Extreme Gradient Boosting for Fast Deployment [15.720551497037176]
We propose an auto-tuner known as Quantune to accelerate the search for quantization configurations.
We show that Quantune reduces the search time for quantization by approximately 36.5x with an accuracy loss of 0.07-0.65% across six CNN models.
arXiv Detail & Related papers (2022-02-10T14:05:02Z)
- Q-SpiNN: A Framework for Quantizing Spiking Neural Networks [14.727296040550392]
A prominent technique for reducing the memory footprint of Spiking Neural Networks (SNNs) without decreasing the accuracy significantly is quantization.
We propose Q-SpiNN, a novel quantization framework for memory-efficient SNNs.
For the unsupervised network, Q-SpiNN reduces the memory footprint by about 4x while maintaining accuracy within 1% of the baseline on the MNIST dataset.
For the supervised network, Q-SpiNN reduces the memory footprint by about 2x while keeping accuracy within 2% of the baseline on the DVS-Gesture dataset.
arXiv Detail & Related papers (2021-07-05T06:01:15Z)
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models [56.21470608621633]
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
We compare estimation accuracy and fidelity of the generated mixed models, statistical models with the roofline model, and a refined roofline model for evaluation.
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
- Filter Pre-Pruning for Improved Fine-tuning of Quantized Deep Neural Networks [0.0]
We propose a new pruning method called Pruning for Quantization (PfQ) which removes the filters that disturb the fine-tuning of the DNN.
Experiments using well-known models and datasets confirmed that the proposed method achieves higher performance with a similar model size.
arXiv Detail & Related papers (2020-11-13T04:12:54Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [49.3037538647714]
We present APQ for efficient deep learning inference on resource-constrained hardware.
Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner.
With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
arXiv Detail & Related papers (2020-06-15T16:09:17Z)
- VecQ: Minimal Loss DNN Model Compression With Vectorized Weight Quantization [19.66522714831141]
We develop a new quantization solution called VecQ, which can guarantee minimal direct quantization loss and better model accuracy.
In addition, to speed up the quantization process during training, we accelerate it with a parameterized estimation and probability-based calculation.
arXiv Detail & Related papers (2020-05-18T07:38:44Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)