BRECQ: Pushing the Limit of Post-Training Quantization by Block
Reconstruction
- URL: http://arxiv.org/abs/2102.05426v1
- Date: Wed, 10 Feb 2021 13:46:16 GMT
- Title: BRECQ: Pushing the Limit of Post-Training Quantization by Block
Reconstruction
- Authors: Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei
Yu, Wei Wang, Shi Gu
- Abstract summary: We study the challenging task of neural network quantization without end-to-end retraining, called Post-training Quantization (PTQ).
We propose a novel PTQ framework, dubbed BRECQ, which pushes the limits of bitwidth in PTQ down to INT2 for the first time.
For the first time we prove that, without bells and whistles, PTQ can attain 4-bit ResNet and MobileNetV2 comparable with QAT and enjoy 240 times faster production of quantized models.
- Score: 29.040991149922615
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We study the challenging task of neural network quantization without
end-to-end retraining, called Post-training Quantization (PTQ). PTQ usually
requires a small subset of training data but produces less powerful quantized
models than Quantization-Aware Training (QAT). In this work, we propose a novel
PTQ framework, dubbed BRECQ, which pushes the limits of bitwidth in PTQ down to
INT2 for the first time. BRECQ leverages the basic building blocks in neural
networks and reconstructs them one-by-one. In a comprehensive theoretical study
of the second-order error, we show that BRECQ achieves a good balance between
cross-layer dependency and generalization error. To further employ the power of
quantization, the mixed precision technique is incorporated in our framework by
approximating the inter-layer and intra-layer sensitivity. Extensive
experiments on various handcrafted and searched neural architectures are
conducted for both image classification and object detection tasks. And for the
first time we prove that, without bells and whistles, PTQ can attain 4-bit
ResNet and MobileNetV2 comparable with QAT and enjoy 240 times faster
production of quantized models. Codes are available at
https://github.com/yhhhli/BRECQ.
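To make the block reconstruction idea concrete, here is a minimal sketch, assuming a PyTorch setting with a small calibration set: each basic block is quantized and its quantization parameters are tuned to minimize the block's output reconstruction error against the full-precision block. Names such as `QuantConv2d` and `reconstruct_block` are illustrative, and the mixed-precision part of the framework is omitted; this is not the authors' implementation (see the repository above for that).
```python
# Minimal sketch of block-wise reconstruction for PTQ (not the official BRECQ code).
# Assumptions: PyTorch, a small calibration set, per-tensor symmetric weight quantization
# with a learnable scale; QuantConv2d and reconstruct_block are illustrative names.
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quantize(w: torch.Tensor, scale: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Uniform symmetric fake quantization with a straight-through estimator for rounding."""
    qmax = 2 ** (n_bits - 1) - 1
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    w_int = (w_int - w / scale).detach() + w / scale  # straight-through rounding
    return w_int * scale


class QuantConv2d(nn.Module):
    """Wraps a float conv layer and fake-quantizes its weight on the fly."""

    def __init__(self, conv: nn.Conv2d, n_bits: int = 4):
        super().__init__()
        self.conv, self.n_bits = conv, n_bits
        qmax = 2 ** (n_bits - 1) - 1
        self.scale = nn.Parameter(conv.weight.detach().abs().max() / qmax)

    def forward(self, x):
        w_q = fake_quantize(self.conv.weight, self.scale, self.n_bits)
        return F.conv2d(x, w_q, self.conv.bias, self.conv.stride,
                        self.conv.padding, self.conv.dilation, self.conv.groups)


def reconstruct_block(fp_block: nn.Module, q_block: nn.Module, calib_inputs, steps=2000, lr=1e-3):
    """Tune the quantization scales of one block so its output matches the float block's output."""
    scales = [p for name, p in q_block.named_parameters() if name.endswith("scale")]
    opt = torch.optim.Adam(scales, lr=lr)
    fp_block.eval()
    for step in range(steps):
        x = calib_inputs[step % len(calib_inputs)]
        with torch.no_grad():
            target = fp_block(x)                 # float block output = reconstruction target
        loss = F.mse_loss(q_block(x), target)    # block-wise output reconstruction error
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_block

# Usage sketch: for each basic block of a pretrained network, deep-copy it, swap its conv
# layers for QuantConv2d, and call reconstruct_block with a handful of calibration batches.
```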
Related papers
- Projected Stochastic Gradient Descent with Quantum Annealed Binary Gradients [51.82488018573326]
We present QP-SBGD, a novel layer-wise optimiser tailored towards training neural networks with binary weights.
Binary neural networks (BNNs) reduce the computational requirements and energy consumption of deep learning models with minimal loss in accuracy.
Our algorithm is implemented layer-wise, making it suitable to train larger networks on resource-limited quantum hardware.
arXiv Detail & Related papers (2023-10-23T17:32:38Z)
- RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models [14.07649230604283]
We propose low complexity changes to the quantization aware training (QAT) process to improve model accuracy.
The improved accuracy opens up the possibility of exploiting some of the other benefits of noise-based QAT.
arXiv Detail & Related papers (2023-05-24T19:45:56Z)
- RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers [2.114921680609289]
We propose RepQ-ViT, a novel PTQ framework for vision transformers (ViTs).
RepQ-ViT decouples the quantization and inference processes.
It can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level.
arXiv Detail & Related papers (2022-12-16T02:52:37Z)
- RAPQ: Rescuing Accuracy for Power-of-Two Low-bit Post-training Quantization [4.8018862391424095]
We introduce a Power-of-Two post-training quantization (PTQ) method for deep neural networks that meets hardware requirements.
We propose a novel Power-of-Two PTQ framework, dubbed RAPQ, which dynamically adjusts the Power-of-Two scales of the whole network.
We are the first to propose PTQ for the more constrained but hardware-friendly Power-of-Two quantization, and prove that it can achieve nearly the same accuracy as state-of-the-art PTQ methods.
arXiv Detail & Related papers (2022-04-26T14:02:04Z)
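For context on what the Power-of-Two constraint above means in practice, the sketch below rounds the quantization scale to the nearest power of two so that rescaling can be realized with bit-shifts; it is a generic illustration under that assumption, not RAPQ's dynamic scale-adjustment procedure.
```python
# Illustrative Power-of-Two (PoT) uniform quantizer: the scale is rounded to the nearest
# power of two, so rescaling can be realised as a bit-shift in hardware.
# Generic sketch, not RAPQ itself.
import torch


def po2_quantize(x: torch.Tensor, n_bits: int = 4):
    qmax = 2 ** (n_bits - 1) - 1
    # Float scale that would map the max magnitude onto the integer grid ...
    float_scale = x.abs().max() / qmax
    # ... constrained to the nearest power of two: scale = 2^k with integer k.
    k = torch.round(torch.log2(float_scale))
    scale = torch.pow(2.0, k)
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return x_int * scale, scale


if __name__ == "__main__":
    w = torch.randn(64, 3, 3, 3)
    w_q, s = po2_quantize(w, n_bits=4)
    print("power-of-two scale:", s.item(), "max abs error:", (w - w_q).abs().max().item())
```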
- QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization [54.44028700760694]
Post-training quantization (PTQ) has attracted much attention as a way to produce efficient neural networks without lengthy retraining.
In this study, we confirm for the first time that properly incorporating activation quantization into the PTQ reconstruction benefits the final accuracy.
Based on this conclusion, a simple yet effective approach dubbed QDROP is proposed, which randomly drops the quantization of activations during PTQ.
arXiv Detail & Related papers (2022-03-11T04:01:53Z)
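A rough sketch of the mechanism described in the QDROP entry above, randomly keeping full-precision activations instead of their quantized values during PTQ calibration, is given below; the module name `QDropActivation`, the element-wise dropping, and the drop probability are illustrative assumptions rather than the paper's implementation.
```python
# Rough sketch of randomly dropping activation quantization during PTQ reconstruction,
# in the spirit of the QDROP summary above (not the official implementation).
import torch
import torch.nn as nn


def fake_quantize_act(x: torch.Tensor, scale: float, n_bits: int = 4) -> torch.Tensor:
    qmax = 2 ** n_bits - 1  # unsigned grid for post-ReLU activations
    return torch.clamp(torch.round(x / scale), 0, qmax) * scale


class QDropActivation(nn.Module):
    """During calibration, each element keeps its full-precision value with probability drop_prob."""

    def __init__(self, scale: float, n_bits: int = 4, drop_prob: float = 0.5):
        super().__init__()
        self.scale, self.n_bits, self.drop_prob = scale, n_bits, drop_prob

    def forward(self, x):
        x_q = fake_quantize_act(x, self.scale, self.n_bits)
        if self.training:  # reconstruction / calibration phase
            keep_fp = torch.rand_like(x) < self.drop_prob
            return torch.where(keep_fp, x, x_q)
        return x_q  # at inference, all activations are quantized
```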
- MQBench: Towards Reproducible and Deployable Model Quantization Benchmark [53.12623958951738]
MQBench is a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability of model quantization algorithms.
We choose multiple platforms for real-world deployment, including CPU, GPU, ASIC, and DSP, and evaluate an extensive set of state-of-the-art quantization algorithms.
We conduct a comprehensive analysis and uncover a number of intuitive and counter-intuitive insights.
arXiv Detail & Related papers (2021-11-05T23:38:44Z)
- Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization (PTQ) of pre-trained language models (PLMs), and propose module-wise reconstruction error minimization (MREM), an efficient solution to mitigate these issues.
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z)
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
- A White Paper on Neural Network Quantization [20.542729144379223]
We introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network's performance.
We consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).
arXiv Detail & Related papers (2021-06-15T17:12:42Z)
- Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
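As background for the quantization noise discussed in the white paper entry above, the snippet below shows a generic uniform affine quantizer with a scale and zero-point; it is a textbook-style sketch, not code from the paper.
```python
# Generic uniform affine quantization: x is approximated by scale * (x_int - zero_point).
# Textbook-style sketch to illustrate where quantization noise comes from.
import torch


def affine_quantize(x: torch.Tensor, n_bits: int = 8):
    qmin, qmax = 0, 2 ** n_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(qmin - x_min / scale).clamp(qmin, qmax)
    x_int = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    x_dq = (x_int - zero_point) * scale          # dequantized (noisy) value
    return x_dq, scale, zero_point


if __name__ == "__main__":
    x = torch.randn(1000)
    x_dq, s, z = affine_quantize(x, n_bits=8)
    print("mean squared quantization noise:", (x - x_dq).pow(2).mean().item())
```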
We propose to combine Network Architecture Search (NAS) methods with quantization to enjoy the merits of both.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
Then a bit-inheritance scheme is introduced to transfer the quantized models to lower bit-widths, which further reduces the time cost and improves the quantization accuracy.
arXiv Detail & Related papers (2020-10-09T03:52:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.