UWC: Unit-wise Calibration Towards Rapid Network Compression
- URL: http://arxiv.org/abs/2201.06376v1
- Date: Mon, 17 Jan 2022 12:27:35 GMT
- Title: UWC: Unit-wise Calibration Towards Rapid Network Compression
- Authors: Chen Lin, Zheyang Li, Bo Peng, Haoji Hu, Wenming Tan, Ye Ren, Shiliang Pu
- Abstract summary: Post-training quantization (PTQ) achieves highly efficient Convolutional Neural Network (CNN) quantization with high performance.
Previous PTQ methods usually reduce compression error by performing layer-by-layer parameter calibration.
This work proposes a unit-wise feature reconstruction algorithm based on an observation from a second-order Taylor series expansion of the unit-wise error.
- Score: 36.74654186255557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a post-training quantization (PTQ) method achieving
highly efficient Convolutional Neural Network (CNN) quantization with high
performance. Previous PTQ methods usually reduce compression error by performing
layer-by-layer parameter calibration. However, with the lower representational
ability of extremely compressed parameters (e.g., bit-widths below 4), it is hard
to eliminate all the layer-wise errors. This work addresses the issue by proposing
a unit-wise feature reconstruction algorithm based on an observation from the
second-order Taylor series expansion of the unit-wise error, which indicates that
leveraging the interaction between adjacent layers' parameters can better
compensate for layer-wise errors. In this paper, we define several adjacent layers
as a Basic-Unit and present a unit-wise post-training algorithm that minimizes the
quantization error of the whole unit. This method achieves near-original accuracy
on ImageNet and COCO when quantizing FP32 models to INT4 and INT3.
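To make the unit-wise idea concrete, here is a minimal sketch (not the authors' released algorithm): two adjacent convolution layers are treated as one hypothetical Basic-Unit, and their fake-quantization scales are calibrated jointly by minimizing the reconstruction error of the unit's output on calibration data. The per-tensor scale parameterization, the ReLU between the layers, and the optimizer settings are assumptions.

```python
# Minimal unit-wise calibration sketch: jointly tune the weight-quantization
# scales of two adjacent conv layers so that the quantized unit reproduces the
# FP32 unit's output on calibration data (a simplification of UWC).
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant(w, scale, n_bits=4):
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    w_s = w / scale
    w_round = torch.clamp(torch.round(w_s), qmin, qmax)
    # straight-through estimator: rounded value forward, identity backward
    return ((w_round - w_s).detach() + w_s) * scale

class QuantUnit(nn.Module):
    """A hypothetical Basic-Unit: two adjacent conv layers quantized together."""
    def __init__(self, conv1: nn.Conv2d, conv2: nn.Conv2d, n_bits=4):
        super().__init__()
        self.conv1, self.conv2, self.n_bits = conv1, conv2, n_bits
        qmax = 2 ** (n_bits - 1) - 1
        self.s1 = nn.Parameter(conv1.weight.detach().abs().max() / qmax)
        self.s2 = nn.Parameter(conv2.weight.detach().abs().max() / qmax)

    def forward(self, x):
        w1 = fake_quant(self.conv1.weight, self.s1, self.n_bits)
        x = F.relu(F.conv2d(x, w1, self.conv1.bias,
                            self.conv1.stride, self.conv1.padding))
        w2 = fake_quant(self.conv2.weight, self.s2, self.n_bits)
        return F.conv2d(x, w2, self.conv2.bias,
                        self.conv2.stride, self.conv2.padding)

def calibrate_unit(fp_unit, q_unit, calib_batches, steps=200):
    """fp_unit: FP32 nn.Sequential(conv1, ReLU, conv2); calib_batches: input tensors."""
    opt = torch.optim.Adam([q_unit.s1, q_unit.s2], lr=1e-3)
    for _, x in zip(range(steps), calib_batches):
        with torch.no_grad():
            target = fp_unit(x)                      # FP32 unit output
        loss = F.mse_loss(q_unit(x), target)         # unit-wise reconstruction error
        opt.zero_grad(); loss.backward(); opt.step()
```

The key difference from layer-wise calibration is that the loss is taken on the output of the whole unit, so the two layers' quantization parameters can compensate for each other's errors.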
Related papers
- GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for DG.
Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization.
Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z)
- Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
arXiv Detail & Related papers (2024-11-26T15:35:44Z)
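The two ingredients named for HIGGS (a Hadamard rotation plus an MSE-optimal uniform grid) can be sketched roughly as below; the group size, the scale search range, and the per-group treatment are assumptions rather than the paper's actual recipe.

```python
# Data-free weight quantization sketch: rotate weight groups with an orthonormal
# Hadamard matrix, then quantize each group on the uniform grid whose clipping
# scale minimizes the MSE, and rotate back.
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so H @ H.T == I

def mse_optimal_uniform_quant(x, n_bits=4, n_candidates=64):
    qmax = 2 ** (n_bits - 1) - 1
    amax = np.abs(x).max() + 1e-12
    best_err, best_q = np.inf, x
    for clip in np.linspace(0.3, 1.0, n_candidates) * amax:
        scale = clip / qmax
        q = np.clip(np.round(x / scale), -qmax - 1, qmax) * scale
        err = np.mean((x - q) ** 2)
        if err < best_err:
            best_err, best_q = err, q
    return best_q

def quantize_weights(w, group_size=64, n_bits=4):
    flat = w.reshape(-1, group_size)      # assumes w.size is divisible by group_size
    H = hadamard(group_size)
    rotated = flat @ H                    # Hadamard rotation per group
    q = np.stack([mse_optimal_uniform_quant(g, n_bits) for g in rotated])
    return (q @ H.T).reshape(w.shape)     # rotate back to the original basis
```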
- Differential error feedback for communication-efficient decentralized learning [48.924131251745266]
We propose a new decentralized communication-efficient learning approach that blends differential quantization with error feedback.
We show that the resulting communication-efficient strategy is stable both in terms of mean-square error and average bit rate.
The results establish that, in the small step-size regime and with a finite number of bits, it is possible to attain the performance achievable in the absence of compression.
arXiv Detail & Related papers (2024-06-26T15:11:26Z)
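As a rough illustration of the two named ingredients (quantizing the change in the transmitted variable, with error feedback), here is a single-agent codec sketch; the uniform quantizer and the state layout are assumptions, and the decentralized combine step of the paper is omitted.

```python
# Differential quantization with error feedback (single-agent sketch): each
# round transmits a low-bit quantization of the change since the last message,
# plus the accumulated quantization error.
import numpy as np

def uniform_quantize(x, n_bits=2):
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

class DiffErrorFeedbackEncoder:
    def __init__(self, dim, n_bits=2):
        self.last_sent = np.zeros(dim)   # the receiver tracks the same estimate
        self.residual = np.zeros(dim)    # error-feedback memory
        self.n_bits = n_bits

    def encode(self, x):
        diff = x - self.last_sent + self.residual   # differential + error feedback
        msg = uniform_quantize(diff, self.n_bits)   # what actually gets transmitted
        self.residual = diff - msg                  # carry the loss to the next round
        self.last_sent += msg                       # receiver adds msg the same way
        return msg
```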
- Minimize Quantization Output Error with Bias Compensation [35.43358597502087]
Quantization is a promising method that reduces the memory usage and computational intensity of Deep Neural Networks (DNNs).
In this paper, we propose Bias Compensation, which minimizes the quantization output error to improve accuracy.
We conduct experiments on Vision models and Large Language Models.
arXiv Detail & Related papers (2024-04-02T12:29:31Z)
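Going only by the title, one common form of output-bias compensation looks like the sketch below: estimate the mean output error introduced by weight quantization on calibration data and fold it into the layer bias. The function name and the linear-layer setting are assumptions, not the paper's exact formulation.

```python
# Bias-compensation sketch for a linear layer: the mean output error caused by
# quantizing W is absorbed into the bias, b <- b + E_x[(W - Wq) x].
import numpy as np

def compensate_bias(W, Wq, bias, calib_x):
    """W, Wq: (out, in); bias: (out,); calib_x: (n_samples, in)."""
    err = calib_x @ (W - Wq).T            # per-sample output error, shape (n, out)
    return bias + err.mean(axis=0)        # shift the bias by the mean error

# Tiny sanity check on random data: after compensation, the mean output error
# on the calibration set is approximately zero.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)); x = rng.standard_normal((256, 16))
scale = np.abs(W).max() / 7
Wq = np.round(W / scale) * scale          # naive 4-bit-style rounding
b_new = compensate_bias(W, Wq, np.zeros(8), x)
print(np.abs((x @ W.T) - (x @ Wq.T + b_new)).mean())
```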
- COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization [8.214857267270807]
Post-training quantization (PTQ) has emerged as a practical approach to compress large neural networks.
We propose an innovative PTQ algorithm termed COMQ, which sequentially conducts coordinate-wise minimization of the layer-wise reconstruction errors.
COMQ achieves remarkable results in quantizing 4-bit Vision Transformers, with a negligible loss of less than 1% in Top-1 accuracy.
arXiv Detail & Related papers (2024-03-11T20:04:03Z)
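A loose, backpropagation-free sketch of coordinate-wise minimization of a layer's reconstruction error (in the spirit of the summary, not COMQ's exact update rule): each weight coordinate is repeatedly reset to the grid value that minimizes the residual with all other coordinates held fixed. The uniform grid and the number of sweeps are assumptions.

```python
# Coordinate descent on ||X w - X wq||^2 over a fixed uniform grid.
import numpy as np

def quantize_row_coordinatewise(w, X, n_bits=4, sweeps=3):
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    grid = lambda v: np.clip(np.round(v / scale), -qmax - 1, qmax) * scale
    wq = grid(w)                          # start from naive rounding
    r = X @ (w - wq)                      # current reconstruction residual
    col_sq = (X ** 2).sum(axis=0) + 1e-12
    for _ in range(sweeps):
        for j in range(w.size):
            a = X[:, j]
            # continuous minimizer of ||r + a * (wq_j - q)||^2, then snap to grid
            q_star = wq[j] + (a @ r) / col_sq[j]
            q_new = grid(q_star)
            r += a * (wq[j] - q_new)      # keep the residual consistent
            wq[j] = q_new
    return wq

def quantize_layer_coordinatewise(W, X, n_bits=4):
    """W: (out, in); X: (n_samples, in). Each output row is an independent problem."""
    return np.stack([quantize_row_coordinatewise(w, X, n_bits) for w in W])
```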
- Mixed-Precision Quantization with Cross-Layer Dependencies [6.338965603383983]
Mixed-precision quantization (MPQ) assigns varied bit-widths to layers to optimize the accuracy-efficiency trade-off.
Existing methods simplify the MPQ problem by assuming that quantization errors at different layers act independently.
We show that this assumption does not reflect the true behavior of quantized deep neural networks.
arXiv Detail & Related papers (2023-07-11T15:56:00Z)
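A tiny numerical check of the claim above: for two stacked layers, the end-to-end error of quantizing both is generally not the sum of the errors of quantizing each layer alone, so independent per-layer proxies can mis-rank bit-width assignments. The toy layer sizes and the naive quantizer are assumptions.

```python
# Toy check that layer-wise quantization errors do not add up independently.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 32))
W1 = rng.standard_normal((64, 32)); W2 = rng.standard_normal((16, 64))

def quant(W, n_bits):
    scale = np.abs(W).max() / (2 ** (n_bits - 1) - 1)
    return np.round(W / scale) * scale

def forward(W1_, W2_):
    return np.maximum(X @ W1_.T, 0) @ W2_.T    # linear-ReLU-linear toy network

ref = forward(W1, W2)
for b1, b2 in [(2, 4), (4, 2), (3, 3)]:
    e_joint = np.mean((forward(quant(W1, b1), quant(W2, b2)) - ref) ** 2)
    e_sum = (np.mean((forward(quant(W1, b1), W2) - ref) ** 2)
             + np.mean((forward(W1, quant(W2, b2)) - ref) ** 2))
    print(f"bits=({b1},{b2})  joint error={e_joint:.4f}  sum of layer errors={e_sum:.4f}")
```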
- CADyQ: Content-Aware Dynamic Quantization for Image Super-Resolution [55.50793823060282]
We propose a novel Content-Aware Dynamic Quantization (CADyQ) method for image super-resolution (SR) networks.
CADyQ allocates optimal bits to local regions and layers adaptively based on the local contents of an input image.
The pipeline has been tested on various SR networks and evaluated on several standard benchmarks.
arXiv Detail & Related papers (2022-07-21T07:50:50Z)
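A crude stand-in for the idea (not CADyQ's learned bit selector): assign more bits to image patches with richer local content, measured here by a simple gradient-magnitude proxy; the patch size, thresholds, and bit candidates are all assumptions.

```python
# Content-aware bit allocation sketch: busier patches (larger local gradients)
# get higher bit-widths. This is a hand-crafted proxy, not a learned policy.
import numpy as np

def allocate_bits(image, patch=16, candidates=(4, 6, 8)):
    """image: (H, W) array in [0, 1]; returns a per-patch bit-width map."""
    gy, gx = np.gradient(image)
    energy = np.sqrt(gx ** 2 + gy ** 2)
    bits = np.zeros((image.shape[0] // patch, image.shape[1] // patch), dtype=int)
    for i in range(bits.shape[0]):
        for j in range(bits.shape[1]):
            e = energy[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch].mean()
            # flat patches -> fewest bits, detailed patches -> most bits
            bits[i, j] = candidates[min(int(e / 0.05), len(candidates) - 1)]
    return bits
```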
- Low-rank Tensor Decomposition for Compression of Convolutional Neural Networks Using Funnel Regularization [1.8579693774597708]
We propose a model reduction method to compress pre-trained networks using low-rank tensor decomposition.
A new regularization method, called the funnel function, is proposed to suppress unimportant factors during compression.
For ResNet18 with ImageNet2012, our reduced model can reach more than two times speedup in terms of GMACs with merely a 0.7% Top-1 accuracy drop.
arXiv Detail & Related papers (2021-12-07T13:41:51Z)
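The generic low-rank step can be sketched as a truncated SVD of the flattened kernel that yields two smaller convolutions; the funnel regularization itself, which is the paper's contribution, is not reproduced here, and the rank is an assumed hyperparameter.

```python
# Replace a Conv2d by a rank-r pair: an (r, in, kh, kw) conv followed by a 1x1
# (out, r) conv, obtained from a truncated SVD of the flattened kernel.
import torch
import torch.nn as nn

def low_rank_conv(conv: nn.Conv2d, rank: int) -> nn.Sequential:
    out_c, in_c, kh, kw = conv.weight.shape
    W = conv.weight.detach().reshape(out_c, in_c * kh * kw)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                 # (out_c, rank)
    V_r = Vh[:rank]                              # (rank, in_c*kh*kw)

    first = nn.Conv2d(in_c, rank, (kh, kw), stride=conv.stride,
                      padding=conv.padding, bias=False)
    second = nn.Conv2d(rank, out_c, 1, bias=conv.bias is not None)
    first.weight.data.copy_(V_r.reshape(rank, in_c, kh, kw))
    second.weight.data.copy_(U_r.reshape(out_c, rank, 1, 1))
    if conv.bias is not None:
        second.bias.data.copy_(conv.bias.detach())
    return nn.Sequential(first, second)
```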
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- Compression of descriptor models for mobile applications [26.498907514590165]
We evaluate the computational cost, model size, and matching accuracy tradeoffs for deep neural networks.
We observe a significant redundancy in the learned weights, which we exploit through the use of depthwise separable layers.
We propose the Convolution-Depthwise-Pointwise (CDP) layer, which provides a means of interpolating between standard and depthwise separable convolutions.
arXiv Detail & Related papers (2020-01-09T17:00:21Z)
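A guess at the general shape of such a layer, based only on the summary: a grouped spatial convolution followed by a 1x1 pointwise convolution, where the group count slides between a standard convolution (groups=1) and a depthwise separable one (groups=in_channels). The actual CDP parameterization in the paper may differ.

```python
# CDP-style layer sketch: `groups` interpolates between a standard conv + 1x1
# (groups=1) and a depthwise separable convolution (groups=in_ch).
import torch.nn as nn

class CDPLayer(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, groups=1, padding=1):
        super().__init__()
        assert in_ch % groups == 0, "groups must divide in_channels"
        self.spatial = nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding,
                                 groups=groups, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=True)

    def forward(self, x):
        return self.pointwise(self.spatial(x))

# groups=1     -> full spatial mixing (close to a standard convolution block)
# groups=in_ch -> depthwise separable convolution
```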