OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization
- URL: http://arxiv.org/abs/2205.11141v1
- Date: Mon, 23 May 2022 09:05:25 GMT
- Title: OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization
- Authors: Peng Hu, Xi Peng, Hongyuan Zhu, Mohamed M. Sabry Aly, Jie Lin
- Abstract summary: We propose a novel One-shot Pruning-Quantization (OPQ) method in this paper.
OPQ analytically solves the compression allocation with pre-trained weight parameters only.
We propose a unified channel-wise quantization method that enforces all channels of each layer to share a common codebook.
- Score: 32.60139548889592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Deep Neural Networks (DNNs) are usually overparameterized and have
millions of weight parameters, it is challenging to deploy these large DNN
models on resource-constrained hardware platforms, e.g., smartphones. Numerous
network compression methods such as pruning and quantization are proposed to
reduce the model size significantly, of which the key is to find suitable
compression allocation (e.g., pruning sparsity and quantization codebook) of
each layer. Existing solutions obtain the compression allocation in an
iterative/manual fashion while finetuning the compressed model, and thus suffer
from poor efficiency. Different from prior art, we propose a novel
One-shot Pruning-Quantization (OPQ) method in this paper, which analytically solves
the compression allocation with pre-trained weight parameters only. During
finetuning, the compression module is fixed and only weight parameters are
updated. To our knowledge, OPQ is the first work to reveal that a pre-trained model
is sufficient for solving pruning and quantization simultaneously, without any
complex iterative/manual optimization at the finetuning stage. Furthermore, we
propose a unified channel-wise quantization method that enforces all channels
of each layer to share a common codebook, which leads to low bit-rate
allocation without the extra overhead introduced by traditional
channel-wise quantization. Comprehensive experiments on ImageNet with
AlexNet/MobileNet-V1/ResNet-50 show that our method improves accuracy and
training efficiency while obtaining significantly higher compression rates
compared to the state-of-the-art.
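To make the two ideas in the abstract concrete, below is a minimal NumPy sketch of layer-wise pruning followed by a unified per-layer codebook, assuming simple magnitude-based pruning and uniform symmetric quantization. The function name and hyperparameters are illustrative only; the paper's analytical derivation of per-layer sparsity and bit allocation from the pre-trained weights is not reproduced here.

```python
# Illustrative sketch only: magnitude pruning + one codebook shared by all
# channels of a layer. Not the paper's analytical allocation procedure.
import numpy as np

def prune_and_quantize_layer(weights, sparsity=0.5, n_bits=4):
    """Prune a layer by magnitude, then quantize all of its channels with a
    single shared (unified) codebook instead of one codebook per channel."""
    w = weights.copy()

    # 1) Pruning: zero out the smallest-magnitude weights of this layer.
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= threshold            # fixed after the one-shot step
    w *= mask

    # 2) Unified channel-wise quantization: one uniform, symmetric codebook
    #    with 2^n_bits levels shared by every channel of the layer.
    levels = 2 ** n_bits
    scale = np.abs(w).max() / (levels / 2 - 1) + 1e-12
    codes = np.clip(np.round(w / scale), -(levels // 2), levels // 2 - 1)
    w_hat = codes * scale                    # de-quantized weights for finetuning

    return w_hat, mask, scale

# Hypothetical usage: during finetuning, `mask` and `scale` stay fixed and only
# the surviving weights are updated.
pretrained = np.random.randn(64, 3, 3, 3)    # e.g. a conv layer: (out, in, kh, kw)
w_hat, mask, scale = prune_and_quantize_layer(pretrained, sparsity=0.6, n_bits=4)
```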
Related papers
- DeepHQ: Learned Hierarchical Quantizer for Progressive Deep Image Coding [27.875207681547074]
Progressive image coding (PIC) aims to compress images at various quality levels into a single bitstream.
Research on neural network (NN)-based PIC is in its early stages.
We propose an NN-based progressive coding method that first utilizes learned quantization step sizes for each quantization layer.
arXiv: 2024-08-22
- Retraining-free Model Quantization via One-Shot Weight-Coupling Learning [41.299675080384]
Mixed-precision quantization (MPQ) is advocated to compress models effectively by allocating heterogeneous bit-widths across layers.
MPQ is typically organized into a searching-retraining two-stage process.
In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression.
arXiv: 2024-01-03
- Rotation Invariant Quantization for Model Compression [7.633595230914364]
Post-training Neural Network (NN) model compression is an attractive approach for deploying large, memory-consuming models on devices with limited memory resources.
We suggest a Rotation-Invariant Quantization (RIQ) technique that utilizes a single parameter to quantize the entire NN model.
arXiv: 2023-03-03
- Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model.
arXiv: 2022-12-10
- Implicit Neural Representations for Image Compression [103.78615661013623]
Implicit Neural Representations (INRs) have gained attention as a novel and effective representation for various data types.
We propose the first comprehensive compression pipeline based on INRs including quantization, quantization-aware retraining and entropy coding.
We find that our approach to source compression with INRs vastly outperforms similar prior work.
arXiv: 2021-12-08
- Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which combines channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv: 2021-05-24
- BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization [32.770842274996774]
Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks.
Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space.
This work proposes bit-level sparsity quantization (BSQ) to tackle the mixed-precision quantization from a new angle of inducing bit-level sparsity.
arXiv: 2021-02-20
- Single-path Bit Sharing for Automatic Loss-aware Model Compression [126.98903867768732]
Single-path Bit Sharing (SBS) is able to significantly reduce computational cost while achieving promising performance.
Our SBS compressed MobileNetV2 achieves 22.6x Bit-Operation (BOP) reduction with only 0.1% drop in the Top-1 accuracy.
arXiv: 2021-01-13
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients are approximated with the Straight-Through Estimator (see the sketch after this list).
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv: 2020-04-15
- Kernel Quantization for Efficient Network Compression [59.55192551370948]
Kernel Quantization (KQ) aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant performance loss.
Inspired by the evolution from weight pruning to filter pruning, we propose to quantize at both the kernel and weight levels.
Experiments on the ImageNet classification task prove that KQ needs 1.05 and 1.62 bits on average in VGG and ResNet18, respectively, to represent each parameter in the convolution layer.
arXiv: 2020-03-11
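The "Training with Quantization Noise for Extreme Model Compression" entry above rests on quantization-aware training with a straight-through estimator. Below is a minimal PyTorch sketch of that baseline, with a simple random-subset variant standing in for the quantization-noise idea; `quantize_ste`, `quant_noise`, and all hyperparameters are illustrative assumptions, not that paper's exact scheme.

```python
import torch

def quantize_ste(w, n_bits=8):
    """Fake-quantize w to n_bits; gradients pass through unchanged (straight-through)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.detach().abs().max() / qmax + 1e-12
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()            # forward: quantized value; backward: identity

def quant_noise(w, p=0.5, n_bits=8):
    """Quantize only a random fraction p of the weights each forward pass;
    the rest stay in full precision (the 'quantization noise' idea)."""
    mask = (torch.rand_like(w) < p).float()
    return mask * quantize_ste(w, n_bits) + (1.0 - mask) * w

# Hypothetical usage inside a training step:
layer = torch.nn.Linear(128, 64)
x = torch.randn(32, 128)
y = torch.nn.functional.linear(x, quant_noise(layer.weight, p=0.5), layer.bias)
loss = y.pow(2).mean()
loss.backward()                              # gradients reach layer.weight via the straight-through path
```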