OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization
- URL: http://arxiv.org/abs/2205.11141v1
- Date: Mon, 23 May 2022 09:05:25 GMT
- Title: OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization
- Authors: Peng Hu, Xi Peng, Hongyuan Zhu, Mohamed M. Sabry Aly, Jie Lin
- Abstract summary: We propose a novel One-shot Pruning-Quantization (OPQ) method in this paper.
OPQ analytically solves the compression allocation with pre-trained weight parameters only.
We propose a unified channel-wise quantization method that enforces all channels of each layer to share a common codebook.
- Score: 32.60139548889592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Deep Neural Networks (DNNs) are usually overparameterized and have
millions of weight parameters, it is challenging to deploy these large DNN
models on resource-constrained hardware platforms, e.g., smartphones. Numerous
network compression methods such as pruning and quantization are proposed to
reduce the model size significantly, of which the key is to find suitable
compression allocation (e.g., pruning sparsity and quantization codebook) of
each layer. Existing solutions obtain the compression allocation in an
iterative/manual fashion while finetuning the compressed model, and thus suffer
from poor efficiency. Different from prior art, we propose a novel
One-shot Pruning-Quantization (OPQ) method in this paper, which analytically solves
the compression allocation with pre-trained weight parameters only. During
finetuning, the compression module is fixed and only weight parameters are
updated. To our knowledge, OPQ is the first work to reveal that a pre-trained model
is sufficient for solving pruning and quantization simultaneously, without any
complex iterative/manual optimization at the finetuning stage. Furthermore, we
propose a unified channel-wise quantization method that enforces all channels
of each layer to share a common codebook, which leads to low bit-rate
allocation without the extra overhead introduced by traditional
channel-wise quantization. Comprehensive experiments on ImageNet with
AlexNet/MobileNet-V1/ResNet-50 show that our method improves accuracy and
training efficiency while obtaining significantly higher compression rates
compared to the state-of-the-art.
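To make the two ideas in the abstract concrete, below is a minimal NumPy sketch of layer-wise pruning followed by a unified per-layer codebook, assuming simple magnitude-based pruning and uniform symmetric quantization. The function name and hyperparameters are illustrative only; the paper's analytical derivation of per-layer sparsity and bit allocation from the pre-trained weights is not reproduced here.

```python
# Illustrative sketch only: magnitude pruning + one codebook shared by all
# channels of a layer. Not the paper's analytical allocation procedure.
import numpy as np

def prune_and_quantize_layer(weights, sparsity=0.5, n_bits=4):
    """Prune a layer by magnitude, then quantize all of its channels with a
    single shared (unified) codebook instead of one codebook per channel."""
    w = weights.copy()

    # 1) Pruning: zero out the smallest-magnitude weights of this layer.
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= threshold            # fixed after the one-shot step
    w *= mask

    # 2) Unified channel-wise quantization: one uniform, symmetric codebook
    #    with 2^n_bits levels shared by every channel of the layer.
    levels = 2 ** n_bits
    scale = np.abs(w).max() / (levels / 2 - 1) + 1e-12
    codes = np.clip(np.round(w / scale), -(levels // 2), levels // 2 - 1)
    w_hat = codes * scale                    # de-quantized weights for finetuning

    return w_hat, mask, scale

# Hypothetical usage: during finetuning, `mask` and `scale` stay fixed and only
# the surviving weights are updated.
pretrained = np.random.randn(64, 3, 3, 3)    # e.g. a conv layer: (out, in, kh, kw)
w_hat, mask, scale = prune_and_quantize_layer(pretrained, sparsity=0.6, n_bits=4)
```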
Related papers
- DeepHQ: Learned Hierarchical Quantizer for Progressive Deep Image Coding [27.875207681547074]
Progressive image coding (PIC) aims to compress images at various quality levels into a single bitstream.
Research on neural network (NN)-based PIC is in its early stages.
We propose an NN-based progressive coding method that first utilizes learned quantization step sizes for each quantization layer.
arXiv: 2024-08-22
- Retraining-free Model Quantization via One-Shot Weight-Coupling Learning [41.299675080384]
Mixed-precision quantization (MPQ) is advocated to compress models effectively by allocating heterogeneous bit-widths across layers.
MPQ is typically organized into a searching-retraining two-stage process.
In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression.
arXiv: 2024-01-03
- Rotation Invariant Quantization for Model Compression [7.633595230914364]
Post-training Neural Network (NN) model compression is an attractive approach for deploying large, memory-consuming models on devices with limited memory resources.
We suggest a Rotation-Invariant Quantization (RIQ) technique that utilizes a single parameter to quantize the entire NN model.
arXiv: 2023-03-03
- Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model.
arXiv: 2022-12-10
- Implicit Neural Representations for Image Compression [103.78615661013623]
Implicit Neural Representations (INRs) have gained attention as a novel and effective representation for various data types.
We propose the first comprehensive compression pipeline based on INRs including quantization, quantization-aware retraining and entropy coding.
We find that our approach to source compression with INRs vastly outperforms similar prior work.
arXiv: 2021-12-08
- Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which combines channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv: 2021-05-24
- BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization [32.770842274996774]
Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks.
Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space.
This work proposes bit-level sparsity quantization (BSQ) to tackle the mixed-precision quantization from a new angle of inducing bit-level sparsity.
arXiv: 2021-02-20
- Single-path Bit Sharing for Automatic Loss-aware Model Compression [126.98903867768732]
Single-path Bit Sharing (SBS) is able to significantly reduce computational cost while achieving promising performance.
Our SBS compressed MobileNetV2 achieves 22.6x Bit-Operation (BOP) reduction with only 0.1% drop in the Top-1 accuracy.
arXiv: 2021-01-13
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients are approximated with the Straight-Through Estimator (see the sketch after this list).
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv: 2020-04-15
- Kernel Quantization for Efficient Network Compression [59.55192551370948]
Kernel Quantization (KQ) aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant performance loss.
Inspired by the evolution from weight pruning to filter pruning, we propose to quantize at both the kernel and weight levels.
Experiments on the ImageNet classification task prove that KQ needs 1.05 and 1.62 bits on average in VGG and ResNet18, respectively, to represent each parameter in the convolution layer.
arXiv: 2020-03-11
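The "Training with Quantization Noise for Extreme Model Compression" entry above rests on quantization-aware training with a straight-through estimator. Below is a minimal PyTorch sketch of that baseline, with a simple random-subset variant standing in for the quantization-noise idea; `quantize_ste`, `quant_noise`, and all hyperparameters are illustrative assumptions, not that paper's exact scheme.

```python
import torch

def quantize_ste(w, n_bits=8):
    """Fake-quantize w to n_bits; gradients pass through unchanged (straight-through)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.detach().abs().max() / qmax + 1e-12
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()            # forward: quantized value; backward: identity

def quant_noise(w, p=0.5, n_bits=8):
    """Quantize only a random fraction p of the weights each forward pass;
    the rest stay in full precision (the 'quantization noise' idea)."""
    mask = (torch.rand_like(w) < p).float()
    return mask * quantize_ste(w, n_bits) + (1.0 - mask) * w

# Hypothetical usage inside a training step:
layer = torch.nn.Linear(128, 64)
x = torch.randn(32, 128)
y = torch.nn.functional.linear(x, quant_noise(layer.weight, p=0.5), layer.bias)
loss = y.pow(2).mean()
loss.backward()                              # gradients reach layer.weight via the straight-through path
```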