Automated Model Compression by Jointly Applied Pruning and Quantization
- URL: http://arxiv.org/abs/2011.06231v1
- Date: Thu, 12 Nov 2020 07:06:29 GMT
- Title: Automated Model Compression by Jointly Applied Pruning and Quantization
- Authors: Wenting Tang, Xingxing Wei, Bo Li
- Abstract summary: In the traditional deep compression framework, iteratively performing network pruning and quantization can reduce the model size and computation cost.
We tackle this issue by integrating network pruning and quantization as a unified joint compression problem and then use AutoML to automatically solve it.
We propose automated model compression by jointly applied pruning and quantization (AJPQ).
- Score: 14.824593320721407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the traditional deep compression framework, iteratively performing network
pruning and quantization can reduce the model size and computation cost to meet
the deployment requirements. However, such a step-wise application of pruning
and quantization may lead to suboptimal solutions and unnecessary time
consumption. In this paper, we tackle this issue by integrating network pruning
and quantization as a unified joint compression problem and then use AutoML to
automatically solve it. We find that the pruning process can be regarded as
channel-wise quantization with 0 bits. Thus, the separate two-step pruning and
quantization can be simplified into one-step quantization with mixed
precision. This unification not only simplifies the compression pipeline but
also avoids compression divergence. To implement this idea, we propose
automated model compression by jointly applied pruning and quantization (AJPQ).
AJPQ is designed with a hierarchical architecture: the layer controller
controls the layer sparsity, and the channel controller decides the bit-width
for each kernel. Following the same importance criterion, the layer controller
and the channel controller collaboratively decide the compression strategy.
With the help of reinforcement learning, our one-step compression is
automatically achieved. Compared with state-of-the-art automated compression
methods, our method obtains better accuracy while reducing storage
considerably. For fixed-precision quantization, AJPQ reduces model size by more
than five times and computation by two times, with a slight performance
increase for Skynet in remote-sensing object detection. When mixed precision is
allowed, AJPQ reduces model size by five times with only a 1.06% top-5 accuracy
decline for MobileNet in the classification task.
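The key unification claimed in the abstract, that pruning a channel is the same decision as assigning it a bit-width of 0, can be sketched in a few lines. The NumPy snippet below is only an illustration under assumed details: the function name, the symmetric uniform quantization grid, and the hand-written bit-width policy are hypothetical stand-ins for what AJPQ's controllers would produce, not the authors' implementation.
```python
# Minimal sketch: per-channel mixed-precision fake quantization where a
# bit-width of 0 prunes the channel (assumed details, not the AJPQ code).
import numpy as np

def quantize_channels(weight: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Fake-quantize a weight tensor of shape (out_channels, ...) with a
    per-output-channel bit-width; 0 bits removes (prunes) the channel, so
    pruning becomes a special case of mixed-precision quantization."""
    out = np.zeros_like(weight)
    for c, b in enumerate(bits):
        if b == 0:
            continue                              # 0 bits == channel pruned
        w = weight[c]
        qmax = 2 ** (int(b) - 1) - 1 or 1         # signed symmetric grid
        scale = np.max(np.abs(w)) / qmax + 1e-12  # avoid divide-by-zero
        out[c] = np.clip(np.round(w / scale), -qmax, qmax) * scale
    return out

# A hypothetical per-channel bit-width policy, standing in for the output of
# the layer/channel controllers: channel 1 gets 0 bits, i.e. it is pruned.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3, 3, 3)).astype(np.float32)
policy = np.array([8, 0, 4, 2])
w_q = quantize_channels(w, policy)
assert not w_q[1].any()          # the 0-bit channel is entirely zeroed out
```
In AJPQ itself, the layer controller would decide how many channels of a layer receive 0 bits (its sparsity) and the channel controller would pick the bit-widths of the remaining kernels, with both ranked by the same importance criterion and searched by reinforcement learning.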
Related papers
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate its memory cost include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Towards Hardware-Specific Automatic Compression of Neural Networks [0.0]
Pruning and quantization are the major approaches to compressing neural networks nowadays.
Effective compression policies consider the influence of the specific hardware architecture on the used compression methods.
We propose an algorithmic framework called Galen that searches for such policies with reinforcement learning, utilizing both pruning and quantization.
arXiv Detail & Related papers (2022-12-15T13:34:02Z) - OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization [32.60139548889592]
In this paper, we propose a novel One-shot Pruning-Quantization (OPQ) method.
OPQ analytically solves the compression allocation with pre-trained weight parameters only.
We propose a unified channel-wise quantization method that enforces all channels of each layer to share a common codebook.
arXiv Detail & Related papers (2022-05-23T09:05:25Z) - An Information Theory-inspired Strategy for Automatic Network Pruning [88.51235160841377]
Deep convolutional neural networks typically need to be compressed for deployment on devices with resource constraints.
Most existing network pruning methods require laborious human effort and prohibitive computation resources.
We propose an information theory-inspired strategy for automatic model compression.
arXiv Detail & Related papers (2021-08-19T07:03:22Z) - Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which combines channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z) - Single-path Bit Sharing for Automatic Loss-aware Model Compression [126.98903867768732]
Single-path Bit Sharing (SBS) is able to significantly reduce computational cost while achieving promising performance.
Our SBS compressed MobileNetV2 achieves 22.6x Bit-Operation (BOP) reduction with only 0.1% drop in the Top-1 accuracy.
arXiv Detail & Related papers (2021-01-13T08:28:21Z) - Differentiable Joint Pruning and Quantization for Hardware Efficiency [16.11027058505213]
DJPQ incorporates variational information bottleneck based structured pruning and mixed-bit precision quantization into a single differentiable loss function.
We show that DJPQ significantly reduces the number of Bit-Operations (BOPs) for several networks while maintaining the top-1 accuracy of original floating-point models.
arXiv Detail & Related papers (2020-07-20T20:45:47Z) - Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z) - Kernel Quantization for Efficient Network Compression [59.55192551370948]
Kernel Quantization (KQ) aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant performance loss.
Inspired by the evolution from weight pruning to filter pruning, we propose to quantize at both the kernel and weight levels.
Experiments on the ImageNet classification task prove that KQ needs 1.05 and 1.62 bits on average in VGG and ResNet18, respectively, to represent each parameter in the convolution layer.
arXiv Detail & Related papers (2020-03-11T08:00:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.