Automatic Pruning for Quantized Neural Networks
- URL: http://arxiv.org/abs/2002.00523v1
- Date: Mon, 3 Feb 2020 01:10:13 GMT
- Title: Automatic Pruning for Quantized Neural Networks
- Authors: Luis Guerra, Bohan Zhuang, Ian Reid, Tom Drummond
- Abstract summary: We propose an effective pruning strategy for selecting redundant low-precision filters.
We conduct extensive experiments on CIFAR-10 and ImageNet with various architectures and precisions.
For ResNet-18 on ImageNet, we prune 26.12% of the model size with Binarized Neural Network quantization.
- Score: 35.2752928147013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural network quantization and pruning are two techniques commonly
used to reduce the computational complexity and memory footprint of neural
networks for deployment. However, most existing pruning strategies operate on
full-precision networks and cannot be directly applied to the discrete
parameter distributions that arise after quantization. In contrast, we study a
combination of these two techniques to
achieve further network compression. In particular, we propose an effective
pruning strategy for selecting redundant low-precision filters. Furthermore, we
leverage Bayesian optimization to efficiently determine the pruning ratio for
each layer. We conduct extensive experiments on CIFAR-10 and ImageNet with
various architectures and precisions. In particular, for ResNet-18 on ImageNet,
we prune 26.12% of the model size with Binarized Neural Network quantization,
achieving a top-1 classification accuracy of 47.32% in a model of 2.47 MB and
59.30% with a 2-bit DoReFa-Net in 4.36 MB.
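As a rough illustration of the two ingredients above, the Python sketch below scores binarized filters by pairwise Hamming distance and zeroes out the closest near-duplicates; the per-layer pruning ratio would then be chosen by Bayesian optimization over validation accuracy (omitted here). The redundancy criterion and the helper names (redundancy_scores, prune_filters) are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def redundancy_scores(conv_weight: torch.Tensor) -> torch.Tensor:
    """Score each output filter by its Hamming distance to the nearest
    other filter after binarization: near-duplicates score low and are
    pruned first. Illustrative criterion, not necessarily the paper's."""
    n = conv_weight.shape[0]
    signs = torch.sign(conv_weight.reshape(n, -1))          # filters in {-1, +1}
    dists = (signs.unsqueeze(0) != signs.unsqueeze(1)).sum(-1).float()
    dists.fill_diagonal_(float("inf"))                      # ignore self-distance
    return dists.min(dim=1).values

def prune_filters(conv_weight: torch.Tensor, ratio: float) -> torch.Tensor:
    """Zero out the `ratio` fraction of filters with the lowest scores."""
    k = int(ratio * conv_weight.shape[0])
    if k == 0:
        return conv_weight
    idx = redundancy_scores(conv_weight).argsort()[:k]      # most redundant
    pruned = conv_weight.clone()
    pruned[idx] = 0.0
    return pruned

# Example: prune 25% of a hypothetical 64-filter 3x3 convolution.
w = torch.randn(64, 32, 3, 3)
w_pruned = prune_filters(w, ratio=0.25)
print((w_pruned.reshape(64, -1).abs().sum(-1) == 0).sum().item(), "filters removed")
```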
Related papers
- Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks [10.229120811024162]
The resource demands of deep neural networks (DNNs) pose significant challenges to their deployment on edge devices.
Common approaches to address this issue are pruning and mixed-precision quantization.
We propose a novel methodology to apply them jointly via a lightweight gradient-based search.
arXiv Detail & Related papers (2024-07-01T08:07:02Z)
- FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search [50.07268323597872]
We propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models.
With integer models, we increase the accuracy of ResNet-18 on ImageNet by 1.31% and ResNet-50 by 0.90% with equivalent model cost over previous methods.
For the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% compared to prior state-of-the-art FP8 models.
arXiv Detail & Related papers (2023-08-07T04:17:19Z)
- Resource Efficient Neural Networks Using Hessian Based Pruning [7.042897867094235]
We modify the existing approach by estimating the Hessian trace using FP16 precision instead of FP32.
Our modified approach can achieve speed ups ranging from 17% to as much as 44% during our experiments on different combinations of model architectures and GPU devices.
We also present the results of pruning using both FP16 and FP32 Hessian trace calculation and show that there are no noticeable accuracy differences between the two.
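Hessian-trace pruning of this kind typically rests on Hutchinson's estimator, trace(H) ≈ E[v^T H v] for random ±1 vectors v. A minimal PyTorch sketch follows; the FP16 variant is modeled here as casting the Hessian-vector products before accumulating the dot products, which may differ from the paper's exact implementation.

```python
import torch

def hessian_trace_hutchinson(loss, params, n_samples=64, dtype=torch.float16):
    """Hutchinson estimator of trace(H) via autograd Hessian-vector
    products; v^T (Hv) is accumulated in `dtype` (FP16 by default)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimate = 0.0
    for _ in range(n_samples):
        vs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]  # Rademacher
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params, retain_graph=True)
        estimate += sum((hv.to(dtype) * v.to(dtype)).sum().item()
                        for hv, v in zip(hvs, vs))
    return estimate / n_samples

# Usage on a tiny model:
model = torch.nn.Linear(10, 2)
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
print(hessian_trace_hutchinson(loss, list(model.parameters())))
```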
arXiv Detail & Related papers (2023-06-12T11:09:16Z)
- Automatic Network Adaptation for Ultra-Low Uniform-Precision Quantization [6.1664476076961146]
Uniform-precision neural network quantization has gained popularity because it simplifies the densely packed arithmetic units needed for high computing capability.
However, it ignores the heterogeneous sensitivity to quantization errors across layers, resulting in sub-optimal inference accuracy.
This work proposes a novel neural architecture search called neural channel expansion that adjusts the network structure to alleviate accuracy degradation from ultra-low uniform-precision quantization.
arXiv Detail & Related papers (2022-12-21T09:41:25Z)
- Compact representations of convolutional neural networks via weight pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We achieve a reduction in space occupancy of up to 0.6% on fully connected layers and 5.44% on the whole network, while performing at least as competitively as the baseline.
arXiv Detail & Related papers (2021-08-28T20:39:54Z)
- Manifold Regularized Dynamic Network Pruning [102.24146031250034]
This paper proposes a new paradigm that dynamically removes redundant filters by embedding the manifold information of all instances into the space of pruned networks.
The effectiveness of the proposed method is verified on several benchmarks, which shows better performance in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2021-03-10T03:59:03Z)
- Single-path Bit Sharing for Automatic Loss-aware Model Compression [126.98903867768732]
Single-path Bit Sharing (SBS) is able to significantly reduce computational cost while achieving promising performance.
Our SBS-compressed MobileNetV2 achieves a 22.6x Bit-Operation (BOP) reduction with only a 0.1% drop in Top-1 accuracy.
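BOPs are commonly defined as a layer's multiply-accumulate count scaled by the weight and activation bit-widths; a small worked example under that assumed definition:

```python
def conv_bops(c_in, c_out, k, out_h, out_w, w_bits, a_bits):
    """Bit-operations of a conv layer: MACs x weight-bits x activation-bits
    (a common definition in the mixed-precision literature)."""
    macs = c_in * c_out * k * k * out_h * out_w
    return macs * w_bits * a_bits

# A 3x3 conv, 64 -> 64 channels on a 56x56 feature map:
fp32 = conv_bops(64, 64, 3, 56, 56, 32, 32)
w4a4 = conv_bops(64, 64, 3, 56, 56, 4, 4)
print(f"BOP reduction: {fp32 / w4a4:.1f}x")   # 64.0x
```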
arXiv Detail & Related papers (2021-01-13T08:28:21Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose to treat the discrete weights in an arbitrary quantized neural network as searchable variables, and utilize a differentiable method to search for them accurately.
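A generic way to realize this is to hold a learnable distribution over a small codebook of low-bit values and relax the discrete choice with a softmax during training; the sketch below illustrates the idea (the class name, codebook, and temperature are assumptions, and the paper's exact parameterization may differ).

```python
import torch
import torch.nn as nn

class SearchableQuantWeight(nn.Module):
    """Relax each discrete weight into a softmax over a low-bit codebook
    so the quantized value can be learned by gradient descent."""
    def __init__(self, shape, codebook=(-1.0, 0.0, 1.0), tau=1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(*shape, len(codebook)))
        self.register_buffer("codebook", torch.tensor(codebook))
        self.tau = tau

    def forward(self):
        probs = torch.softmax(self.logits / self.tau, dim=-1)
        return probs @ self.codebook                       # soft, differentiable weight

    def discretize(self):
        return self.codebook[self.logits.argmax(dim=-1)]   # hard weight for inference

w = SearchableQuantWeight((16, 8))   # e.g. a 16x8 linear weight
soft = w()                           # used during training
hard = w.discretize()                # used at deployment
```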
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- Differentiable Joint Pruning and Quantization for Hardware Efficiency [16.11027058505213]
DJPQ incorporates variational information bottleneck-based structured pruning and mixed-bit precision quantization into a single differentiable loss function.
We show that DJPQ significantly reduces the number of Bit-Operations (BOPs) for several networks while maintaining the top-1 accuracy of original floating-point models.
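Schematically, such a single differentiable objective adds a bit-operation proxy, built from soft channel gates and (possibly learnable) bit-widths, to the task loss. The sketch below is a simplified stand-in: plain sigmoid gates replace DJPQ's variational information bottleneck gates, and the penalty is a generic BOP proxy rather than the paper's exact formulation.

```python
import torch

def joint_loss(task_loss, gate_probs, w_bits, a_bits, macs_per_channel, lam=1e-9):
    """Task loss plus a differentiable bit-operation proxy: expected
    surviving channels x MACs per channel x weight-bits x activation-bits."""
    bops_proxy = gate_probs.sum() * macs_per_channel * w_bits * a_bits
    return task_loss + lam * bops_proxy

gate_logits = torch.randn(64, requires_grad=True)
gates = torch.sigmoid(gate_logits)                 # soft channel keep-probabilities
loss = joint_loss(torch.tensor(0.7), gates, w_bits=4.0, a_bits=4.0,
                  macs_per_channel=3 * 3 * 64 * 56 * 56)
loss.backward()    # gradients reach gate_logits through the soft gates
```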
arXiv Detail & Related papers (2020-07-20T20:45:47Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in the original full-precision networks onto high-dimensional quantization features.
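Read literally, this suggests computing features in a widened, quantization-friendly space and projecting back to the original width; the module below sketches only that widen-then-squeeze structure (the quantizer is omitted, and this should not be taken as the paper's actual method).

```python
import torch.nn as nn

class WidenSqueeze(nn.Module):
    """Widen a conv's output channels (where quantization would be applied),
    then squeeze back to the target width with a 1x1 projection."""
    def __init__(self, c_in, c_out, widen=4):
        super().__init__()
        self.wide = nn.Conv2d(c_in, c_out * widen, 3, padding=1)  # would be quantized
        self.squeeze = nn.Conv2d(c_out * widen, c_out, 1)         # 1x1 projection

    def forward(self, x):
        return self.squeeze(self.wide(x))

block = WidenSqueeze(64, 64)   # drop-in stand-in for a 64->64 3x3 conv
```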
arXiv Detail & Related papers (2020-02-03T04:11:13Z)