Neural Network Compression using Binarization and Few Full-Precision Weights
- URL: http://arxiv.org/abs/2306.08960v2
- Date: Fri, 15 Sep 2023 12:13:30 GMT
- Title: Neural Network Compression using Binarization and Few Full-Precision Weights
- Authors: Franco Maria Nardini, Cosimo Rulli, Salvatore Trani, Rossano Venturini
- Abstract summary: Automatic Prune Binarization (APB) is a novel compression technique combining quantization with pruning.
APB enhances the representational capability of binary networks using a few full-precision weights.
APB delivers better accuracy/memory trade-off compared to state-of-the-art methods.
- Score: 7.206962876422061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization and pruning are two effective compression methods for Deep
Neural Networks. In this paper, we propose Automatic Prune Binarization
(APB), a novel compression technique combining quantization with pruning. APB
enhances the representational capability of binary networks using a few
full-precision weights. Our technique jointly maximizes the accuracy of the
network while minimizing its memory impact by deciding whether each weight
should be binarized or kept in full precision. We show how to efficiently
perform a forward pass through layers compressed using APB by decomposing it
into a binary and a sparse-dense matrix multiplication. Moreover, we design two
novel efficient algorithms for extremely quantized matrix multiplication on
CPU, leveraging highly efficient bitwise operations. The proposed algorithms
are 6.9x and 1.5x faster than available state-of-the-art solutions. We
extensively evaluate APB on two widely adopted model compression datasets,
namely CIFAR10 and ImageNet. APB delivers better accuracy/memory trade-off
compared to state-of-the-art methods based on i) quantization, ii) pruning, and
iii) combination of pruning and quantization. APB outperforms quantization in
the accuracy/efficiency trade-off, being up to 2x faster than the 2-bit
quantized model with no loss in accuracy.
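As a concrete illustration of the decomposition described in the abstract, the NumPy sketch below splits a weight matrix into a scaled binary part plus a correction holding the few full-precision weights, and shows the popcount identity that bitwise binary kernels typically exploit. The magnitude-threshold splitting rule, the scale estimate, and all function names are illustrative assumptions; the paper's actual selection criterion, sparse storage format, and optimized CPU kernels are not reproduced here.

```python
import numpy as np

def apb_split(W, threshold):
    """Toy APB-style split of a weight matrix W (illustrative assumption).

    Weights with magnitude above `threshold` are kept in full precision;
    the rest are binarized to alpha * sign(W). Returns the scale alpha,
    a dense {-1, +1} matrix B, and a correction matrix S that is non-zero
    only at the kept positions, so alpha * B + S matches W exactly there.
    A real implementation would store S in a sparse format such as CSR.
    """
    keep = np.abs(W) > threshold
    B = np.where(W >= 0.0, 1.0, -1.0)
    alpha = np.abs(W[~keep]).mean() if np.any(~keep) else 1.0
    S = np.where(keep, W - alpha * B, 0.0)
    return alpha, B, S

def apb_forward(x, alpha, B, S):
    """Forward pass decomposed into a binary and a (nominally sparse) matmul."""
    return alpha * (x @ B) + x @ S

def binary_dot(a_bits, b_bits, n):
    """Popcount identity behind bitwise binary kernels: for two {-1, +1}
    vectors of length n packed as bits (1 -> +1, 0 -> -1, padding bits equal
    in both operands), <a, b> = n - 2 * popcount(a XOR b)."""
    xor = np.bitwise_xor(a_bits, b_bits)
    return n - 2 * int(np.unpackbits(xor.view(np.uint8)).sum())

# Example usage with random data (hypothetical shapes).
W = np.random.randn(256, 256)
alpha, B, S = apb_split(W, threshold=2.0)   # keeps only the largest weights in full precision
y = apb_forward(np.random.randn(8, 256), alpha, B, S)
```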
Related papers
- Quantization-free Lossy Image Compression Using Integer Matrix Factorization [8.009813033356478]
We introduce a variant of integer matrix factorization (IMF) to develop a novel quantization-free lossy image compression method.
IMF provides a low-rank representation of the image data as a product of two smaller factor matrices with bounded integer elements.
Our method consistently outperforms JPEG at low bit rates below 0.25 bits per pixel (bpp) and remains comparable at higher bit rates.
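A minimal sketch of the decoding side of such an integer-matrix-factorization codec: the decoder only needs the two small bounded-integer factors and multiplies them back together, with no quantization step. The rank, integer bound, and random factor contents below are illustrative assumptions; how the factors are actually fitted is the paper's contribution and is not shown.

```python
import numpy as np

rank = 16                                                       # assumed low rank
U = np.random.randint(-8, 8, size=(512, rank), dtype=np.int8)   # bounded integer factors
V = np.random.randint(-8, 8, size=(rank, 512), dtype=np.int8)

# Reconstruct the image block as the product of the two integer factors
# (cast up to avoid int8 overflow during accumulation).
X_hat = U.astype(np.int32) @ V.astype(np.int32)
```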
arXiv Detail & Related papers (2024-08-22T19:08:08Z)
- 2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution [83.09117439860607]
Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment.
Low-bit quantization is known to degrade the accuracy of SR models compared to their full-precision (FP) counterparts.
We present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization.
arXiv Detail & Related papers (2024-06-10T06:06:11Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
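For context, the sketch below shows the reconstruction step of additive quantization in general: each group of weights is rebuilt as the sum of one codeword from each of several codebooks. The codebook count, sizes, and random contents are illustrative assumptions, not AQLM's learned values.

```python
import numpy as np

M, K, d = 2, 256, 8                               # codebooks, codewords each, group size (assumed)
codebooks = np.random.randn(M, K, d)              # learned in practice; random here
codes = np.random.randint(0, K, size=(1000, M))   # per-group codeword indices

# Each weight group is reconstructed as the sum of its selected codewords,
# one drawn from each codebook.
groups = codebooks[np.arange(M), codes].sum(axis=1)   # shape (1000, d)
```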
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- AdaBin: Improving Binary Neural Networks with Adaptive Binary Sets [27.022212653067367]
This paper studies Binary Neural Networks (BNNs), in which both weights and activations are binarized to 1-bit values.
We present a simple yet effective approach called AdaBin to adaptively obtain the optimal binary sets.
Experimental results on benchmark models and datasets demonstrate that the proposed AdaBin is able to achieve state-of-the-art performance.
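As a hedged illustration of an adaptive binary set, the sketch below binarizes a tensor to two values {beta - alpha, beta + alpha} instead of the fixed {-1, +1}. Choosing beta and alpha as the tensor's mean and standard deviation is one simple analytic option assumed here for illustration, not necessarily the paper's fitting rule.

```python
import numpy as np

def adaptive_binarize(w):
    beta = w.mean()    # center of the binary set (assumed analytic choice)
    alpha = w.std()    # spread of the binary set (assumed analytic choice)
    return np.where(w >= beta, beta + alpha, beta - alpha)
```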
arXiv Detail & Related papers (2022-08-17T05:43:33Z)
- OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization [32.60139548889592]
We propose a novel One-shot Pruning-Quantization (OPQ) in this paper.
OPQ analytically solves the compression allocation with pre-trained weight parameters only.
We propose a unified channel-wise quantization method that enforces all channels of each layer to share a common codebook.
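The sketch below illustrates the storage layout such a shared codebook implies: one small set of codewords per layer, with every channel's weights mapped to their nearest codeword, so only the codes and a single codebook per layer need to be stored. The uniform codebook construction is an assumption for illustration, not the paper's allocation procedure.

```python
import numpy as np

def quantize_layer_shared_codebook(W, n_bits=4):
    # One codebook per layer: uniformly spaced levels spanning the layer's range.
    levels = np.linspace(W.min(), W.max(), 2 ** n_bits)
    # Assign every weight, in every channel, to its nearest shared codeword.
    codes = np.abs(W[..., None] - levels).argmin(axis=-1)
    return levels[codes], codes, levels   # dequantized weights, codes, codebook
```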
arXiv Detail & Related papers (2022-05-23T09:05:25Z)
- Mixed Precision of Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments are conducted on Penn Treebank (PTB) and on a Switchboard corpus-trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
- Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities [80.06422693778141]
We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps).
We refer to this algorithm as Dynamic Probabilistic Pruning (DPP).
We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification.
arXiv Detail & Related papers (2021-05-26T17:01:52Z)
- Single-path Bit Sharing for Automatic Loss-aware Model Compression [126.98903867768732]
Single-path Bit Sharing (SBS) is able to significantly reduce computational cost while achieving promising performance.
Our SBS compressed MobileNetV2 achieves 22.6x Bit-Operation (BOP) reduction with only 0.1% drop in the Top-1 accuracy.
arXiv Detail & Related papers (2021-01-13T08:28:21Z)
- Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy.
We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR.
Our FQSR with low-bit quantization achieves performance on par with its full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z)
- Differentiable Joint Pruning and Quantization for Hardware Efficiency [16.11027058505213]
DJPQ incorporates variational information bottleneck based structured pruning and mixed-bit precision quantization into a single differentiable loss function.
We show that DJPQ significantly reduces the number of Bit-Operations (BOPs) for several networks while maintaining the top-1 accuracy of original floating-point models.
arXiv Detail & Related papers (2020-07-20T20:45:47Z)
- Efficient Bitwidth Search for Practical Mixed Precision Neural Network [33.80117489791902]
Network quantization has rapidly become one of the most widely used methods to compress and accelerate deep neural networks.
Recent efforts propose to quantize weights and activations from different layers with different precision to improve the overall performance.
It is challenging to find the optimal bitwidth (i.e., precision) for weights and activations of each layer efficiently.
It is yet unclear how to perform convolution for weights and activations of different precision efficiently on generic hardware platforms.
arXiv Detail & Related papers (2020-03-17T08:27:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.