Pruning Ternary Quantization
- URL: http://arxiv.org/abs/2107.10998v5
- Date: Fri, 14 Jul 2023 22:37:31 GMT
- Title: Pruning Ternary Quantization
- Authors: Dan Liu, Xi Chen, Jie Fu, Chen Ma, Xue Liu
- Abstract summary: Inference time, model size, and accuracy are three key factors in deep model compression.
We propose pruning ternary quantization (PTQ): a simple, effective, symmetric ternary quantization method.
Our method is verified on image classification and object detection/segmentation tasks with different network structures.
- Score: 32.32812780843498
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Inference time, model size, and accuracy are three key factors in deep model
compression. Most of the existing work addresses these three key factors
separately as it is difficult to optimize them all at the same time. For
example, low-bit quantization aims at obtaining a faster model; weight sharing
quantization aims at improving compression ratio and accuracy; and
mixed-precision quantization aims at balancing accuracy and inference time. To
simultaneously optimize bit-width, model size, and accuracy, we propose pruning
ternary quantization (PTQ): a simple, effective, symmetric ternary quantization
method. We integrate L2 normalization, pruning, and the weight decay term to
reduce the weight discrepancy in the gradient estimator during quantization,
thus producing highly compressed ternary weights. Our method achieves both the
highest test accuracy and the highest compression ratio. For example, it
produces a 939KB (49$\times$) 2-bit ternary ResNet-18 model with only a 4\%
accuracy drop on the ImageNet dataset. It compresses a 170MB Mask R-CNN model
to 5MB (34$\times$) with only a 2.8\% average precision drop. Our method is
verified on image classification and object detection/segmentation tasks with
different network structures such as ResNet-18, ResNet-50, and MobileNetV2.
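To make the quantization step concrete, here is a minimal PyTorch sketch of symmetric ternary quantization combined with magnitude pruning and a straight-through gradient estimator. The sparsity level, the thresholding rule on L2-normalized weights, and the per-tensor scale are illustrative assumptions, not the authors' exact PTQ procedure.

```python
import torch

def ternarize(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Map a weight tensor to {-scale, 0, +scale}, pruning the smallest magnitudes."""
    w_n = w / (w.norm() + 1e-12)                   # L2-normalize to stabilize the threshold
    thresh = torch.quantile(w_n.abs(), sparsity)   # prune the smallest `sparsity` fraction
    mask = (w_n.abs() > thresh).float()            # 1 = keep, 0 = pruned
    codes = torch.sign(w) * mask                   # ternary codes in {-1, 0, +1}
    scale = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)  # per-tensor scale
    return codes * scale

def ternary_forward(w: torch.Tensor) -> torch.Tensor:
    """Straight-through estimator: ternary weights in the forward pass,
    identity gradient to the latent full-precision weights."""
    return w + (ternarize(w) - w).detach()

w = torch.randn(64, 64, requires_grad=True)
loss = (ternary_forward(w) @ torch.randn(64, 8)).pow(2).mean()
loss.backward()  # gradients reach the latent full-precision weights
```

After fine-tuning with such a forward pass, only the ternary codes and the per-tensor scale need to be stored, which is where compression ratios of the order reported above come from.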
Related papers
- 2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution [83.09117439860607]
Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment.
Low-bit quantization is known to degrade the accuracy of SR models compared to their full-precision (FP) counterparts.
We present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization.
arXiv Detail & Related papers (2024-06-10T06:06:11Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, particularly for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression at ultra-low precisions, as low as 3 bits.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition, which stores outliers and sensitive weight values in an efficient sparse format (see the sketch after this list).
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- Hyperspherical Quantization: Toward Smaller and More Accurate Models [17.154801913113566]
Vector quantization aims at reducing the model size by indexing model weights with full-precision embeddings.
Binary and other low-precision quantization methods can reduce the model size by up to 32$\times$; however, this comes at the cost of a considerable accuracy drop.
We propose an efficient framework for ternary quantization to produce smaller and more accurate compressed models.
arXiv Detail & Related papers (2022-12-24T04:42:15Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, network orthogonality, which is highly correlated with the loss of the integer programming problem.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- One Model for All Quantization: A Quantized Network Supporting Hot-Swap Bit-Width Adjustment [36.75157407486302]
We propose a method to train a single quantized model that supports diverse bit-widths.
We use wavelet decomposition and reconstruction to increase the diversity of weights.
Our method can achieve accuracy comparable to dedicated models trained at the same precision.
arXiv Detail & Related papers (2021-05-04T08:10:50Z)
- Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator.
We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
arXiv Detail & Related papers (2021-04-20T14:14:03Z)
- n-hot: Efficient bit-level sparsity for powers-of-two neural network quantization [0.0]
Powers-of-two (PoT) quantization reduces the number of bit operations of deep neural networks on resource-constrained hardware.
However, PoT quantization triggers a severe accuracy drop because of its limited representation ability.
We propose an efficient PoT quantization scheme that balances accuracy and costs in a memory-efficient way.
arXiv Detail & Related papers (2021-03-22T10:13:12Z)
- One Weight Bitwidth to Rule Them All [24.373061354080825]
We show that using a single bitwidth for the whole network can achieve better accuracy compared to mixed-precision quantization.
Our results suggest that when the number of channels becomes a target hyperparameter, a single weight bitwidth throughout the network shows superior results for model compression.
arXiv Detail & Related papers (2020-08-22T21:40:22Z)
- APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [49.3037538647714]
We present APQ for efficient deep learning inference on resource-constrained hardware.
Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner.
With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
arXiv Detail & Related papers (2020-06-15T16:09:17Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
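The last entry describes the standard Quantization Aware Training baseline, in which weights are fake-quantized in the forward pass and gradients are passed through unchanged via the Straight-Through Estimator. A minimal sketch of that baseline follows, assuming a symmetric per-tensor uniform quantizer; the extreme-compression extensions proposed in the paper are not reproduced here.

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Uniform symmetric fake quantization with a straight-through gradient."""
    qmax = 2 ** (num_bits - 1) - 1                  # e.g. 127 for 8 bits
    scale = w.abs().max().clamp(min=1e-8) / qmax    # per-tensor scale
    w_q = torch.round(w / scale).clamp(-qmax, qmax) * scale
    return w + (w_q - w).detach()                   # STE: identity gradient

w = torch.randn(128, 128, requires_grad=True)
out = fake_quantize(w) @ torch.randn(128, 16)
out.sum().backward()                                # gradient reaches the full-precision w
```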
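For the Dense-and-Sparse decomposition referenced in the SqueezeLLM entry above, the sketch below illustrates the general idea: a small fraction of outlier weights is pulled out into a sparse full-precision matrix and the remaining dense part is quantized. The outlier fraction and the simple uniform quantizer are assumptions for illustration; the paper's sensitivity-based non-uniform codebooks are not reproduced.

```python
import torch

def dense_and_sparse(w: torch.Tensor, outlier_frac: float = 0.005, num_bits: int = 3):
    """Split W into a quantized dense part plus a sparse outlier part, W ~ D_q + S."""
    k = max(1, int(outlier_frac * w.numel()))
    cutoff = w.abs().flatten().topk(k).values.min()  # magnitude threshold for outliers
    outlier_mask = w.abs() >= cutoff
    sparse = (w * outlier_mask).to_sparse()          # outliers kept in full precision
    dense = w * ~outlier_mask                        # remaining weights, narrow range
    qmax = 2 ** (num_bits - 1) - 1
    scale = dense.abs().max().clamp(min=1e-8) / qmax
    dense_q = torch.round(dense / scale).clamp(-qmax, qmax) * scale
    return dense_q, sparse

w = torch.randn(256, 256)
dense_q, sparse = dense_and_sparse(w)
recon = dense_q + sparse.to_dense()                  # approximate reconstruction of w
max_err = (recon - w).abs().max()                    # error comes only from the dense part
```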
This list is automatically generated from the titles and abstracts of the papers on this site.