Analysis of Quantization on MLP-based Vision Models
- URL: http://arxiv.org/abs/2209.06383v1
- Date: Wed, 14 Sep 2022 02:55:57 GMT
- Title: Analysis of Quantization on MLP-based Vision Models
- Authors: Lingran Zhao, Zhen Dong, Kurt Keutzer
- Abstract summary: Quantization is a widely used model compression technique that obtains efficient models by converting the floating-point weights and activations of a neural network into lower-bit integers.
We show in the paper that directly applying quantization to MLP-based models leads to significant accuracy degradation.
- Score: 36.510879540365636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization is widely used as a model compression technique, which obtains
efficient models by converting floating-point weights and activations in the
neural network into lower-bit integers. Quantization has been proven to work
well on convolutional neural networks and transformer-based models. Despite the
success of these models, recent works have shown that MLP-based models are able
to achieve comparable results on various tasks ranging from computer vision and
NLP to 3D point clouds, while achieving higher throughput due to their parallelism
and network simplicity. However, as we show in the paper, directly applying
quantization to MLP-based models will lead to significant accuracy degradation.
Based on our analysis, two major issues account for the accuracy gap: 1) the
range of activations in MLP-based models can be too large to quantize, and 2)
specific components in the MLP-based models are sensitive to quantization.
Consequently, we propose to 1) apply LayerNorm to control the quantization
range of activations, 2) utilize bounded activation functions, 3) apply
percentile quantization on activations, 4) use our improved module named
multiple token-mixing MLPs, and 5) apply linear asymmetric quantizer for
sensitive operations. Equipped with the abovementioned techniques, our Q-MLP
models can achieve 79.68% accuracy on ImageNet with 8-bit uniform quantization
(model size 30 MB) and 78.47% with 4-bit quantization (15 MB).
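As a concrete illustration of two of the listed techniques (percentile quantization of activations and a linear asymmetric quantizer), the snippet below is a minimal NumPy sketch, not the authors' implementation; the function names and the 0.1/99.9 percentile choice are illustrative assumptions.

```python
import numpy as np

def percentile_range(x: np.ndarray, lo_pct: float = 0.1, hi_pct: float = 99.9):
    """Clip the quantization range to percentiles instead of the raw min/max,
    so a handful of extreme activations does not inflate the quantization step."""
    return np.percentile(x, lo_pct), np.percentile(x, hi_pct)

def linear_asymmetric_quantize(x: np.ndarray, lo: float, hi: float,
                               num_bits: int = 8) -> np.ndarray:
    """Uniform asymmetric quantization: map [lo, hi] onto [0, 2^b - 1] with a
    scale and a zero-point, then dequantize back to float (fake quantization)."""
    qmax = 2 ** num_bits - 1
    scale = (hi - lo) / qmax
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

# Long-tailed activations, as described for MLP-based models above.
acts = np.concatenate([np.random.randn(10_000), np.array([40.0, -35.0])])
lo, hi = percentile_range(acts)
fq = linear_asymmetric_quantize(acts, lo, hi, num_bits=8)
print(f"clipped range: ({lo:.2f}, {hi:.2f}), mean abs error: {np.abs(acts - fq).mean():.4f}")
```

Without the percentile step, the two injected outliers would stretch the 8-bit grid over roughly [-35, 40], making the quantization step about an order of magnitude coarser for the bulk of the activations.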
Related papers
- GWQ: Gradient-Aware Weight Quantization for Large Language Models [61.17678373122165]
Gradient-aware weight quantization (GWQ) is the first low-bit weight quantization approach that leverages gradients to localize outliers.
GWQ preferentially retains the weights corresponding to the top 1% of outliers at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format (see the sketch after this list).
On zero-shot tasks, GWQ-quantized models achieve higher accuracy than other quantization methods.
arXiv Detail & Related papers (2024-10-30T11:16:04Z)
- CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression [6.859010157930106]
Post-Training Quantization (PTQ) is an effective technique for compressing Large Language Models (LLMs).
The "quantization kernel" refers to the set of activation elements that are quantized to zero.
We propose CrossQuant: a simple yet effective method for quantizing activations with a smaller quantization kernel (see the sketch after this list).
arXiv Detail & Related papers (2024-10-10T00:44:24Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers [38.23587031169402]
We propose LLM-FP4 for quantizing both weights and activations in large language models (LLMs) down to 4-bit floating-point values.
Compared to integer quantization, floating-point (FP) quantization is more flexible and can better handle long-tail or bell-shaped distributions.
Our method, for the first time, can quantize both weights and activations of LLaMA-13B to only 4 bits and achieves an average score of 63.1.
arXiv Detail & Related papers (2023-10-25T17:59:32Z)
- ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation [24.34969722921442]
Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs).
We conduct a comprehensive analysis of these factors by investigating the effects of PTQ on weight-only, activation-only, and weight-and-activation quantization.
We propose an optimized method called Low-Rank Compensation (LoRC) to enhance model quality recovery with a minimal increase in model size (see the sketch after this list).
arXiv Detail & Related papers (2023-03-15T01:27:15Z)
- Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search [7.392278887917975]
Mixed-precision quantization allows different tensors to be quantized to varying levels of numerical precision.
We evaluate our method on computer vision and natural language processing and demonstrate latency reductions of up to 27.59% and 34.31%, respectively.
arXiv Detail & Related papers (2023-02-02T19:30:00Z)
- Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically obtain a network of any precision for on-demand service while only needing to train and maintain one model.
arXiv Detail & Related papers (2022-12-10T15:57:38Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models achieve superior performance on most NLP tasks thanks to their large parameter capacity, but this also leads to huge computation costs.
We explore accelerating large-model inference via conditional computation based on the sparse-activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Learnable Companding Quantization for Accurate Low-bit Neural Networks [3.655021726150368]
Quantizing deep neural networks is an effective method for reducing memory consumption and improving inference speed.
It is still hard for extremely low-bit models to achieve accuracy comparable with that of full-precision models.
We propose learnable companding quantization (LCQ) as a novel non-uniform quantization method for 2-, 3-, and 4-bit models.
arXiv Detail & Related papers (2021-03-12T09:06:52Z)
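The three sketches below are hypothetical NumPy illustrations of mechanisms named in the related-papers list above. They are not code from those papers; all function names, parameter values, and simplifications are assumptions made for illustration only.

A GWQ-style split of weights into FP16 outliers and low-bit non-outliers. GWQ itself selects outliers with gradient information; plain magnitude is used here only as a stand-in.

```python
import numpy as np

def outlier_preserving_quantize(w: np.ndarray, outlier_frac: float = 0.01,
                                num_bits: int = 4) -> np.ndarray:
    """Keep the top `outlier_frac` largest-magnitude weights in FP16 and
    uniformly quantize the rest to a symmetric low-bit grid."""
    flat = w.astype(np.float32).ravel().copy()
    k = max(1, int(outlier_frac * flat.size))
    is_outlier = np.zeros(flat.size, dtype=bool)
    is_outlier[np.argsort(np.abs(flat))[-k:]] = True

    # Symmetric uniform quantization for the non-outlier weights.
    rest = flat[~is_outlier]
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(np.abs(rest).max(), 1e-8) / qmax
    flat[~is_outlier] = np.clip(np.round(rest / scale), -qmax, qmax) * scale

    # Outliers stay at FP16 precision (round-tripped here for illustration).
    flat[is_outlier] = flat[is_outlier].astype(np.float16).astype(np.float32)
    return flat.reshape(w.shape)
```

A helper that measures the "quantization kernel" as defined in the CrossQuant entry, i.e. the fraction of activation elements that a symmetric quantizer maps to zero.

```python
import numpy as np

def quantization_kernel_fraction(x: np.ndarray, scale: float,
                                 num_bits: int = 8) -> float:
    """Fraction of activations quantized to zero for a given scale/bit-width."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(x / scale), qmin, qmax)
    return float((q == 0).mean())
```

A Low-Rank Compensation (LoRC) style correction: approximate the quantization error E = W - W_q with a rank-r factorization U @ V and ship the two small factors alongside the quantized weight.

```python
import numpy as np

def low_rank_compensation(w: np.ndarray, w_q: np.ndarray, rank: int = 8):
    """Return the compensated weight W_q + U @ V and the factors (U, V)."""
    err = w - w_q                                    # quantization error matrix
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    U = u[:, :rank] * s[:rank]                       # shape (out_dim, rank)
    V = vt[:rank, :]                                 # shape (rank, in_dim)
    return w_q + U @ V, (U, V)
```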
This list is automatically generated from the titles and abstracts of the papers in this site.