Mixed-Precision Inference Quantization: Radically Towards Faster
inference speed, Lower Storage requirement, and Lower Loss
- URL: http://arxiv.org/abs/2207.10083v1
- Date: Wed, 20 Jul 2022 10:55:34 GMT
- Title: Mixed-Precision Inference Quantization: Radically Towards Faster
inference speed, Lower Storage requirement, and Lower Loss
- Authors: Daning Cheng, Wenguang Chen
- Abstract summary: Existing quantization techniques rely heavily on experience and "fine-tuning" skills.
This study provides a methodology for acquiring a mixed-precision quantization model with a lower loss than the full-precision model.
In particular, we will demonstrate that neural networks with massive identity mappings are resistant to the quantization method.
- Score: 4.877532217193618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Because models are resilient to computational noise, model quantization is
important for compressing models and improving computing speed. Existing
quantization techniques rely heavily on experience and "fine-tuning" skills. In
the majority of instances, the quantized model has a larger loss than the
full-precision model. This study provides a methodology for acquiring a
mixed-precision quantization model with a lower loss than the full-precision
model. In addition, the analysis demonstrates that, throughout the inference
process, the loss function is mostly affected by the noise of the layer inputs.
In particular, we demonstrate that neural networks with massive identity
mappings are resistant to the quantization method, and that it is also difficult
to improve the performance of these networks using quantization.
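To make the layer-input-noise claim concrete, here is a minimal PyTorch sketch (not the authors' code; the toy model, data, and 4-bit setting are illustrative assumptions) that injects round-to-nearest quantization noise into one layer's input at a time and records how much the loss moves:

```python
# Minimal sketch: probe how much the loss changes when uniform quantization
# noise is injected at each layer's *input*, the quantity the paper argues
# dominates the loss during inference.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 10))
x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))
criterion = nn.CrossEntropyLoss()

def input_noise_hook(bits):
    # Simulate symmetric round-to-nearest quantization of the layer input.
    def hook(module, inputs):
        (a,) = inputs
        scale = a.abs().max() / (2 ** (bits - 1) - 1)
        return ((a / scale).round() * scale,)
    return hook

with torch.no_grad():
    base = criterion(model(x), y).item()
    for idx, layer in enumerate(model):
        if not isinstance(layer, nn.Linear):
            continue
        handle = layer.register_forward_pre_hook(input_noise_hook(bits=4))
        noisy = criterion(model(x), y).item()
        handle.remove()
        print(f"layer {idx}: loss {base:.4f} -> {noisy:.4f}")
```

Layers whose input noise barely moves the loss are natural candidates for lower precision, which is the intuition behind a mixed-precision assignment.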
Related papers
- QGen: On the Ability to Generalize in Quantization Aware Training [35.0485699853394]
Quantization lowers memory usage, computational requirements, and latency by utilizing fewer bits to represent model weights and activations.
We develop a theoretical model for quantization in neural networks and demonstrate how quantization functions as a form of regularization.
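As a rough illustration of the regularization view (a generic quantization-aware-training sketch, not QGen's implementation; the 8-bit setting and `QuantLinear` name are assumptions), fake-quantizing weights in the forward pass with a straight-through estimator exposes training to a bounded quantization perturbation:

```python
# Generic QAT sketch: weights are fake-quantized in the forward pass, so training
# sees the quantization error as a bounded perturbation -- the regularization view.
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, bits):
        # Symmetric round-to-nearest quantization of the weights.
        scale = w.abs().max() / (2 ** (bits - 1) - 1)
        return (w / scale).round() * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: gradients pass as if quantization were identity.
        return grad_out, None

class QuantLinear(torch.nn.Linear):
    def forward(self, x):
        return torch.nn.functional.linear(x, FakeQuant.apply(self.weight, 8), self.bias)
```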
arXiv Detail & Related papers (2024-04-17T21:52:21Z) - On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z) - Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision
Post-Training Quantization [7.392278887917975]
We propose a mixed-precision post training quantization approach that assigns different numerical precisions to tensors in a network based on their specific needs.
Our experiments demonstrate latency reductions of 25.48%, 21.69%, and 33.28%, respectively, compared to a 16-bit baseline.
arXiv Detail & Related papers (2023-06-08T02:18:58Z) - Ternary Quantization: A Survey [12.90416661059601]
Inference time, model size, and accuracy are critical for deploying deep neural network models.
We review the evolution of ternary quantization and investigate the relationships among existing ternary quantization methods.
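To make the term concrete, here is one common ternary rule (a threshold-and-scale formulation in the spirit of ternary weight networks; the 0.7 threshold factor is a conventional choice, not something stated here) that maps weights to {-alpha, 0, +alpha}:

```python
# Sketch of a common ternary quantization rule (threshold + scale); the survey
# covers many variants beyond this one.
import torch

def ternarize(w, t=0.7):
    delta = t * w.abs().mean()                                 # threshold
    mask = (w.abs() > delta).float()                           # weights kept non-zero
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1)   # shared scaling factor
    return alpha * torch.sign(w) * mask                        # values in {-alpha, 0, +alpha}
```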
arXiv Detail & Related papers (2023-03-02T03:38:51Z) - Q-Diffusion: Quantizing Diffusion Models [52.978047249670276]
Post-training quantization (PTQ) is considered a go-to compression method for other tasks.
We propose a novel PTQ method specifically tailored to the unique multi-timestep pipeline and model architecture of diffusion models.
We show that our proposed method is able to quantize full-precision unconditional diffusion models to 4 bits while maintaining comparable performance.
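As a hedged sketch of why the multi-timestep pipeline matters for calibration (a generic illustration, not Q-Diffusion's actual procedure; `denoiser`, `sample_xt`, and `layer` are hypothetical stand-ins): activation ranges drift across denoising steps, so calibration statistics should be pooled over several timesteps rather than taken from a single one.

```python
# Generic illustration: pool a layer's activation statistics over several
# denoising timesteps to pick one shared clipping range for PTQ.
import torch

def calibrate_range(denoiser, sample_xt, timesteps, layer):
    acts = []
    hook = layer.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
    with torch.no_grad():
        for t in timesteps:            # e.g. a spread such as [999, 750, 500, 250, 1]
            denoiser(sample_xt(t), t)  # run a noisy input drawn for step t
    hook.remove()
    pooled = torch.cat([a.flatten() for a in acts])
    return pooled.abs().max()          # shared clipping range for this layer
```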
arXiv Detail & Related papers (2023-02-08T19:38:59Z) - Mixed Precision Post Training Quantization of Neural Networks with
Sensitivity Guided Search [7.392278887917975]
Mixed-precision quantization allows different tensors to be quantized to varying levels of numerical precision.
We evaluate our method for computer vision and natural language processing and demonstrate latency reductions of up to 27.59% and 34.31%.
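A minimal sketch of the general sensitivity-guided idea (not this paper's exact search; the simple weight quantizer and the 4-bit probe are assumptions): quantize one layer at a time, measure the loss increase, and reserve higher precision for the most sensitive layers.

```python
# Generic per-layer sensitivity probe for mixed-precision assignment.
import torch
import torch.nn as nn

def quantize_weight(w, bits):
    scale = w.abs().max() / (2 ** (bits - 1) - 1)
    return (w / scale).round() * scale

@torch.no_grad()
def layer_sensitivities(model, x, y, criterion, bits=4):
    base = criterion(model(x), y).item()
    sens = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            saved = module.weight.clone()
            module.weight.copy_(quantize_weight(module.weight, bits))
            sens[name] = criterion(model(x), y).item() - base
            module.weight.copy_(saved)
    return sens  # larger value => keep this layer at higher precision
```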
arXiv Detail & Related papers (2023-02-02T19:30:00Z) - Neural Networks with Quantization Constraints [111.42313650830248]
We present a constrained learning approach to quantization training.
We show that the resulting problem is strongly dual and does away with gradient estimations.
We demonstrate that the proposed approach exhibits competitive performance in image classification tasks.
arXiv Detail & Related papers (2022-10-27T17:12:48Z) - ClusterQ: Semantic Feature Distribution Alignment for Data-Free
Quantization [111.12063632743013]
We propose a new and effective data-free quantization method termed ClusterQ.
To obtain high inter-class separability of semantic features, we cluster and align the feature distribution statistics.
We also incorporate the intra-class variance to solve class-wise mode collapse.
arXiv Detail & Related papers (2022-04-30T06:58:56Z) - Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers.
We can obtain 81.29% top-1 accuracy with the DeiT-B model on the ImageNet dataset using about 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z) - Zero-shot Adversarial Quantization [11.722728148523366]
We propose a zero-shot adversarial quantization (ZAQ) framework, facilitating effective discrepancy estimation and knowledge transfer.
This is achieved by a novel two-level discrepancy modeling to drive a generator to synthesize informative and diverse data examples.
We conduct extensive experiments on three fundamental vision tasks, demonstrating the superiority of ZAQ over the strong zero-shot baselines.
arXiv Detail & Related papers (2021-03-29T01:33:34Z) - DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution
Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer from a severe performance drop at ultra-low precision (4 bits or fewer) or require a heavy fine-tuning process to recover the performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
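To illustrate the general idea of distribution-aware quantization (a generic per-channel sketch, not DAQ's exact scheme; the NCHW layout and the mean ± k·std clipping rule are assumptions): derive quantization parameters from each channel's statistics rather than from one global range.

```python
# Generic per-channel, distribution-aware fake quantization of an activation map.
import torch

def distribution_aware_quant(act, bits=4, k=3.0):
    # act: (N, C, H, W) feature map; clip each channel at mean +/- k*std.
    mean = act.mean(dim=(0, 2, 3), keepdim=True)
    std = act.std(dim=(0, 2, 3), keepdim=True)
    lo, hi = mean - k * std, mean + k * std
    scale = (hi - lo) / (2 ** bits - 1)
    q = ((act.clamp(lo, hi) - lo) / scale).round()
    return q * scale + lo                # dequantized approximation of act
```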
arXiv Detail & Related papers (2020-12-21T10:19:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.