MRQ: Support Multiple Quantization Schemes through Model Re-Quantization
- URL: http://arxiv.org/abs/2308.01867v2
- Date: Fri, 4 Aug 2023 02:21:38 GMT
- Title: MRQ: Support Multiple Quantization Schemes through Model Re-Quantization
- Authors: Manasa Manohara, Sankalp Dayal, Tariq Afzal, Rahul Bakshi, Kahkuen Fu
- Abstract summary: Deep learning models cannot be easily quantized for diverse fixed-point hardware.
A new type of model quantization approach, called model re-quantization, is proposed.
Models obtained from the re-quantization process have been successfully deployed on NNA in the Echo Show devices.
- Score: 0.17499351967216337
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the proliferation of diverse hardware accelerators (e.g., NPU, TPU,
DPU), deploying deep learning models on edge devices with fixed-point hardware
is still challenging due to complex model quantization and conversion. Existing
model quantization frameworks like TensorFlow QAT [1], TFLite PTQ [2], and
Qualcomm AIMET [3] support only a limited set of quantization schemes (e.g.,
only asymmetric per-tensor quantization in TF1.x QAT [4]). Accordingly, deep
learning models cannot be easily quantized for diverse fixed-point hardware,
mainly due to slightly different quantization requirements. In this paper, we
envision a new type of model quantization approach called MRQ (model
re-quantization), which takes existing quantized models and quickly transforms
the models to meet different quantization requirements (e.g., asymmetric ->
symmetric, non-power-of-2 scale -> power-of-2 scale). Re-quantization is much
simpler than quantizing from scratch because it avoids costly re-training and
provides support for multiple quantization schemes simultaneously. To minimize
re-quantization error, we developed a new set of re-quantization algorithms
including weight correction and rounding error folding. We have demonstrated
that MobileNetV2 QAT model [7] can be quickly re-quantized into two different
quantization schemes (i.e., symmetric and symmetric+power-of-2 scale) with less
than 0.64 units of accuracy loss. We believe our work is the first to leverage
this concept of re-quantization for model quantization, and models obtained from
the re-quantization process have been successfully deployed on the NNA in Echo
Show devices.
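
To make the scheme change in the abstract concrete, the NumPy sketch below re-quantizes an already-quantized per-tensor weight from an asymmetric scheme into a symmetric (and optionally power-of-2 scale) scheme without retraining. It is a minimal illustration under assumed INT8 per-tensor settings; the function name and the max-abs rescaling are illustrative assumptions and do not reproduce the paper's weight-correction or rounding-error-folding algorithms.

```python
import numpy as np

def requantize_symmetric(q_w, scale, zero_point, num_bits=8, pow2_scale=False):
    """Illustrative re-quantization of an asymmetric per-tensor weight
    (q_w, scale, zero_point) into a symmetric scheme, optionally with a
    power-of-2 scale. Not the paper's MRQ algorithm: no weight correction
    or rounding-error folding is applied."""
    # Recover the real-valued weights implied by the source quantization.
    w = scale * (q_w.astype(np.float32) - zero_point)

    # Symmetric scheme: the zero point is fixed at 0, so the new scale must
    # cover max(|w|) within the signed integer range.
    qmax = 2 ** (num_bits - 1) - 1                     # 127 for INT8
    new_scale = max(np.max(np.abs(w)) / qmax, 1e-12)   # guard against all-zero weights

    if pow2_scale:
        # Round the scale up to the nearest power of two so dequantization
        # reduces to a bit shift on fixed-point hardware.
        new_scale = float(2.0 ** np.ceil(np.log2(new_scale)))

    # Re-quantize under the target scheme; no retraining is involved.
    q_w_new = np.clip(np.round(w / new_scale), -qmax - 1, qmax).astype(np.int8)

    # Re-quantization error that algorithms such as weight correction aim to reduce.
    err = float(np.max(np.abs(w - new_scale * q_w_new.astype(np.float32))))
    return q_w_new, new_scale, err

# Example with a toy asymmetric UINT8-style weight tensor (hypothetical values).
q_w = np.random.randint(0, 256, size=(64, 64))
q_sym, s_sym, e = requantize_symmetric(q_w, scale=0.02, zero_point=128, pow2_scale=True)
```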
Related papers
- The Quantum Imitation Game: Reverse Engineering of Quantum Machine Learning Models [2.348041867134616]
Quantum Machine Learning (QML) amalgamates quantum computing paradigms with machine learning models.
With the expansion of numerous third-party vendors in the Noisy Intermediate-Scale Quantum (NISQ) era of quantum computing, the security of QML models is of prime importance.
We assume the untrusted quantum cloud provider is an adversary having white-box access to the transpiled user-designed trained QML model during inference.
arXiv Detail & Related papers (2024-07-09T21:35:19Z)
- ERQ: Error Reduction for Post-Training Quantization of Vision Transformers [48.740630807085566]
Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models.
We propose ERQ, a two-step PTQ approach meticulously crafted to sequentially reduce the quantization error arising from activation and weight quantization.
ERQ surpasses the state-of-the-art GPTQ by 22.36% in accuracy for W3A4 ViT-S.
arXiv Detail & Related papers (2024-07-09T12:06:03Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
- RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization [8.827794405944637]
Post-training quantization (PTQ) is a promising solution for compressing large transformer models.
Existing PTQ methods typically exhibit non-trivial performance loss.
We propose RepQuant, a novel PTQ framework with quantization-inference decoupling paradigm.
arXiv Detail & Related papers (2024-02-08T12:35:41Z)
- A Study of Quantisation-aware Training on Time Series Transformer Models for Resource-constrained FPGAs [19.835810073852244]
This study explores the quantisation-aware training (QAT) on time series Transformer models.
We propose a novel adaptive quantisation scheme that dynamically selects between symmetric and asymmetric schemes during the QAT phase.
arXiv Detail & Related papers (2023-10-04T08:25:03Z)
- Towards Neural Variational Monte Carlo That Scales Linearly with System Size [67.09349921751341]
Quantum many-body problems are central to demystifying some exotic quantum phenomena, e.g., high-temperature superconductors.
The combination of neural networks (NN) for representing quantum states, and the Variational Monte Carlo (VMC) algorithm, has been shown to be a promising method for solving such problems.
We propose a NN architecture called Vector-Quantized Neural Quantum States (VQ-NQS) that utilizes vector-quantization techniques to leverage redundancies in the local-energy calculations of the VMC algorithm.
arXiv Detail & Related papers (2022-12-21T19:00:04Z)
- RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers [2.114921680609289]
We propose RepQ-ViT, a novel PTQ framework for vision transformers (ViTs).
RepQ-ViT decouples the quantization and inference processes.
It can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level.
arXiv Detail & Related papers (2022-12-16T02:52:37Z)
- Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model.
arXiv Detail & Related papers (2022-12-10T15:57:38Z)
- QFT: Post-training quantization via fast joint finetuning of all degrees of freedom [1.1744028458220428]
We rethink quantized network parameterization in a hardware-aware fashion, towards a unified analysis of all quantization degrees of freedom (DoF).
Our simple and extendable single-step method, dubbed quantization-aware finetuning (QFT), achieves 4-bit weight quantization results on par with SoTA.
arXiv Detail & Related papers (2022-12-05T22:38:58Z)
- Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation [48.838691414561694]
Nonuniform-to-Uniform Quantization (N2UQ) is a method that can maintain the strong representation ability of nonuniform methods while being hardware-friendly and efficient.
N2UQ outperforms state-of-the-art nonuniform quantization methods by 0.7~1.8% on ImageNet.
arXiv Detail & Related papers (2021-11-29T18:59:55Z)
- Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of the two sides.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
Then a bit-inheritance scheme is introduced to transfer the quantized models to the lower bit, which further reduces the time cost and improves the quantization accuracy.
arXiv Detail & Related papers (2020-10-09T03:52:16Z)