RepQuant: Towards Accurate Post-Training Quantization of Large
Transformer Models via Scale Reparameterization
- URL: http://arxiv.org/abs/2402.05628v1
- Date: Thu, 8 Feb 2024 12:35:41 GMT
- Title: RepQuant: Towards Accurate Post-Training Quantization of Large
Transformer Models via Scale Reparameterization
- Authors: Zhikai Li, Xuewen Liu, Jing Zhang, and Qingyi Gu
- Abstract summary: Post-training quantization (PTQ) is a promising solution for compressing large transformer models.
Existing PTQ methods typically exhibit non-trivial performance loss.
We propose RepQuant, a novel PTQ framework with quantization-inference decoupling paradigm.
- Score: 8.827794405944637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large transformer models have demonstrated remarkable success. Post-training
quantization (PTQ), which requires only a small dataset for calibration and
avoids end-to-end retraining, is a promising solution for compressing these
large models. Regrettably, existing PTQ methods typically exhibit non-trivial
performance loss. We find that the performance bottleneck stems from
an over-emphasis on hardware compatibility in the quantization process, which
compels these methods to employ simple quantizers at the expense of accuracy.
Building on this insight, we propose RepQuant, a novel PTQ framework with a
quantization-inference decoupling paradigm that addresses this issue. RepQuant
employs complex quantizers in the quantization process and
simplified quantizers in the inference process, and performs mathematically
equivalent transformations between the two through quantization scale
reparameterization, thus ensuring both accurate quantization and efficient
inference. More specifically, we focus on two components with extreme
distributions: LayerNorm activations and Softmax activations. Initially, we
apply channel-wise quantization and log$\sqrt{2}$ quantization, respectively,
which are tailored to their distributions. In particular, for the former, we
introduce a learnable per-channel dual clipping scheme, which is designed to
efficiently identify outliers in the unbalanced activations with fine
granularity. Then, we reparameterize the scales to hardware-friendly layer-wise
quantization and log2 quantization for inference. Moreover, quantized weight
reconstruction is seamlessly integrated into the above procedure to further
push the performance limits. Extensive experiments on multiple large-scale
transformer variants, including vision, language, and multi-modal transformers,
across a range of tasks show that RepQuant delivers significant performance
advantages.
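To make the scale reparameterization concrete, below is a minimal numerical sketch, not RepQuant's actual implementation, of the channel-wise-to-layer-wise case for post-LayerNorm activations. It assumes symmetric quantization without zero points, omits clipping, and uses illustrative names throughout; RepQuant's full procedure additionally covers zero points, the learnable dual clipping, and the log$\sqrt{2}$-to-log2 reparameterization for Softmax.

```python
# Minimal sketch: fold per-channel quantization scales of a post-LayerNorm
# activation into the LayerNorm affine parameters and the following linear
# layer, so that a single layer-wise scale gives the same quantized values.
# Symmetric quantization, no zero points, clipping omitted for brevity.
import numpy as np

rng = np.random.default_rng(0)
C, D = 8, 16                       # LN output channels, next-layer output width

x_hat = rng.normal(size=(4, C))    # normalized LN input (before the affine part)
gamma = rng.uniform(0.5, 2.0, C)   # LN affine weight
beta  = rng.normal(size=C)         # LN affine bias
W     = rng.normal(size=(D, C))    # following linear layer: y = X @ W.T

X = gamma * x_hat + beta           # post-LayerNorm activation

# Calibration: per-channel scales for b-bit symmetric quantization
b = 4
s = np.abs(X).max(axis=0) / (2 ** (b - 1) - 1)        # shape (C,)

# Target layer-wise scale and per-channel variation factors
s_tilde = s.mean()
r = s / s_tilde                                        # shape (C,)

# Reparameterize: divide LN affine parameters by r, compensate in W
gamma_rep, beta_rep = gamma / r, beta / r
W_rep = W * r                                          # scales column c by r_c

X_rep = gamma_rep * x_hat + beta_rep                   # equals X / r

# (1) Layer-wise quantization of X_rep matches channel-wise quantization of X
assert np.allclose(np.round(X / s), np.round(X_rep / s_tilde))
# (2) The full-precision output of the next layer is unchanged
assert np.allclose(X @ W.T, X_rep @ W_rep.T)
print("scale reparameterization is mathematically equivalent")
```

At inference, only the rescaled LayerNorm parameters, the adjusted weights, and the single scale are needed, which is what makes the layer-wise quantizer hardware-friendly.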
Related papers
- Scalable quantum dynamics compilation via quantum machine learning [7.31922231703204]
Variational quantum compilation (VQC) methods employ variational optimization to reduce gate costs while maintaining high accuracy.
We show that our approach exceeds state-of-the-art compilation results in both system size and accuracy in one dimension (1D).
For the first time, we extend VQC to systems on two-dimensional (2D) strips with a quasi-1D treatment, demonstrating a significant resource advantage over standard Trotterization methods.
arXiv Detail & Related papers (2024-09-24T18:00:00Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers (a minimal example of this conversion is sketched after this list).
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
- MRQ: Support Multiple Quantization Schemes through Model Re-Quantization [0.17499351967216337]
Deep learning models cannot be easily quantized for diverse fixed-point hardware.
A new type of model quantization approach, called model re-quantization, is proposed.
Models obtained from the re-quantization process have been successfully deployed on NNA in the Echo Show devices.
arXiv Detail & Related papers (2023-08-01T08:15:30Z)
- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z)
- RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers [2.114921680609289]
We propose RepQ-ViT, a novel PTQ framework for vision transformers (ViTs).
RepQ-ViT decouples the quantization and inference processes.
It can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level.
arXiv Detail & Related papers (2022-12-16T02:52:37Z)
- NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers [53.85087932591237]
NoisyQuant is a quantizer-agnostic enhancement for the post-training activation quantization performance of vision transformers.
Building on this theoretical insight, NoisyQuant is the first method to actively alter the heavy-tailed activation distribution so that it better fits a given quantizer.
NoisyQuant largely improves the post-training quantization performance of vision transformers with minimal computation overhead.
arXiv Detail & Related papers (2022-11-29T10:02:09Z)
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization [17.05322956052278]
Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks.
However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices.
We show that transformers have unique quantization challenges, namely high dynamic activation ranges that are difficult to represent with a low-bit fixed-point format.
arXiv Detail & Related papers (2021-09-27T10:57:18Z)
- Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers.
We obtain 81.29% top-1 accuracy with the DeiT-B model on the ImageNet dataset using about 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z)
- Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy.
We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR.
With low-bit quantization, our FQSR achieves performance on par with the full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z)
- Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
We derive a simple regularization scheme that improves robustness against post-training quantization.
By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths.
arXiv Detail & Related papers (2020-02-18T12:31:34Z)
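For readers unfamiliar with the low-bit integer conversion referenced in the WKVQuant entry above, here is a generic, textbook-style sketch of per-tensor asymmetric (affine) quantization. It is not WKVQuant's weight/KV-cache scheme, and all names are illustrative.

```python
# Generic per-tensor asymmetric (affine) quantization: map float values to
# unsigned b-bit integers with a scale and zero point, then map back.
import numpy as np

def quantize(x: np.ndarray, bits: int = 8):
    qmax = 2 ** bits - 1
    scale = (x.max() - x.min()) / qmax            # step size of the integer grid
    zero_point = int(np.round(-x.min() / scale))  # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
q, scale, zp = quantize(x, bits=8)
print("max reconstruction error:", np.abs(x - dequantize(q, scale, zp)).max())
```

Methods such as WKVQuant and RepQuant refine this basic recipe, for example by choosing scales per channel or clipping outliers, but the underlying integer mapping is the same.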
This list is automatically generated from the titles and abstracts of the papers on this site.