GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers
- URL: http://arxiv.org/abs/2506.11784v1
- Date: Fri, 13 Jun 2025 13:45:17 GMT
- Title: GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers
- Authors: Guang Liang, Xinyao Liu, Jianxin Wu
- Abstract summary: Vision Transformers (ViTs) are essential in computer vision but computationally intensive. Model quantization aims to alleviate this difficulty, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations. This paper introduces General, Practical, and Lightning Quantization (GPLQ), a novel framework for efficient ViT quantization.
- Score: 11.452135395287119
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) are essential in computer vision but computationally intensive. Model quantization, particularly to low bit-widths such as 4-bit, aims to alleviate this difficulty, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations. PTQ often incurs a substantial accuracy drop, while QAT achieves high accuracy but suffers from prohibitive computational cost, limited generalization to downstream tasks, training instability, and a lack of open-source codebases. To address these challenges, this paper introduces General, Practical, and Lightning Quantization (GPLQ), a novel framework designed for efficient and effective ViT quantization. GPLQ is founded on two key empirical insights: the paramount importance of activation quantization and the necessity of preserving the model's original optimization ``basin'' to maintain generalization. Consequently, GPLQ employs a sequential ``activation-first, weights-later'' strategy. Stage 1 keeps weights in FP32 while quantizing activations with a feature-mimicking loss for a single epoch, keeping the model in its original ``basin'' and thereby preserving generalization. Stage 2 quantizes weights using a PTQ method. As a result, GPLQ is 100x faster than existing QAT methods, lowers the memory footprint below even that of FP32 training, and achieves 4-bit model performance highly competitive with FP32 models in terms of both accuracy on ImageNet and generalization to diverse downstream tasks, including fine-grained visual classification and object detection. We will release an easy-to-use open-source toolkit supporting multiple vision tasks.
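For concreteness, here is a minimal PyTorch sketch of the two-stage flow described above. The fake-quantizer, the MSE feature-mimicking loss, and the per-channel min-max weight rounding are illustrative assumptions; the abstract does not specify the exact quantizers or the PTQ method plugged into Stage 2.

```python
# Hypothetical sketch of GPLQ's "activation-first, weights-later" pipeline.
# Names and quantizer details are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuantAct(nn.Module):
    """Uniform fake-quantizer for activations with a learnable scale."""
    def __init__(self, bits: int = 4):
        super().__init__()
        self.qmax = 2 ** (bits - 1) - 1
        self.log_scale = nn.Parameter(torch.zeros(()))  # scale = exp(log_scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.log_scale.exp()
        x_q = torch.clamp(torch.round(x / scale), -self.qmax - 1, self.qmax) * scale
        return x + (x_q - x).detach()  # straight-through estimator

def stage1_activation_qat(student, teacher, loader, lr=1e-4):
    """Stage 1: weights stay FP32; only the activation-quantizer scales are
    trained for a single epoch with a feature-mimicking (MSE) loss against
    the FP32 teacher, keeping the student in the teacher's basin."""
    params = [p for n, p in student.named_parameters() if "log_scale" in n]
    opt = torch.optim.AdamW(params, lr=lr)
    teacher.eval()
    for x, _ in loader:  # one epoch, per the abstract
        with torch.no_grad():
            t_feat = teacher(x)  # features the student must mimic
        loss = F.mse_loss(student(x), t_feat)
        opt.zero_grad()
        loss.backward()
        opt.step()

@torch.no_grad()
def stage2_weight_ptq(model, bits: int = 4):
    """Stage 2: weights quantized post hoc; per-channel min-max rounding is
    a stand-in for the stronger PTQ method the paper plugs in here."""
    qmax = 2 ** (bits - 1) - 1
    for m in model.modules():
        if isinstance(m, nn.Linear):
            s = m.weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
            m.weight.copy_(torch.clamp(torch.round(m.weight / s), -qmax - 1, qmax) * s)
```

Stage 1 touches only a handful of scale parameters for one epoch and Stage 2 is training-free, which is what makes the method so much cheaper than end-to-end QAT.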
Related papers
- PoTPTQ: A Two-step Power-of-Two Post-training for LLMs [27.141872509108122]
Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks, but their deployment remains costly. Power-of-two (PoT) quantization is a general tool to counteract this difficulty. We propose a novel PoT quantization framework for LLM weights that (i) outperforms state-of-the-art accuracy in extremely low-precision number formats, and (ii) enables faster inference through more efficient dequantization.
arXiv Detail & Related papers (2025-07-16T06:44:14Z)
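As a hedged illustration of the power-of-two idea (a sketch of the PoT grid, not of PoTPTQ's two-step procedure): each weight is rounded to a signed power of two, so dequantization reduces to cheap shifts and sign flips.

```python
# Minimal power-of-two (PoT) weight quantization sketch. The two-step
# PoTPTQ procedure is more involved; this only shows the PoT grid that
# makes dequantization shift-friendly.
import torch

def pot_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Round each weight to sign * 2^e, with e restricted to the
    2^(bits-1) exponent levels just below the tensor's max magnitude."""
    sign = torch.sign(w)
    mag = w.abs().clamp_min(1e-12)
    e_max = torch.log2(mag.max()).floor().item()
    e_min = e_max - 2 ** (bits - 1) + 1
    e = torch.round(torch.log2(mag)).clamp(e_min, e_max)
    w_q = sign * (2.0 ** e)
    return torch.where(w == 0, torch.zeros_like(w), w_q)
```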
- Post-Training Quantization for Video Matting [20.558324038808664]
Video matting is crucial for applications such as film production and virtual reality. Post-Training Quantization (PTQ) is still in its nascent stages for video matting. This paper proposes a novel and general PTQ framework specifically designed for video matting models.
arXiv Detail & Related papers (2025-06-12T15:57:14Z)
- Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression [55.323397702682506]
Post-training quantization (PTQ) reduces a model's memory footprint by mapping full-precision weights to low-bit weights without costly retraining. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery.
arXiv Detail & Related papers (2025-04-10T02:19:03Z)
- GWQ: Gradient-Aware Weight Quantization for Large Language Models [56.22507677736051]
Large language models (LLMs) show impressive performance in solving complex language tasks. Compressing LLMs to low bit-widths enables deployment on resource-constrained devices. We propose gradient-aware weight quantization (GWQ), the first approach to leverage gradient information for low-bit weight quantization.
arXiv Detail & Related papers (2024-10-30T11:16:04Z)
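The summary gives little mechanistic detail, so the following is purely a hypothetical sketch of one way gradients could guide weight quantization: keep the weights with the largest gradient magnitudes in full precision and quantize the rest. This is an assumption, not GWQ's published algorithm.

```python
# Hypothetical gradient-aware mixed quantization sketch (an assumption,
# not GWQ's actual method): weights with the largest calibration-gradient
# magnitudes stay unquantized; the rest go to low bit-width.
import torch

@torch.no_grad()
def gradient_aware_quantize(weight, grad, bits=4, keep_ratio=0.01):
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max().clamp_min(1e-8) / qmax
    w_q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax) * scale
    k = max(1, int(keep_ratio * weight.numel()))   # e.g. top 1% by |grad|
    idx = grad.abs().view(-1).topk(k).indices      # salient weight positions
    w_q.view(-1)[idx] = weight.view(-1)[idx]       # keep them in full precision
    return w_q
```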
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevalent fundamental backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
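A logarithmic quantizer places levels at powers of a base rather than on a uniform grid, which suits long-tailed ViT activations; AdaLog's contribution is making that base adaptive. A minimal sketch with a fixed, illustrative base:

```python
# Sketch of a logarithmic (non-uniform) quantizer; AdaLog's key twist,
# adapting the base per quantizer, is only hinted at by the `base` argument.
import math
import torch

def log_quantize(x: torch.Tensor, bits: int = 4, base: float = 2.0):
    """Map non-negative activations onto levels x_max * base^(-k),
    k = 0 .. 2^bits - 2, reserving one code for exact zero."""
    levels = 2 ** bits - 1
    x_max = x.max().clamp_min(1e-8)
    k = -torch.log(x.clamp_min(1e-12) / x_max) / math.log(base)
    k = torch.clamp(torch.round(k), 0, levels - 1)
    x_q = x_max * base ** (-k)
    return torch.where(x <= 0, torch.zeros_like(x), x_q)
```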
- Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization [62.15918574997175]
It is known that language models contain outlier channels whose values are, on average, orders of magnitude higher than those of other channels.
We propose a strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization.
We show that regularizing both the inputs and outputs is crucial for preventing a model from "migrating" the difficulty of input quantization to the weights.
arXiv Detail & Related papers (2024-04-04T17:25:30Z)
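Kurtosis measures how heavy-tailed a distribution is, so penalizing it discourages exactly the outlier channels described above. A minimal sketch of such a regularizer (the Gaussian target of 3 and the loss weighting are illustrative assumptions):

```python
# Sketch of an activation kurtosis regularizer; the target kurtosis
# (3.0, i.e. Gaussian) and the weighting below are illustrative choices.
import torch

def kurtosis_loss(acts: torch.Tensor, target: float = 3.0) -> torch.Tensor:
    """Penalize deviation of E[((x - mu) / sigma)^4] from a Gaussian-like
    target, flattening the heavy tails that hurt activation quantization."""
    x = acts.flatten()
    mu, sigma = x.mean(), x.std().clamp_min(1e-8)
    kurt = (((x - mu) / sigma) ** 4).mean()
    return (kurt - target) ** 2

# usage: total_loss = task_loss + lambda_kurt * kurtosis_loss(layer_output)
```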
- LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection [35.35457515189062]
Post-Training Quantization (PTQ) has been widely adopted in 2D vision tasks.
LiDAR-PTQ can achieve state-of-the-art quantization performance when applied to CenterPoint.
LiDAR-PTQ is cost-effective, being $30\times$ faster than the quantization-aware training method.
arXiv Detail & Related papers (2024-01-29T03:35:55Z)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
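The ``hybrid'' aspect can be sketched as follows: most weight columns are cast to 4-bit integers while a small set of large-magnitude columns stays in FP16. The selection rule and scaling below are assumptions, not QUIK's exact scheme.

```python
# Hypothetical hybrid-precision sketch in the spirit of QUIK: most weight
# columns go to int4 (stored in int8 containers), while the columns with
# the largest magnitudes stay in FP16. Details here are assumptions.
import torch

@torch.no_grad()
def hybrid_int4_split(weight: torch.Tensor, n_outlier_cols: int = 32):
    outlier = weight.abs().amax(dim=0).topk(n_outlier_cols).indices
    keep = torch.ones(weight.shape[1], dtype=torch.bool, device=weight.device)
    keep[outlier] = False                              # True = quantize to int4
    w4 = weight[:, keep]
    scale = w4.abs().amax(dim=0).clamp_min(1e-8) / 7   # int4 range [-8, 7]
    q4 = torch.clamp(torch.round(w4 / scale), -8, 7).to(torch.int8)
    return q4, scale, weight[:, outlier].half(), keep
```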
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z)
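A hedged sketch of the plugin idea: after PTQ, freeze everything except the normalization layers' affine parameters and tune them briefly so the quantized model re-matches the float model on calibration data. The loss and optimizer choices below are assumptions.

```python
# Sketch of norm tweaking as a PTQ plugin: freeze all weights of the
# quantized model and nudge only LayerNorm affine parameters toward the
# FP model's outputs on calibration data. Loss/optimizer are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def tweak_norms(quant_model, fp_model, calib_loader, lr=1e-5, max_steps=100):
    for p in quant_model.parameters():
        p.requires_grad_(False)
    ln_params = []
    for m in quant_model.modules():
        if isinstance(m, nn.LayerNorm) and m.weight is not None:
            m.weight.requires_grad_(True)
            m.bias.requires_grad_(True)
            ln_params += [m.weight, m.bias]
    opt = torch.optim.Adam(ln_params, lr=lr)
    fp_model.eval()
    for step, x in enumerate(calib_loader):
        if step >= max_steps:
            break
        with torch.no_grad():
            target = fp_model(x)           # float reference outputs
        loss = F.mse_loss(quant_model(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```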
- A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance [49.1574468325115]
Accumulator-aware quantization (A2Q) is a novel weight quantization method designed to train quantized neural networks (QNNs) to avoid overflow during inference.
A2Q introduces a unique formulation inspired by weight normalization that constrains the L1-norm of model weights according to accumulator bit width bounds.
We show A2Q can train QNNs for low-precision accumulators while maintaining model accuracy competitive with a floating-point baseline.
arXiv Detail & Related papers (2023-08-25T17:28:58Z)
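The overflow-avoidance logic follows from a worst-case bound: with unsigned $N$-bit inputs, an integer dot product is at most $(2^N - 1)\,\|w\|_1$, so keeping $\|w\|_1 \le (2^{P-1} - 1)/(2^N - 1)$ guarantees a signed $P$-bit accumulator cannot overflow. The sketch below enforces this by rescaling; A2Q itself learns the norm through its weight-normalization-style parameterization, and the scale handling here is simplified.

```python
# Sketch of A2Q-style accumulator awareness: rescale each output channel's
# weights so their L1 norm satisfies a worst-case no-overflow bound for a
# P-bit signed accumulator and N-bit unsigned inputs. This simplification
# ignores quantization scales; A2Q learns the norm rather than clipping it.
import torch

@torch.no_grad()
def constrain_l1_for_accumulator(weight: torch.Tensor, acc_bits: int = 16,
                                 input_bits: int = 8) -> torch.Tensor:
    limit = (2 ** (acc_bits - 1) - 1) / (2 ** input_bits - 1)
    l1 = weight.abs().sum(dim=1, keepdim=True)            # per output channel
    factor = (limit / l1.clamp_min(1e-12)).clamp(max=1.0) # shrink only if needed
    return weight * factor
```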
- RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers [2.114921680609289]
We propose RepQ-ViT, a novel PTQ framework for vision transformers (ViTs).
RepQ-ViT decouples the quantization and inference processes.
It can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level.
arXiv Detail & Related papers (2022-12-16T02:52:37Z)