Related papers: Rethinking Practical and Efficient Quantization Calibration for Vision-Language Models

URL: http://arxiv.org/abs/2602.07899v1
Date: Sun, 08 Feb 2026 10:19:25 GMT
Title: Rethinking Practical and Efficient Quantization Calibration for Vision-Language Models
Authors: Zhenhao Shang, Haizhao Jing, Guoting Wei, Haokui Zhang, Rong Xiao, Jianqing Gao, Peng Wang
Abstract summary: Post-training quantization (PTQ) is a primary approach for deploying large language models without fine-tuning. We propose the Token-level Importance-aware Layer-wise Quantization framework (TLQ). TLQ is evaluated across two models, three model scales, and two quantization settings, consistently achieving performance improvements across all settings.
Score: 11.411411301593011
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Post-training quantization (PTQ) is a primary approach for deploying large language models without fine-tuning, and the quantized performance is often strongly affected by the calibration in PTQ. By contrast, in vision-language models (VLMs), substantial differences between visual and text tokens in their activation distributions and sensitivities to quantization error pose significant challenges for effective calibration during PTQ. In this work, we rethink what PTQ calibration should align with in VLMs and propose the Token-level Importance-aware Layer-wise Quantization framework (TLQ). Guided by gradient information, we design a token-level importance integration mechanism for quantization error, and use it to construct a token-level calibration set, enabling a more fine-grained calibration strategy. Furthermore, TLQ introduces a multi-GPU, quantization-exposed layer-wise calibration scheme. This scheme keeps the layer-wise calibration procedure consistent with the true quantized inference path and distributes the complex layer-wise calibration workload across multiple RTX3090 GPUs, thereby reducing reliance on the large memory of A100 GPUs. TLQ is evaluated across two models, three model scales, and two quantization settings, consistently achieving performance improvements across all settings, indicating its strong quantization stability. The code will be released publicly.
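The abstract describes weighting quantization error by token-level importance. The following is a minimal sketch of that general idea, not the authors' implementation: the function names, the round-to-nearest quantizer, and the random `importance` scores (a stand-in for gradient-derived scores) are all assumptions made for illustration.

```python
import numpy as np


def quantize_weights(w, n_bits=4):
    """Symmetric per-row round-to-nearest quantization (illustrative quantizer only)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero for all-zero rows
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def weighted_layer_error(w, w_q, x, importance):
    """Importance-weighted layer output error.

    w, w_q:      (out_features, in_features) full-precision and quantized weights
    x:           (tokens, in_features) calibration activations
    importance:  (tokens,) non-negative per-token scores (hypothetical here)
    """
    diff = x @ (w - w_q).T                   # per-token output difference
    per_token = np.sum(diff ** 2, axis=1)    # squared error for each token
    weights = importance / importance.sum()  # normalize scores to sum to 1
    return float(np.sum(weights * per_token))


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
x = rng.normal(size=(32, 16))
# Assumption: random scores stand in for gradient-derived token importance.
importance = np.abs(rng.normal(size=32))

w_q = quantize_weights(w, n_bits=4)
uniform_err = weighted_layer_error(w, w_q, x, np.ones(32))
weighted_err = weighted_layer_error(w, w_q, x, importance)
print(f"uniform: {uniform_err:.6f}  importance-weighted: {weighted_err:.6f}")
```

With uniform scores the objective reduces to the usual mean layer-wise reconstruction error; non-uniform scores shift the calibration objective toward the tokens judged more important.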

Related papers

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models [21.01470580488428]
Vision-language-action (VLA) models unify perception, language, and control for embodied agents. We introduce QuantVLA, a training-free post-training quantization framework. It is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head.
arXiv Detail & Related papers (2026-02-23T19:55:54Z)
BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models [56.504879072674015]
We propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients. BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit).
arXiv Detail & Related papers (2026-02-04T02:54:37Z)
Quantized Visual Geometry Grounded Transformer [67.15451442018258]
This paper proposes the first quantization framework for VGGTs, namely QuantVGGT. We introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing. We also design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics.
arXiv Detail & Related papers (2025-09-25T15:17:11Z)
RSQ: Learning from Important Tokens Leads to Better Quantized LLMs [65.5558181902098]
Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. We propose RSQ (Rotate, Scale, then Quantize), which applies rotations to the model to mitigate outliers. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families.
arXiv Detail & Related papers (2025-03-03T18:46:33Z)
Rethinking Post-Training Quantization: Introducing a Statistical Pre-Calibration Approach [22.25748046511075]
Post-training Quantization (PTQ) techniques rely on calibration processes to maintain their accuracy. We propose a weight-adaptive PTQ method that can be considered a precursor to calibration-based PTQ methods. We show that our proposed approach can perform on par with most common calibration-based PTQ methods.
arXiv Detail & Related papers (2025-01-15T19:44:15Z)
GWQ: Gradient-Aware Weight Quantization for Large Language Models [56.22507677736051]
Large language models (LLMs) show impressive performance in solving complex language tasks. Compressing LLMs to low bits can enable deployment on resource-constrained devices. We propose gradient-aware weight quantization (GWQ), the first quantization approach for low-bit weight quantization.
arXiv Detail & Related papers (2024-10-30T11:16:04Z)
QSpec: Speculative Decoding with Complementary Quantization Schemes [53.960146187821685]
Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs). We propose QSpec, a novel quantization paradigm that decouples efficiency from quality. QSpec reuses both weights and KV cache across stages, enabling near-zero-cost switching without retraining or auxiliary models.
arXiv Detail & Related papers (2024-10-15T05:57:51Z)
OAC: Output-adaptive Calibration for Accurate Post-training Quantization [28.67781845829386]
Post-training Quantization (PTQ) techniques have been developed to compress Large Language Models (LLMs). Most PTQ approaches formulate the quantization error based on a layer-wise Euclidean loss, ignoring the model output. We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process.
arXiv Detail & Related papers (2024-05-23T20:01:17Z)
WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision. Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations. Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z)
RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers [2.114921680609289]
We propose RepQ-ViT, a novel PTQ framework for vision transformers (ViTs). RepQ-ViT decouples the quantization and inference processes. It can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level.
arXiv Detail & Related papers (2022-12-16T02:52:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.