VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models
- URL: http://arxiv.org/abs/2602.01037v1
- Date: Sun, 01 Feb 2026 05:53:09 GMT
- Title: VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models
- Authors: Guangshuo Qin, Zhiteng Li, Zheng Chen, Weihang Zhang, Linghe Kong, Yulun Zhang
- Abstract summary: Post-Training Quantization (PTQ) is an effective training-free technique to address the massive memory and computation overhead. Visual Expert Quantization (VEQ) is a dual-aware quantization framework designed to accommodate cross-modal differences and heterogeneity between experts. Our method achieves significant average accuracy gains of 2.04% on Kimi-VL and 3.09% on Qwen3-VL compared to the previous SOTA quantization methods.
- Score: 41.557274086591924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) Vision-Language Models (VLMs) offer remarkable performance but incur prohibitive memory and computational costs, making compression essential. Post-Training Quantization (PTQ) is an effective training-free technique to address the massive memory and computation overhead. Existing quantization paradigms fall short as they are oblivious to two critical forms of heterogeneity: the inherent discrepancy between vision and language tokens, and the non-uniform contribution of different experts. To bridge this gap, we propose Visual Expert Quantization (VEQ), a dual-aware quantization framework designed to simultaneously accommodate cross-modal differences and heterogeneity between experts. Specifically, VEQ incorporates (1) Modality-expert-aware Quantization, which utilizes expert activation frequency to prioritize error minimization for pivotal experts, and (2) Modality-affinity-aware Quantization, which constructs an enhanced Hessian matrix by integrating token-expert affinity with modality information to guide the calibration process. Extensive experiments across diverse benchmarks verify that VEQ consistently outperforms state-of-the-art baselines. Specifically, under the W3A16 configuration, our method achieves significant average accuracy gains of 2.04% on Kimi-VL and 3.09% on Qwen3-VL compared to the previous SOTA quantization methods, demonstrating superior robustness across various multimodal tasks. Our code will be available at https://github.com/guangshuoqin/VEQ.
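As a reading aid, the snippet below is a minimal sketch of the two ideas named in the abstract: accumulating a GPTQ-style calibration Hessian whose token contributions are weighted by modality, and turning per-expert activation frequency into an error-minimization priority. The `vision_weight` factor, the damping term, and the `expert_priority` normalization are illustrative assumptions, not the authors' VEQ implementation.

```python
import numpy as np

def modality_weighted_hessian(acts, is_vision, vision_weight=2.0, damp=0.01):
    """Accumulate a GPTQ-style Hessian H = X^T X over calibration tokens,
    up-weighting each token's contribution according to its modality.

    acts:      (num_tokens, hidden_dim) calibration activations routed to one expert
    is_vision: (num_tokens,) boolean mask, True for vision tokens
    """
    # Per-token weights: vision tokens contribute more (hypothetical factor).
    w = np.where(is_vision, vision_weight, 1.0)[:, None]
    x = acts * np.sqrt(w)                      # fold the weights into the outer product
    H = x.T @ x                                # weighted second-moment (Hessian proxy)
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[1])  # damping for numerical stability
    return H

def expert_priority(activation_counts):
    """Turn per-expert activation counts into relative error-minimization priorities."""
    freq = activation_counts / activation_counts.sum()
    return freq / freq.max()                   # the most frequently used expert gets 1.0

# Toy usage: one expert's calibration batch plus two experts' routing counts.
rng = np.random.default_rng(0)
acts = rng.standard_normal((512, 64))
is_vision = rng.random(512) < 0.4              # ~40% vision tokens
H = modality_weighted_hessian(acts, is_vision)
priority = expert_priority(np.array([900.0, 100.0]))
print(H.shape, priority)                       # (64, 64) [1.0, 0.111...]
```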
Related papers
- Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization [3.6899131505284455]
Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs). We propose Quant Experts (QE), a token-aware adaptive error compensation framework with mixture-of-experts for VLM quantization.
arXiv Detail & Related papers (2026-02-27T14:47:48Z)
- KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models [13.773876289947323]
Vector Quantization (VQ) offers a promising approach for ultra-low-bit compression in Large Language Models (LLMs). We propose KBVQ-MoE, a novel VQ framework to enhance extremely low-bit quantization for MoE-based LLMs. Experiments on various MoE LLMs demonstrate that KBVQ-MoE preserves accuracy substantially better than existing quantization methods.
arXiv Detail & Related papers (2026-01-30T06:57:17Z)
- Qwen3-VL Technical Report [153.3964813640593]
Qwen3-VL is the most capable vision-language model to date, achieving superior performance across a broad range of multimodal benchmarks. It supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-token comprehension with a native 256K-token window for both text and interleaved multimodal inputs; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks.
arXiv Detail & Related papers (2025-11-26T17:59:08Z)
- SPEED-Q: Staged Processing with Enhanced Distillation towards Efficient Low-bit On-device VLM Quantization [6.872509247180761]
Vision-Language Models (VLMs) are crucial for enabling low-latency and privacy-preserving intelligent applications. We propose SPEED-Q, a novel framework for low-bit weight-only quantization of VLMs. SPEED-Q achieves up to 6x higher accuracy than existing quantization methods under 2-bit settings.
arXiv Detail & Related papers (2025-11-12T02:47:24Z)
- MoPEQ: Mixture of Mixed Precision Quantized Experts [0.1262792599323502]
Mixed Precision Quantization assigns different precisions to different layers of an LLM/VLM based on layer sensitivity and importance within the model. We propose a Post-Training Quantization algorithm, MoPEQ, that assigns an optimal bit width to each expert. Our method balances accuracy and model size by analyzing each expert's sensitivity using a Hessian trace approximation.
arXiv Detail & Related papers (2025-09-02T17:04:59Z)
- MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization [46.40666108181214]
Mixture-of-Experts (MoE) models have emerged as a cornerstone of large-scale deep learning. MoE models have inherent complexities that challenge conventional quantization techniques. We propose EAQuant, a novel PTQ framework tailored for MoE architectures.
arXiv Detail & Related papers (2025-06-16T10:18:50Z)
- QSpec: Speculative Decoding with Complementary Quantization Schemes [53.960146187821685]
Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs). We propose QSpec, a novel quantization paradigm that decouples efficiency from quality. QSpec reuses both weights and KV cache across stages, enabling near-zero-cost switching without retraining or auxiliary models.
arXiv Detail & Related papers (2024-10-15T05:57:51Z)
- Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation [70.22782550540714]
We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW.
arXiv Detail & Related papers (2024-08-07T12:42:09Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) a Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format (a toy sketch of this decomposition follows the list).
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
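As referenced in the SqueezeLLM entry above, the following is a toy sketch of the Dense-and-Sparse idea: the largest-magnitude weights are pulled out into a full-precision sparse part so they no longer stretch the quantization range of the dense part. The percentile-based outlier rule and the naive symmetric uniform quantizer are simplifying assumptions standing in for the paper's sensitivity-based selection and non-uniform codebooks.

```python
import numpy as np

def dense_and_sparse_decompose(W, outlier_pct=0.5, n_bits=3):
    """Split W into a low-bit dense part plus a full-precision sparse outlier part."""
    # Treat the top outlier_pct% of weights by magnitude as outliers.
    thresh = np.percentile(np.abs(W), 100.0 - outlier_pct)
    outlier_mask = np.abs(W) >= thresh
    sparse_part = np.where(outlier_mask, W, 0.0)   # kept exactly (full precision)
    dense_part = np.where(outlier_mask, 0.0, W)    # outlier-free, easier to quantize

    # Naive symmetric uniform quantization of the dense part.
    scale = np.abs(dense_part).max() / (2 ** (n_bits - 1) - 1)
    q = np.round(dense_part / scale).astype(np.int8)
    dense_dequant = q * scale

    W_hat = dense_dequant + sparse_part            # reconstructed weights
    return W_hat, q, sparse_part, scale

# Toy usage: a small weight matrix with a handful of injected outliers.
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128)) * 0.02
W[rng.integers(0, 128, 20), rng.integers(0, 128, 20)] += 1.0
W_hat, *_ = dense_and_sparse_decompose(W)
print("mean abs reconstruction error:", np.mean(np.abs(W - W_hat)))
```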