VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models
- URL: http://arxiv.org/abs/2602.01037v1
- Date: Sun, 01 Feb 2026 05:53:09 GMT
- Title: VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models
- Authors: Guangshuo Qin, Zhiteng Li, Zheng Chen, Weihang Zhang, Linghe Kong, Yulun Zhang
- Abstract summary: Post-Training Quantization (PTQ) is an effective training-free technique to address the massive memory and computation overhead. Visual Expert Quantization (VEQ) is a dual-aware quantization framework designed to accommodate cross-modal differences and heterogeneity between experts. Our method achieves significant average accuracy gains of 2.04% on Kimi-VL and 3.09% on Qwen3-VL compared to the previous SOTA quantization methods.
- Score: 41.557274086591924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) Vision-Language Models (VLMs) offer remarkable performance but incur prohibitive memory and computational costs, making compression essential. Post-Training Quantization (PTQ) is an effective training-free technique to address the massive memory and computation overhead. Existing quantization paradigms fall short as they are oblivious to two critical forms of heterogeneity: the inherent discrepancy between vision and language tokens, and the non-uniform contribution of different experts. To bridge this gap, we propose Visual Expert Quantization (VEQ), a dual-aware quantization framework designed to simultaneously accommodate cross-modal differences and heterogeneity between experts. Specifically, VEQ incorporates (1) Modality-expert-aware Quantization, which utilizes expert activation frequency to prioritize error minimization for pivotal experts, and (2) Modality-affinity-aware Quantization, which constructs an enhanced Hessian matrix by integrating token-expert affinity with modality information to guide the calibration process. Extensive experiments across diverse benchmarks verify that VEQ consistently outperforms state-of-the-art baselines. Specifically, under the W3A16 configuration, our method achieves significant average accuracy gains of 2.04% on Kimi-VL and 3.09% on Qwen3-VL compared to the previous SOTA quantization methods, demonstrating superior robustness across various multimodal tasks. Our code will be available at https://github.com/guangshuoqin/VEQ.
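As a reading aid, the snippet below is a minimal sketch of the two ideas named in the abstract: accumulating a GPTQ-style calibration Hessian whose token contributions are weighted by modality, and turning per-expert activation frequency into an error-minimization priority. The `vision_weight` factor, the damping term, and the `expert_priority` normalization are illustrative assumptions, not the authors' VEQ implementation.

```python
import numpy as np

def modality_weighted_hessian(acts, is_vision, vision_weight=2.0, damp=0.01):
    """Accumulate a GPTQ-style Hessian H = X^T X over calibration tokens,
    up-weighting each token's contribution according to its modality.

    acts:      (num_tokens, hidden_dim) calibration activations routed to one expert
    is_vision: (num_tokens,) boolean mask, True for vision tokens
    """
    # Per-token weights: vision tokens contribute more (hypothetical factor).
    w = np.where(is_vision, vision_weight, 1.0)[:, None]
    x = acts * np.sqrt(w)                      # fold the weights into the outer product
    H = x.T @ x                                # weighted second-moment (Hessian proxy)
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[1])  # damping for numerical stability
    return H

def expert_priority(activation_counts):
    """Turn per-expert activation counts into relative error-minimization priorities."""
    freq = activation_counts / activation_counts.sum()
    return freq / freq.max()                   # the most frequently used expert gets 1.0

# Toy usage: one expert's calibration batch plus two experts' routing counts.
rng = np.random.default_rng(0)
acts = rng.standard_normal((512, 64))
is_vision = rng.random(512) < 0.4              # ~40% vision tokens
H = modality_weighted_hessian(acts, is_vision)
priority = expert_priority(np.array([900.0, 100.0]))
print(H.shape, priority)                       # (64, 64) [1.0, 0.111...]
```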
Related papers
- Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization [3.6899131505284455]
Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs). We propose Quant Experts (QE), a token-aware adaptive error compensation framework with mixture-of-experts for VLM quantization.
arXiv Detail & Related papers (2026-02-27T14:47:48Z)
- KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models [13.773876289947323]
Vector Quantization (VQ) offers a promising approach for ultra-low-bit compression in Large Language Models (LLMs). We propose KBVQ-MoE, a novel VQ framework to enhance extremely low-bit quantization for MoE-based LLMs. Experiments on various MoE LLMs demonstrate that KBVQ-MoE preserves accuracy substantially better than existing quantization methods.
arXiv Detail & Related papers (2026-01-30T06:57:17Z)
- Qwen3-VL Technical Report [153.3964813640593]
Qwen3-VL is the most capable vision-language model to date, achieving superior performance across a broad range of multimodal benchmarks. It supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-token comprehension with a native 256K-token window for both text and interleaved multimodal inputs; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks.
arXiv Detail & Related papers (2025-11-26T17:59:08Z)
- SPEED-Q: Staged Processing with Enhanced Distillation towards Efficient Low-bit On-device VLM Quantization [6.872509247180761]
Vision-Language Models (VLMs) are crucial for enabling low-latency and privacy-preserving intelligent applications. We propose SPEED-Q, a novel framework for low-bit weight-only quantization of VLMs. SPEED-Q achieves up to 6x higher accuracy than existing quantization methods under 2-bit settings.
arXiv Detail & Related papers (2025-11-12T02:47:24Z)
- MoPEQ: Mixture of Mixed Precision Quantized Experts [0.1262792599323502]
Mixed Precision Quantization assigns different precisions to different layers of an LLM/VLM based on layer sensitivity and importance within the model. We propose a Post-Training Quantization algorithm, MoPEQ, that assigns an optimal bit width to each expert. Our method balances accuracy and model size by analyzing each expert's sensitivity using a Hessian trace approximation.
arXiv Detail & Related papers (2025-09-02T17:04:59Z)
- MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization [46.40666108181214]
Mixture-of-Experts (MoE) models have emerged as a cornerstone of large-scale deep learning. MoE models have inherent complexities that challenge conventional quantization techniques. We propose EAQuant, a novel PTQ framework tailored for MoE architectures.
arXiv Detail & Related papers (2025-06-16T10:18:50Z)
- QSpec: Speculative Decoding with Complementary Quantization Schemes [53.960146187821685]
Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs). We propose QSpec, a novel quantization paradigm that decouples efficiency from quality. QSpec reuses both weights and KV cache across stages, enabling near-zero-cost switching without retraining or auxiliary models.
arXiv Detail & Related papers (2024-10-15T05:57:51Z)
- Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation [70.22782550540714]
We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW.
arXiv Detail & Related papers (2024-08-07T12:42:09Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) a Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format (a toy sketch of this decomposition follows the list).
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
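As referenced in the SqueezeLLM entry above, the following is a toy sketch of the Dense-and-Sparse idea: the largest-magnitude weights are pulled out into a full-precision sparse part so they no longer stretch the quantization range of the dense part. The percentile-based outlier rule and the naive symmetric uniform quantizer are simplifying assumptions standing in for the paper's sensitivity-based selection and non-uniform codebooks.

```python
import numpy as np

def dense_and_sparse_decompose(W, outlier_pct=0.5, n_bits=3):
    """Split W into a low-bit dense part plus a full-precision sparse outlier part."""
    # Treat the top outlier_pct% of weights by magnitude as outliers.
    thresh = np.percentile(np.abs(W), 100.0 - outlier_pct)
    outlier_mask = np.abs(W) >= thresh
    sparse_part = np.where(outlier_mask, W, 0.0)   # kept exactly (full precision)
    dense_part = np.where(outlier_mask, 0.0, W)    # outlier-free, easier to quantize

    # Naive symmetric uniform quantization of the dense part.
    scale = np.abs(dense_part).max() / (2 ** (n_bits - 1) - 1)
    q = np.round(dense_part / scale).astype(np.int8)
    dense_dequant = q * scale

    W_hat = dense_dequant + sparse_part            # reconstructed weights
    return W_hat, q, sparse_part, scale

# Toy usage: a small weight matrix with a handful of injected outliers.
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128)) * 0.02
W[rng.integers(0, 128, 20), rng.integers(0, 128, 20)] += 1.0
W_hat, *_ = dense_and_sparse_decompose(W)
print("mean abs reconstruction error:", np.mean(np.abs(W - W_hat)))
```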