Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
- URL: http://arxiv.org/abs/2602.24059v1
- Date: Fri, 27 Feb 2026 14:47:48 GMT
- Title: Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
- Authors: Chenwei Jia, Baoting Li, Xuchong Zhang, Mingzhuo Wei, Bochen Lin, Hongbin Sun
- Abstract summary: Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs). We propose Quant Experts (QE), a token-aware adaptive error-compensation framework with a mixture of experts for VLM quantization.
- Score: 3.6899131505284455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs) by compressing both weights and activations without retraining the full model. Existing PTQ methods primarily rely on static identification and global compensation of sensitive or outlier channels, yet they often overlook the distributional differences of these important channels across inputs, leading to unsatisfactory quantization. In this work, we observe that the distributions and occurrence frequencies of important channels vary significantly both across modalities and among tokens, even within the same modality. Accordingly, we propose \textbf{Quant Experts (QE)}, a token-aware adaptive error-compensation framework with a mixture of experts for VLM quantization. QE divides the important channels into token-independent and token-dependent groups. For the former, a shared expert is designed for most tokens to compensate for global quantization error using a low-rank adapter. For the latter, routed experts comprising multiple routed low-rank adapters are introduced to compensate for local quantization error tied to specific tokens. Extensive experiments demonstrate that QE consistently enhances task accuracy across various quantization settings and model scales, ranging from 2B to 70B parameters, while maintaining performance comparable to full-precision models.
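From the abstract, the compensation scheme can be pictured as a quantized linear layer augmented with one always-on low-rank adapter (the shared expert) plus a small bank of routed low-rank adapters selected per token. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the class name QuantExpertsLinear, the rank, num_routed_experts, and top_k parameters, and the softmax top-k router are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantExpertsLinear(nn.Module):
    """Hypothetical sketch of QE-style error compensation around a quantized
    linear layer: a shared low-rank adapter corrects token-independent
    (global) quantization error, and routed low-rank adapters correct
    token-dependent (local) error. Names and hyperparameters are assumptions,
    not taken from the paper."""

    def __init__(self, quant_linear: nn.Module, in_features: int, out_features: int,
                 rank: int = 16, num_routed_experts: int = 4, top_k: int = 1):
        super().__init__()
        self.quant_linear = quant_linear  # frozen low-bit weight/activation layer
        self.top_k = top_k
        # Shared expert: one low-rank adapter applied to every token.
        self.shared_A = nn.Parameter(0.01 * torch.randn(in_features, rank))
        self.shared_B = nn.Parameter(torch.zeros(rank, out_features))
        # Routed experts: a bank of low-rank adapters selected per token.
        self.routed_A = nn.Parameter(0.01 * torch.randn(num_routed_experts, in_features, rank))
        self.routed_B = nn.Parameter(torch.zeros(num_routed_experts, rank, out_features))
        self.router = nn.Linear(in_features, num_routed_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, in_features) -- flatten batch/sequence dims beforehand.
        y = self.quant_linear(x)                      # quantized base output
        y = y + (x @ self.shared_A) @ self.shared_B   # global error compensation
        gates = F.softmax(self.router(x), dim=-1)     # (num_tokens, num_experts)
        weights, idx = gates.topk(self.top_k, dim=-1) # per-token expert choice
        for k in range(self.top_k):
            A = self.routed_A[idx[:, k]]              # (num_tokens, in, rank)
            B = self.routed_B[idx[:, k]]              # (num_tokens, rank, out)
            corr = torch.bmm(torch.bmm(x.unsqueeze(1), A), B).squeeze(1)
            y = y + weights[:, k:k + 1] * corr        # local error compensation
        return y
```

In a PTQ setting, only the adapters and the router would presumably be fit on a small calibration set to minimize the gap between full-precision and quantized layer outputs; the number of routed experts, the rank, and the routing granularity are the kind of design choices the paper would tune.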
Related papers
- A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs [64.8510381475827]
Sparse Mixture-of-Experts (SMoE) architectures are increasingly used to scale large language models efficiently. SMoE models often suffer from severe load imbalance across experts, where a small subset of experts receives most tokens while others are underutilized. We present a systematic analysis of expert routing during inference and identify three findings: (i) load imbalance persists and worsens with larger batch sizes, (ii) selection frequency does not reliably reflect expert importance, and (iii) overall expert workload and importance can be estimated using a small calibration set.
arXiv Detail & Related papers (2026-02-23T15:11:16Z)
- Rethinking Practical and Efficient Quantization Calibration for Vision-Language Models [11.411411301593011]
Post-training quantization (PTQ) is a primary approach for deploying large language models without fine-tuning. We propose the Token-level Importance-aware Layer-wise Quantization framework (TLQ). TLQ is evaluated across two models, three model scales, and two quantization settings, consistently achieving performance improvements across all settings.
arXiv Detail & Related papers (2026-02-08T10:19:25Z)
- VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models [41.557274086591924]
Post-Training Quantization (PTQ) is an effective training-free technique to address the massive memory and computation overhead. Visual Expert Quantization (VEQ) is a dual-aware quantization framework designed to accommodate cross-modal differences and heterogeneity between experts. Our method achieves significant average accuracy gains of 2.04% on Kimi-VL and 3.09% on Qwen3-VL compared to the previous SOTA quantization methods.
arXiv Detail & Related papers (2026-02-01T05:53:09Z)
- KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models [13.773876289947323]
Vector Quantization (VQ) offers a promising approach for ultra-low-bit compression in Large Language Models (LLMs). We propose KBVQ-MoE, a novel VQ framework to enhance extremely low-bit quantization for MoE-based LLMs. Experiments on various MoE LLMs demonstrate that KBVQ-MoE preserves accuracy substantially better than existing quantization methods.
arXiv Detail & Related papers (2026-01-30T06:57:17Z)
- Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts [74.40169987564724]
Expert parallelism (EP) is designed to scale MoE models by distributing experts across multiple devices. Under extreme imbalance, EP can funnel a disproportionate number of tokens to a small number of experts, leading to compute- and memory-bound failures. We propose Least-Loaded Expert Parallelism (LLEP), a novel EP algorithm that dynamically reroutes excess tokens and associated expert parameters from overloaded devices to underutilized ones.
arXiv Detail & Related papers (2026-01-23T18:19:15Z)
- MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance [10.817003682434425]
Mixture-of-Experts (MoE) large language models (LLMs) leverage dynamic routing and sparse activation to enhance efficiency and scalability. Post-training quantization (PTQ) encounters severe accuracy degradation and diminished performance when applied to MoE models. This paper investigates the impact of MoE's sparse and dynamic characteristics on quantization.
arXiv Detail & Related papers (2025-05-02T08:51:55Z)
- QSpec: Speculative Decoding with Complementary Quantization Schemes [53.960146187821685]
Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs). We propose QSpec, a novel quantization paradigm that decouples efficiency from quality. QSpec reuses both weights and KV cache across stages, enabling near-zero-cost switching without retraining or auxiliary models.
arXiv Detail & Related papers (2024-10-15T05:57:51Z)
- Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions to achieve better quantization alignment. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z)
- Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization [62.15918574997175]
It is known that language models contain outlier channels whose values on average are orders of magnitude higher than other channels.
We propose a strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization.
We show that regularizing both the inputs and outputs is crucial for preventing a model's "migrating" the difficulty in input quantization to the weights.
arXiv Detail & Related papers (2024-04-04T17:25:30Z)
- Overcoming Distribution Mismatch in Quantizing Image Super-Resolution Networks [53.23803932357899]
Quantization leads to accuracy loss in image super-resolution (SR) networks.
Existing works address this distribution mismatch problem by dynamically adapting quantization ranges during test time.
We propose a new quantization-aware training scheme that effectively Overcomes the Distribution Mismatch problem in SR networks.
arXiv Detail & Related papers (2023-07-25T08:50:01Z)
- Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning in a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above (including all summaries) and is not responsible for any consequences arising from its use.