SPEED-Q: Staged Processing with Enhanced Distillation towards Efficient Low-bit On-device VLM Quantization
- URL: http://arxiv.org/abs/2511.08914v1
- Date: Thu, 13 Nov 2025 01:17:22 GMT
- Title: SPEED-Q: Staged Processing with Enhanced Distillation towards Efficient Low-bit On-device VLM Quantization
- Authors: Tianyu Guo, Shanwei Zhao, Shiai Zhu, Chenguang Ma
- Abstract summary: Vision-Language Models (VLMs) are crucial for enabling low-latency and privacy-preserving intelligent applications. We propose SPEED-Q, a novel framework for low-bit weight-only quantization of VLMs. SPEED-Q achieves up to 6x higher accuracy than existing quantization methods under 2-bit settings.
- Score: 6.872509247180761
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying Vision-Language Models (VLMs) on edge devices (e.g., smartphones and robots) is crucial for enabling low-latency and privacy-preserving intelligent applications. Given the resource constraints of these devices, quantization offers a promising solution by improving memory efficiency and reducing bandwidth requirements, thereby facilitating the deployment of VLMs. However, existing research has rarely explored aggressive quantization of VLMs, particularly models with 1B to 2B parameters, which are better suited to resource-constrained edge devices. In this paper, we propose SPEED-Q, a novel Staged Processing with Enhanced Distillation framework for VLM low-bit weight-only quantization that systematically addresses two critical obstacles: (1) significant discrepancies in quantization sensitivity between the vision (ViT) and language (LLM) components of VLMs; (2) training instability arising from the reduced numerical precision inherent in low-bit quantization. In SPEED-Q, a staged sensitivity-adaptive mechanism is introduced to effectively harmonize performance across modalities. We further propose a distillation-enhanced quantization strategy that stabilizes the training process and reduces data dependence. Together, these enable accurate, stable, and data-efficient quantization of complex VLMs. SPEED-Q is the first framework tailored for quantizing entire small-scale billion-parameter VLMs to low bits. Extensive experiments across multiple benchmarks demonstrate that SPEED-Q achieves up to 6x higher accuracy than existing quantization methods under 2-bit settings and consistently outperforms prior on-device VLMs under both 2-bit and 4-bit settings. Our code and models are available at https://github.com/antgroup/SPEED-Q.
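The abstract builds on two generic ingredients, group-wise low-bit weight-only quantization and distillation against a full-precision teacher, which can be sketched concretely. The PyTorch snippet below is a minimal illustration under our own assumptions (asymmetric uniform per-group quantization, logit-level KL distillation); it is not the authors' SPEED-Q implementation, and all function names are illustrative.

```python
# Minimal sketch (not the authors' code): group-wise low-bit weight-only
# fake quantization plus a logit-distillation loss, the two generic
# ingredients the abstract builds on. All names here are illustrative.
import torch
import torch.nn.functional as F

def quantize_weight_groupwise(w: torch.Tensor, bits: int = 2, group_size: int = 128):
    """Asymmetric uniform quantization of a 2-D weight, one scale/zero per group."""
    out_features, in_features = w.shape
    wg = w.reshape(out_features, in_features // group_size, group_size)
    wmin = wg.amin(dim=-1, keepdim=True)
    wmax = wg.amax(dim=-1, keepdim=True)
    qmax = 2 ** bits - 1
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(wg / scale + zero), 0, qmax)
    w_hat = (q - zero) * scale  # dequantized ("fake-quant") weights
    return w_hat.reshape(out_features, in_features)

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence between softened teacher and student distributions."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

# Usage: quantize a linear layer's weight and penalize drift from the
# full-precision model's outputs on a calibration batch.
layer = torch.nn.Linear(512, 512, bias=False)
w_hat = quantize_weight_groupwise(layer.weight.detach(), bits=2, group_size=128)
x = torch.randn(4, 512)
loss = distillation_loss(x @ w_hat.T, layer(x).detach())
print(f"2-bit distillation loss on random input: {loss.item():.4f}")
```

Smaller group sizes shrink per-group quantization error at the cost of storing more scales and zero points; at 2 bits this trade-off, together with keeping training stable against the distillation signal, is exactly where the abstract says SPEED-Q focuses.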
Related papers
- SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs [4.946856266233001]
SignRoundV2 is a post-training quantization framework that is highly effective even without mixed-precision. Our method sustains competitive accuracy for Large Language Models, achieving production-grade performance with about 1 percent variance at 4-5 bits.
arXiv Detail & Related papers (2025-12-04T12:35:10Z)
- Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models [97.55009021098554]
This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training. We introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs.
arXiv Detail & Related papers (2025-11-24T08:46:36Z)
- RSAVQ: Riemannian Sensitivity-Aware Vector Quantization for Large Language Models [17.273189597394072]
Large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. Their rapidly growing parameter counts pose significant challenges for deployment on resource-constrained devices. We propose RSAVQ, a novel framework to enhance extremely low-bit quantization for LLMs.
arXiv Detail & Related papers (2025-09-24T01:40:32Z)
- VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation [8.891793681316992]
Post-training quantization (PTQ) has emerged as an effective approach for compressing large models and accelerating their inference without retraining. While PTQ has been extensively studied in the context of large language models (LLMs), its applicability to vision-language models (VLMs) remains underexplored. We propose a novel importance-aware PTQ framework tailored for VLMs, dubbed VLMQ.
arXiv Detail & Related papers (2025-08-05T11:57:03Z)
- MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
- QSpec: Speculative Decoding with Complementary Quantization Schemes [53.960146187821685]
Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs). We propose QSpec, a novel quantization paradigm that decouples efficiency from quality. QSpec reuses both weights and KV cache across stages, enabling near-zero-cost switching without retraining or auxiliary models.
arXiv Detail & Related papers (2024-10-15T05:57:51Z)
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevalent fundamental backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
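The AdaLog entry above turns on the idea of a logarithmic, non-uniform quantizer. For intuition only, here is a plain power-of-two quantizer sketch; AdaLog's adaptive-base formulation is more involved and is not reproduced here, and the function name and level budget are our own.

```python
# A plain power-of-two (log2) quantizer for intuition only; AdaLog's actual
# adaptive-base formulation is more involved and is not reproduced here.
import torch

def log2_quantize(x: torch.Tensor, bits: int = 4):
    """Map values to the nearest power of two of their max, using 2^bits levels."""
    sign = torch.sign(x)
    mag = x.abs().clamp(min=1e-12)
    exp = torch.round(torch.log2(mag / mag.max()))  # integer exponents <= 0
    exp = exp.clamp(min=-(2 ** bits - 1))           # restrict to 2^bits levels
    return sign * mag.max() * torch.pow(2.0, exp)

x = torch.randn(8).abs()
print(x, log2_quantize(x, bits=3), sep="\n")
```

Relative to a uniform grid, the levels here are spaced geometrically, which is the non-uniformity the AdaLog abstract refers to.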
- ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers [7.155242379236052]
Quantization of Vision Transformers (ViTs) has emerged as a promising solution to mitigate these challenges.
Existing methods still suffer from significant accuracy loss at low bit-widths.
ADFQ-ViT provides significant improvements over various baselines in image classification, object detection, and instance segmentation tasks at 4-bit.
arXiv Detail & Related papers (2024-07-03T02:41:59Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
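The SliM-LLM entry above mentions salience-driven, group-wise bit-width allocation. A toy allocator under our own assumptions (mean |w| as the salience proxy and a two-level 2/4-bit budget averaging 3 bits) might look like the following; SliM-LLM's actual salience metric and allocator differ.

```python
# Illustrative only: rank weight groups by a simple salience proxy (mean |w|)
# and give the most salient groups more bits under an average-bit budget.
import torch

def allocate_bits(w: torch.Tensor, group_size: int = 128,
                  low: int = 2, high: int = 4, avg_bits: float = 3.0):
    """Return one bit-width per weight group so the mean equals avg_bits."""
    groups = w.reshape(-1, group_size)
    salience = groups.abs().mean(dim=1)        # crude proxy for importance
    n = salience.numel()
    n_high = int(round(n * (avg_bits - low) / (high - low)))
    bits = torch.full((n,), low, dtype=torch.long)
    bits[salience.topk(n_high).indices] = high  # most salient groups get more bits
    return bits

w = torch.randn(1024, 1024)
bits = allocate_bits(w)
print(f"mean bits: {bits.float().mean():.2f}")  # ~3.00
```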
- DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z)
- I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization [70.5455407146695]
We introduce I&S-ViT, a novel method that regulates the PTQ of ViTs in an inclusive and stable fashion. I&S-ViT elevates the performance of 3-bit ViT-B by an impressive 50.68%.
arXiv Detail & Related papers (2023-11-16T13:07:47Z)
- RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers [2.114921680609289]
We propose RepQ-ViT, a novel PTQ framework for vision transformers (ViTs).
RepQ-ViT decouples the quantization and inference processes.
It can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level.
arXiv Detail & Related papers (2022-12-16T02:52:37Z)