QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
- URL: http://arxiv.org/abs/2602.20309v3
- Date: Fri, 27 Feb 2026 19:38:51 GMT
- Title: QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
- Authors: Jingxuan Zhang, Yunta Hsieh, Zhongwei Wan, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, Mi Zhang,
- Abstract summary: Vision-language-action (VLA) models unify perception, language, and control for embodied agents. We introduce QuantVLA, a training-free post-training quantization framework. It is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head.
- Score: 21.01470580488428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines and achieves about 70% relative memory savings on the quantized components, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.
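The abstract's second component, attention temperature matching, describes a per-head scale that stabilizes attention logits and is folded into the dequantization scales at inference. The paper does not give code, so the sketch below is a hypothetical illustration of that idea under simple assumptions: symmetric per-tensor quantization, a temperature estimated by matching logit standard deviations against a calibration buffer, and folding the temperature into the query projection's dequant scale. All function names are illustrative, not the paper's API.

```python
# Hypothetical sketch of per-head attention temperature matching.
# Assumptions (not from the paper): symmetric per-tensor quantization,
# temperature = std(full-precision logits) / std(quantized logits),
# and the temperature is absorbed into the query dequantization scale.

def quantize_sym(x, bits=8):
    """Symmetric per-tensor quantization: integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in x) or 1.0
    scale = amax / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return codes, scale

def std(xs):
    m = sum(xs) / len(xs)
    return (sum((v - m) ** 2 for v in xs) / len(xs)) ** 0.5

def temperature_for_head(fp_logits, q_logits):
    """Match the quantized head's logit spread to the full-precision one."""
    return std(fp_logits) / (std(q_logits) or 1.0)

# Folding: multiplying every logit of a head by t is equivalent to scaling
# the dequantized queries by t, so t costs nothing extra at inference.
q_vals = [0.31, -1.2, 0.05, 0.9]          # toy query activations
codes, scale = quantize_sym(q_vals)
t = temperature_for_head([1.0, -1.0, 2.0], [0.5, -0.5, 1.0])  # toy logits
folded_scale = scale * t                   # absorbed into dequant scale
dequantized = [c * folded_scale for c in codes]
```

Because the temperature multiplies an existing dequantization scale, the calibrated model keeps the exact operator schedule of the original, consistent with the abstract's claim that the architecture is left unchanged.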
Related papers
- HBVLA: Pushing 1-Bit Post-Training Quantization for Vision-Language-Action Models [11.913553037277472]
Vision-Language-Action (VLA) models enable instruction-following embodied control. Current methods fail to narrow the distribution gap between binarized and full-precision weights. We propose HBVLA, a VLA-tailored binarization framework.
arXiv Detail & Related papers (2026-02-14T10:23:45Z) - Rethinking Practical and Efficient Quantization Calibration for Vision-Language Models [11.411411301593011]
Post-training quantization (PTQ) is a primary approach for deploying large language models without fine-tuning. We propose the Token-level Importance-aware Layer-wise Quantization framework (TLQ). TLQ is evaluated across two models, three model scales, and two quantization settings, consistently achieving performance improvements across all settings.
arXiv Detail & Related papers (2026-02-08T10:19:25Z) - QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization [29.21308068128823]
We introduce QVLA, the first action-centric quantization framework specifically designed for embodied control. Our work establishes a new, principled foundation for compressing Vision-Language-Action models in robotics.
arXiv Detail & Related papers (2026-02-03T17:43:45Z) - D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs [33.883527341335856]
Weight-only post-training quantization (PTQ) is appealing as it reduces memory usage and enables practical speedup without low-bit operators or specialized hardware. However, accuracy often degrades significantly in weight-only PTQ at sub-4-bit precision. We propose D$^2$Quant, a novel weight-only PTQ framework that improves quantization from both the weight and activation perspectives.
arXiv Detail & Related papers (2026-01-30T05:49:48Z) - Quantized Visual Geometry Grounded Transformer [67.15451442018258]
This paper proposes the first quantization framework for VGGTs, namely QuantVGGT. We introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing. We also design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics.
arXiv Detail & Related papers (2025-09-25T15:17:11Z) - MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z) - SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [70.72227437717467]
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens.
arXiv Detail & Related papers (2025-06-15T05:04:17Z) - GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers [11.452135395287119]
Vision Transformers (ViTs) are essential in computer vision but computationally intensive. Model quantization aims to alleviate this difficulty, yet existing Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods exhibit significant limitations. This paper introduces General, Practical, and Lightning Quantization (GPLQ), a novel framework for efficient ViT quantization.
arXiv Detail & Related papers (2025-06-13T13:45:17Z) - Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Stateful optimizers (e.g., Adam) maintain auxiliary information even at 2x the model size in order to achieve optimal convergence. SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z) - RSQ: Learning from Important Tokens Leads to Better Quantized LLMs [65.5558181902098]
Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. We propose RSQ (Rotate, Scale, then Quantize), which applies rotations to the model to mitigate outliers. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families.
arXiv Detail & Related papers (2025-03-03T18:46:33Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a reconstruction scheme with cross-block dependencies, establishing long-range dependencies across multiple blocks to minimize error accumulation.
arXiv Detail & Related papers (2023-12-13T07:56:27Z) - Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
We derive a simple regularization scheme that improves robustness against post-training quantization.
By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths.
arXiv Detail & Related papers (2020-02-18T12:31:34Z)
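Several entries above (QVLA's channel-aware quantization, D$^2$Quant's weight-only PTQ, WKVQuant's weight and cache quantization) build on the same primitive: assigning independent scales to weight channels so that channels with very different dynamic ranges are not forced to share one scale. The sketch below is a generic, minimal illustration of that shared idea, not a reproduction of any listed paper's method; the 4-bit setting and the toy weight matrix are assumptions for demonstration.

```python
# Minimal per-channel symmetric weight quantization sketch (illustrative
# only). Each output channel (row) gets its own scale, so a small-range
# channel is not crushed by a large-range one sharing its scale.

def quantize_per_channel(weight, bits=4):
    """Quantize each row of `weight` independently to signed `bits` codes."""
    qmax = 2 ** (bits - 1) - 1
    codes, scales = [], []
    for row in weight:
        amax = max(abs(v) for v in row) or 1.0
        s = amax / qmax
        scales.append(s)
        codes.append([max(-qmax, min(qmax, round(v / s))) for v in row])
    return codes, scales

def dequantize(codes, scales):
    """Recover approximate floating-point weights from codes and scales."""
    return [[c * s for c in row] for row, s in zip(codes, scales)]

W = [[0.02, -0.01, 0.03],   # small-range channel
     [1.5, -2.0, 0.7]]      # large-range channel
codes, scales = quantize_per_channel(W)
W_hat = dequantize(codes, scales)
```

With a single shared scale, the first row would collapse to near-zero codes; per-channel scales bound each row's reconstruction error by half its own scale, which is why channel-aware treatment recurs throughout the papers above.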
This list is automatically generated from the titles and abstracts of the papers in this site.