Bi-VLM: Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models
- URL: http://arxiv.org/abs/2509.18763v1
- Date: Tue, 23 Sep 2025 07:55:48 GMT
- Title: Bi-VLM: Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models
- Authors: Xijun Wang, Junyun Huang, Rayyan Abdalla, Chengyuan Zhang, Ruiqi Xian, Dinesh Manocha
- Abstract summary: We propose Bi-VLM, which separates model weights non-uniformly based on the Gaussian quantiles. For the language model part of the VLM, our Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task. For the overall VLM, our Bi-VLM outperforms the SOTA by 4%-45%.
- Score: 41.569153064451385
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We address the critical gap between the computational demands of vision-language models and the ultra-low-bit weight precision (bitwidth $\leq 2$ bits) we can use for higher efficiency. Our work is motivated by the substantial computational cost and memory requirements of VLMs, which restrict their applicability in hardware-constrained environments. We propose Bi-VLM, which separates model weights non-uniformly based on the Gaussian quantiles. Our formulation groups the model weights into an outlier (salient) subset and multiple inlier (unsalient) subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile in the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and compression objective. We have evaluated our approach on different VLMs. For the language model part of the VLM, Bi-VLM outperforms the SOTA by 3%-47% on visual question answering across four benchmarks and three models. For the overall VLM, Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the quantized models and observe that 90%-99% of the image tokens are redundant, which allows us to prune visual tokens further to improve efficiency.
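The grouping step described in the abstract is easiest to see in code. Below is a minimal sketch of the idea: fit a Gaussian to a layer's weights, split them into multiple inlier groups and a salient tail group by quantile, and binarize each inlier group with its own scaler while leaving salient weights at full precision. The quantile edges, the per-group scaler choice, and the decision to skip salient weights entirely are illustrative assumptions, not the paper's exact saliency-aware hybrid quantization algorithm.

```python
import numpy as np
from scipy.stats import norm

def gaussian_quantile_groups(w, inlier_edges=(0.25, 0.5, 0.75), tail=0.02):
    """Assign every weight to a group using quantiles of a Gaussian fitted to w.

    Group 0 holds the salient tail weights (beyond the `tail` quantiles);
    groups 1..k hold the inlier weights split at `inlier_edges`. The edge
    values here are illustrative, not the proportions used in the paper.
    """
    mu, sigma = w.mean(), w.std()
    edges = norm.ppf((tail, *inlier_edges, 1.0 - tail), loc=mu, scale=sigma)
    groups = np.digitize(w, edges)           # 0 and len(edges) are the two tails
    salient = (groups == 0) | (groups == len(edges))
    return np.where(salient, 0, groups)      # 0 = salient, 1..k = inlier bins

def quantize_inliers(w, groups):
    """Binarize each inlier group separately: weights in group g are mapped to
    the two levels mu_g +/- alpha_g, with alpha_g the mean absolute deviation
    (a standard scaler choice for binary quantization). Salient weights
    (group 0) are left at full precision in this sketch."""
    w_q = w.copy()
    for g in np.unique(groups):
        if g == 0:
            continue
        m = groups == g
        mu_g = w[m].mean()
        alpha = np.abs(w[m] - mu_g).mean()
        w_q[m] = mu_g + alpha * np.sign(w[m] - mu_g)
    return w_q

# toy usage on a random "layer"
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(128, 256))
G = gaussian_quantile_groups(W)
W_q = quantize_inliers(W, G)
print("salient fraction:", (G == 0).mean())
print("mean abs error:", np.abs(W - W_q).mean())
```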
Related papers
- QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization [29.21308068128823]
We introduce QVLA, the first action-centric quantization framework specifically designed for embodied control. Our work establishes a new, principled foundation for compressing Vision-Language-Action models in robotics.
arXiv Detail & Related papers (2026-02-03T17:43:45Z) - QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models [13.850959421148273]
Vision-Language Models (VLMs) are integral to tasks such as image captioning and visual question answering. Their high computational cost, driven by large memory footprints and processing time, limits their scalability and real-time applicability. We propose leveraging Singular-Value Decomposition (SVD) over the joint query (Q), key (K), and value (V) weight matrices to reduce KV cache size and computational overhead.
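As a rough illustration of the joint-SVD idea in the QSVD summary above, the snippet below factors the stacked Q/K/V projection weights with a single SVD and keeps a shared low-rank subspace. The rank, matrix shapes, and the parameter-count comparison are arbitrary assumptions for illustration; QSVD's actual KV-cache handling is not modeled here.

```python
import numpy as np

def joint_qkv_lowrank(w_q, w_k, w_v, rank):
    """Sketch: factor the stacked [W_Q; W_K; W_V] projection with one SVD so
    the three projections share a rank-`rank` input subspace."""
    w = np.concatenate([w_q, w_k, w_v], axis=0)   # (3*d_out, d_in)
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]                    # (3*d_out, rank)
    b = vt[:rank, :]                              # (rank, d_in), shared factor
    return a, b                                   # w is approximated by a @ b

d = 64
rng = np.random.default_rng(1)
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
a, b = joint_qkv_lowrank(wq, wk, wv, rank=16)
full = np.concatenate([wq, wk, wv], axis=0)
err = np.linalg.norm(a @ b - full) / np.linalg.norm(full)
print(f"relative reconstruction error at rank 16: {err:.3f}")
print("parameters:", full.size, "->", a.size + b.size)
```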
arXiv Detail & Related papers (2025-10-18T01:31:14Z) - AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model [40.488271586857884]
AndesVL is a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3's LLM and various visual encoders. We introduce a 1+N LoRA architecture alongside a Quantization-Aware LoRA Fine-Tuning framework to facilitate efficient task adaptation and model compression. We achieve a 6.7x peak decoding speedup ratio, up to 30.9% memory reduction, and 1.8 bits-per-weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips.
arXiv Detail & Related papers (2025-10-13T15:04:38Z) - LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models [59.0256377330646]
Lens is a benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios. The dataset intrinsically supports evaluating how MLLMs handle image-invariable prompts, from basic perception to compositional reasoning. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, and GPT-4o, as well as two reasoning models, QVQ-72B-preview and Kimi-VL.
arXiv Detail & Related papers (2025-05-21T15:06:59Z) - MBQ: Modality-Balanced Quantization for Large Vision-Language Models [20.018652727875367]
Post-Training Quantization (PTQ) is an effective technique to reduce the memory and computation overhead. Existing PTQ methods mainly focus on large language models (LLMs), without considering the differences across other modalities. We propose Modality-Balanced Quantization (MBQ) for large vision-language models.
arXiv Detail & Related papers (2024-12-27T07:55:36Z) - Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z) - VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We study the potential for building universal embeddings capable of handling a wide range of downstream tasks. We build a series of VLM2Vec models on SoTA VLMs like Phi-3.5-V and LLaVA-1.6 and evaluate them on MMEB's evaluation split. Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models.
arXiv Detail & Related papers (2024-10-07T16:14:05Z) - LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid [36.33062038680275]
Large language models (LLMs) have shown immense potential across various domains.
Post-training quantization has emerged as a promising technique to reduce memory requirements and decoding latency.
We propose LeanQuant, a novel quantization method that is accurate, versatile, and scalable.
arXiv Detail & Related papers (2024-07-14T00:23:51Z) - PTQ4SAM: Post-Training Quantization for Segment Anything [28.893095276574893]
Segment Anything Model (SAM) has achieved impressive performance in many computer vision tasks.
However, as a large-scale model, the immense memory and computation costs hinder its practical deployment.
We propose a post-training quantization framework for Segment Anything Model, namely PTQ4SAM.
arXiv Detail & Related papers (2024-05-06T03:39:50Z) - ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)