Fugu-MT Paper Translation (Abstract): Bi-VLM: Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models

Paper overview: Bi-VLM: Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models

arxiv url: http://arxiv.org/abs/2509.18763v1
Date: Tue, 23 Sep 2025 07:55:48 GMT
Status: Translation complete
Last updated in system: 2025-09-24 20:41:27.761895
Title: Bi-VLM: Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models
Title (reference translation): Bi-VLM: Pushing the Boundaries of Ultra-Low-Precision Post-Training Quantization in Vision-Language Models
Authors: Xijun Wang, Junyun Huang, Rayyan Abdalla, Chengyuan Zhang, Ruiqi Xian, Dinesh Manocha
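Abstract summary: We propose Bi-VLM, which separates model weights non-uniformly based on Gaussian quantiles. For the language model part of the VLM, Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task. For the overall VLM, Bi-VLM outperforms the SOTA by 4%-45%.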
Reference score (independently computed attention score): 41.569153064451385
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We address the critical gap between the computational demands of vision-language models (VLMs) and the ultra-low-bit weight precision (bitwidth $\leq 2$ bits) that we can use for higher efficiency. Our work is motivated by the substantial computational cost and memory requirements of VLMs, which restrict their applicability in hardware-constrained environments. We propose Bi-VLM, which separates model weights non-uniformly based on Gaussian quantiles. Our formulation groups the model weights into an outlier (salient) subset and multiple inlier (unsalient) subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile in the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and compression objective. We have evaluated our approach on different VLMs. For the language model part of the VLM, Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task across four benchmarks and three models. For the overall VLM, Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the quantized models and observe that 90%-99% of the image tokens in the quantized models are redundant. This lets us further prune the visual tokens to improve efficiency.
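Abstract (reference translation): There is a critical gap between the computational demands of vision-language models and the ultra-low-bit weight precision (bitwidth $\leq 2$ bits) that can be used for higher efficiency. Our work is motivated by the substantial computational cost and memory requirements that limit the applicability of VLMs in hardware-constrained environments. We propose Bi-VLM, which separates model weights non-uniformly based on Gaussian quantiles. Our formulation divides the model weights into an outlier (salient) subset and multiple inlier (unsalient) subsets, ensuring that each subset contains the proportion of weights corresponding to its quantile in the distribution. We propose a saliency-aware hybrid quantization algorithm and quantize the weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and the compression objective. We evaluated our approach on different VLMs. For the language model part of the VLM, Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task across four benchmarks and three models. For the overall VLM, Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the quantized models and observe that 90%-99% of the image tokens in the quantized models are redundant. This allows the visual tokens to be pruned further to improve efficiency.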
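The quantile-based grouping described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the saliency metric or the constraints placed on the scaler and binary matrices. The sketch assumes that saliency is simply the distance of a weight from the mean, that the inlier groups carry equal Gaussian probability mass, and that each group is binarized with a single least-squares scale. The function names `gaussian_quantile_groups` and `binarize_by_group`, and the parameters `num_inlier_groups` and `outlier_fraction`, are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def gaussian_quantile_groups(weights, num_inlier_groups=3, outlier_fraction=0.05):
    """Assign each weight to a magnitude group using Gaussian quantiles.

    Groups 0..num_inlier_groups-1 are inliers; the last group holds outliers.
    Assumes 0 < outlier_fraction < 1 and non-constant weights (illustrative only).
    """
    mu, sigma = float(weights.mean()), float(weights.std())
    z = np.abs(weights - mu) / sigma  # distance from the mean in std units
    # Cumulative probability mass of |Z| covered up to each inlier edge.
    steps = np.arange(1, num_inlier_groups + 1) / num_inlier_groups
    mass = (1.0 - outlier_fraction) * steps
    # P(|Z| <= t) = p  <=>  t = Phi^{-1}((1 + p) / 2) for a standard normal Z.
    std_normal = NormalDist()
    edges = np.array([std_normal.inv_cdf((1.0 + float(p)) / 2.0) for p in mass])
    return np.searchsorted(edges, z), mu


def binarize_by_group(weights, groups, mu):
    """Approximate w as mu + alpha_g * sign(w - mu), with one scale per group."""
    centered = weights - mu
    signs = np.where(centered >= 0, 1.0, -1.0)
    approx = np.full_like(weights, mu)
    for g in np.unique(groups):
        mask = groups == g
        # Mean absolute value is the least-squares optimal scale for fixed signs.
        alpha = np.abs(centered[mask]).mean()
        approx[mask] += alpha * signs[mask]
    return approx


# Synthetic Gaussian weights stand in for a real layer (an assumption).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=20000)

groups, mu = gaussian_quantile_groups(w)
fractions = np.bincount(groups, minlength=4) / w.size

grouped_error = np.mean((w - binarize_by_group(w, groups, mu)) ** 2)
single_error = np.mean((w - binarize_by_group(w, np.zeros_like(groups), mu)) ** 2)

print("group fractions:", fractions)
print("MSE with quantile groups:", grouped_error)
print("MSE with a single scale:", single_error)
```

On weights that are roughly Gaussian, each inlier group should hold about a third of the non-outlier weights, and per-group scales cannot do worse than a single shared scale in mean squared error. The sketch ignores the cost of storing each weight's group index, which any real bit-width accounting would have to include.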


Related papers