Fugu-MT Paper Translation (Summary): Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning

Paper overview: Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning

arxiv url: http://arxiv.org/abs/2511.19518v1
Date: Mon, 24 Nov 2025 03:37:14 GMT
Status: Translation complete
Last updated in system: 2025-11-26 17:37:04.066236
Title: Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning
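Title (reference translation): Toward Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning
Authors: Zhaoqi Xu, Yingying Zhang, Jian Li, Jianwei Guo, Qiannan Zhu, Hua Huang
Abstract summary: InfoPrune is an information-theoretic framework for adaptive structural compression of vision-language models. In experiments on VQAv2, TextVQA, and GQA, InfoPrune achieves up to 3.2x FLOP reduction and 1.8x acceleration with negligible performance degradation.
Reference score (independently calculated attention score): 38.7577454874686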
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in vision-language models (VLMs) have shown remarkable performance across multimodal tasks, yet their ever-growing scale poses severe challenges for deployment and efficiency. Existing compression methods often rely on heuristic importance metrics or empirical pruning rules, lacking theoretical guarantees about information preservation. In this work, we propose InfoPrune, an information-theoretic framework for adaptive structural compression of VLMs. Grounded in the Information Bottleneck principle, we formulate pruning as a trade-off between retaining task-relevant semantics and discarding redundant dependencies. To quantify the contribution of each attention head, we introduce an entropy-based effective rank (eRank) and employ the Kolmogorov-Smirnov (KS) distance to measure the divergence between original and compressed structures. This yields a unified criterion that jointly considers structural sparsity and informational efficiency. Building on this foundation, we further design two complementary schemes: (1) a training-based head pruning guided by the proposed information loss objective, and (2) a training-free FFN compression via adaptive low-rank approximation. Extensive experiments on VQAv2, TextVQA, and GQA demonstrate that InfoPrune achieves up to 3.2x FLOP reduction and 1.8x acceleration with negligible performance degradation, establishing a theoretically grounded and practically effective step toward efficient multimodal large models.
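Abstract (reference translation): Recent progress in vision-language models (VLMs) has delivered remarkable performance across multimodal tasks, but their continually growing scale poses serious challenges for deployment and efficiency. Existing compression methods often rely on heuristic importance metrics or empirical pruning rules and lack theoretical guarantees about information preservation. This paper proposes InfoPrune, an information-theoretic framework for adaptive structural compression of VLMs. Based on the Information Bottleneck principle, pruning is formulated as a trade-off between retaining task-relevant semantics and discarding redundant dependencies. To quantify the contribution of each attention head, the authors introduce an entropy-based effective rank (eRank) and use the KS distance to measure the divergence between the original and compressed structures. This gives a unified criterion that jointly considers structural sparsity and informational efficiency. On this foundation, they design two complementary schemes: (1) training-based head pruning guided by the proposed information loss objective, and (2) training-free FFN compression via adaptive low-rank approximation. Extensive experiments on VQAv2, TextVQA, and GQA show that InfoPrune achieves up to 3.2x FLOP reduction and 1.8x acceleration with negligible performance degradation, establishing a theoretically grounded and practically effective step toward efficient multimodal large models.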
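
The abstract names two quantities, an entropy-based effective rank and the Kolmogorov-Smirnov distance, without giving formulas. As a rough illustration only, the sketch below uses the common textbook definitions: effective rank as the exponential of the Shannon entropy of the normalized singular values, and the two-sample KS statistic as the largest gap between two empirical CDFs. Whether InfoPrune defines eRank this way, and what exactly it feeds to the KS distance, is an assumption; the function names and the example spectra are hypothetical.

```python
import math
from bisect import bisect_right


def effective_rank(singular_values):
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s).

    Assumption: this is the common textbook definition; the paper's eRank
    may differ in detail.
    """
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def ks_distance(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gaps = []
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gaps.append(abs(cdf_a - cdf_b))
    return max(gaps)


# Hypothetical spectra for one attention head before and after compression.
original_spectrum = [4.0, 2.0, 1.0, 0.5]
compressed_spectrum = [4.0, 2.0]

print(effective_rank(original_spectrum))
print(effective_rank(compressed_spectrum))
print(ks_distance(original_spectrum, compressed_spectrum))
```

In practice the singular values would come from an SVD of a head's weight or activation matrix. Under this definition, a flat spectrum gives an effective rank close to the number of singular values, and a spectrum dominated by one value gives an effective rank close to 1.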

Related papers