Fugu-MT Paper Translation (Abstract): Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

Paper summary: Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

arxiv url: http://arxiv.org/abs/2604.24380v1
Date: Mon, 27 Apr 2026 12:10:44 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.984724
Title: Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
Authors: Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata
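Abstract summary: Large Vision Language Models (LVLMs) show impressive capabilities, but their substantial compute and memory requirements make deployment on resource-constrained edge devices difficult. This work compresses existing LVLMs by applying structured pruning to the language model backbone, followed by lightweight recovery training.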
Reference score (attention score computed independently by this site): 65.07757274822207
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from small language models, but these methods offer limited flexibility and remain computationally intensive. We study a complementary route: compressing existing LVLMs by applying structured pruning to the language model backbone, followed by lightweight recovery training. Specifically, we investigate two structural pruning paradigms: layerwise and widthwise pruning, and pair them with supervised finetuning and knowledge distillation on logits and hidden states. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios, where computational resources are limited or there is insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels. Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance. Through empirical study on three representative LVLM families ranging from 3B to 7B parameters, this study offers actionable insights for practitioners to compress LVLMs without extensive computation resources or sufficient data. The code base is available at https://github.com/YiranHuangIrene/VLMCompression.git.
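To make the abstract's terminology concrete, here is a minimal sketch of the two pruning paradigms and of a combined recovery objective. It is an illustration under stated assumptions, not the authors' implementation: the function names, the L2-norm importance score, the MSE form of the hidden-state term, and the `alpha` weight are all hypothetical choices, and the toy weights are plain Python lists rather than real model tensors.

```python
import math

def layerwise_prune(layers, keep_indices):
    # Layerwise (depth) pruning: drop whole layers, keep the listed ones in order.
    return [layers[i] for i in sorted(keep_indices)]

def widthwise_prune(w_in, w_out, n_keep):
    # Widthwise pruning of one MLP block (intermediate dimension only).
    # w_in:  one row per hidden unit      (hidden x d_model)
    # w_out: one column per hidden unit   (d_model x hidden)
    # Assumed importance score: L2 norm of each unit's input weights.
    scores = [math.sqrt(sum(v * v for v in row)) for row in w_in]
    ranked = sorted(range(len(w_in)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original unit order
    pruned_in = [w_in[i] for i in kept]
    pruned_out = [[row[i] for i in kept] for row in w_out]
    return pruned_in, pruned_out, kept

def cross_entropy(logits, target):
    # Numerically stable softmax cross-entropy for a single token.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mse(a, b):
    # Mean squared error between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def recovery_loss(student_logits, target, student_hidden, teacher_hidden, alpha=1.0):
    # Supervised finetuning term plus a hidden-state distillation term.
    # alpha is a hypothetical weighting, not a value taken from the paper.
    sft = cross_entropy(student_logits, target)
    distill = mse(student_hidden, teacher_hidden)
    return sft + alpha * distill

# Usage on toy data.
print(layerwise_prune(["L0", "L1", "L2", "L3"], [0, 3]))
print(widthwise_prune([[3, 4], [0, 1], [1, 0]], [[1, 2, 3], [4, 5, 6]], n_keep=1))
print(recovery_loss([0.0, 0.0], 0, [1.0, 2.0], [1.0, 4.0], alpha=0.5))
```

The sketch prunes only the MLP intermediate dimension, so the student and teacher hidden states keep the same size and can be compared directly. If the model dimension itself were pruned, the hidden-state term would need an extra projection between the two.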

Related papers