Fugu-MT Paper Translation (Abstract): Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models

Paper overview: Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models

arxiv url: http://arxiv.org/abs/2603.05950v1
Date: Fri, 06 Mar 2026 06:32:57 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.187251
Title: Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models
Title (reference translation): Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models
Authors: Jialuo He, Huangxun Chen
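Abstract summary: We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task.
Reference score (independently computed attention score): 7.641622965415444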
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a certain proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8 ms per image.
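Abstract (reference translation): Reducing visual tokens is essential for accelerating Vision-Language Models (VLMs), but existing approaches rely on a fixed budget shared across all inputs and overlook the considerable variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a fixed proportion of the spectral energy, the method assigns more tokens to information-dense scenes while aggressively compressing redundant ones, without adding learnable parameters. E-AdaPrune is evaluated on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative gain on the MMVet reasoning task. With randomized singular value decomposition, the additional latency is limited to 8 ms per image.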
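
The budget rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `energy_token_budget`, the `energy_ratio` default of 0.9, and the definition of spectral energy as the sum of squared singular values are all assumptions made for illustration. It also uses NumPy's exact SVD for simplicity, whereas the abstract reports using randomized SVD to keep latency low, and it only computes how many tokens to keep, not which ones.

```python
import numpy as np


def energy_token_budget(features, energy_ratio=0.9, min_tokens=1):
    """Return the smallest k whose top-k singular values retain
    `energy_ratio` of the spectral energy of `features`.

    features: array of shape (num_tokens, dim) holding visual token features.
    energy_ratio: assumed threshold (hypothetical default, not from the paper).
    """
    # Singular values only, returned in descending order.
    s = np.linalg.svd(features, compute_uv=False)

    # Assumption: spectral energy is the sum of squared singular values.
    energy = s ** 2
    total = energy.sum()
    if total == 0:
        return min_tokens

    # Cumulative fraction of energy captured by the top-k components.
    cumulative = np.cumsum(energy) / total

    # First position where the cumulative fraction reaches the target.
    k = int(np.searchsorted(cumulative, energy_ratio, side="left")) + 1

    # Clamp to a valid range in case of floating-point round-off.
    return max(min_tokens, min(k, len(s)))


rng = np.random.default_rng(0)
# Redundant "image": every token is a scaled copy of one feature vector.
redundant = np.outer(rng.normal(size=576), rng.normal(size=64))
# Information-dense "image": tokens are independent random vectors.
dense = rng.normal(size=(576, 64))

budget_redundant = energy_token_budget(redundant)
budget_dense = energy_token_budget(dense)
```

Under these assumptions, the rank-one input receives a much smaller budget than the full-rank one, which matches the behaviour the abstract describes: more tokens for information-dense scenes and aggressive compression for redundant ones.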


Related papers