Fugu-MT Paper Translation (Abstract): One Layer's Trash is Another Layer's Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs

Paper overview: One Layer's Trash is Another Layer's Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs

arxiv url: http://arxiv.org/abs/2606.14277v1
Date: Fri, 12 Jun 2026 08:58:58 GMT
Status: Translation complete
Last updated in system: 2026-06-15 16:00:42.845538
Title: One Layer's Trash is Another Layer's Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs
Authors: Yongru Chen, Kai Zhang, Zeliang Zong, Yuchen Lu, Wenming Tan, Ye Ren, Jilin Hu
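Abstract summary: Large Vision-Language Models (LVLMs) have achieved remarkable success across diverse multimodal tasks, yet their practical deployment remains constrained by the computational burden arising from lengthy visual tokens. This paper proposes Adaptive Layer-wise Visual Token Selection (ALVTS), a novel framework that breaks away from the conventional static token pruning paradigm. With an 89% token compression ratio, ALVTS retains 96.7% of the original model's accuracy, achieving a superior efficiency-accuracy trade-off for LVLM inference.
Reference score (independently computed attention measure): 18.48496973561215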
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success across diverse multimodal tasks, yet their practical deployment remains constrained by the computational burden arising from lengthy visual tokens. While visual token pruning has emerged as a promising solution, existing methods suffer from a fundamental limitation: once tokens are pruned at a specific layer, they become inaccessible to all subsequent layers, leading to premature information loss that can compromise model performance. Through empirical studies, we observe that different layers exhibit distinct visual region focus, indicating a varying optimal token subset across layers. Motivated by this insight, we propose Adaptive Layer-wise Visual Token Selection (ALVTS), a novel framework that breaks away from the conventional static token pruning paradigm. ALVTS incorporates a lightweight token selector to identify and route important tokens for further processing, while allowing less important tokens to skip the layer, thus minimizing computational redundancy. These two streams of tokens are seamlessly reintegrated before being fed into subsequent layers, facilitating adaptive compression across the entire model. Grounded in our importance consistency constrained low-rank approximation, the proposed token selection module closely emulates the full attention mechanism, effectively capturing its essential patterns without requiring model retraining. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL validate the effectiveness of our method. With an 89% token compression ratio, ALVTS retains 96.7% of the original model's accuracy, achieving a superior efficiency-accuracy trade-off for LVLM inference.
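The routing idea described in the abstract (selected tokens go through the layer, the rest skip it, and both streams are merged before the next layer) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `route_through_layer`, the stand-in `toy_layer`, and the hand-written importance scores are all hypothetical, and the paper's actual selector (a low-rank approximation of attention) is not reproduced here.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def route_through_layer(
    hidden: Sequence[Vector],
    scores: Sequence[float],
    keep_ratio: float,
    layer_fn: Callable[[List[Vector]], List[Vector]],
) -> Tuple[List[Vector], List[int]]:
    """Process only the top-scoring tokens with layer_fn; the rest skip the layer."""
    n_tokens = len(hidden)
    k = max(1, round(keep_ratio * n_tokens))
    # Indices of the k highest-scoring tokens, restored to sequence order.
    ranked = sorted(range(n_tokens), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:k])
    processed = layer_fn([list(hidden[i]) for i in selected])
    # Reintegrate: skipped tokens are copied through unchanged.
    out = [list(v) for v in hidden]
    for idx, vec in zip(selected, processed):
        out[idx] = list(vec)
    return out, selected


def toy_layer(tokens: List[Vector]) -> List[Vector]:
    """Stand-in for a transformer layer (assumption): adds 1.0 to every feature."""
    return [[x + 1.0 for x in vec] for vec in tokens]


hidden = [[float(i), float(i)] for i in range(6)]
# Hypothetical per-layer importance scores; layer 2 favors tokens layer 1 skipped.
layer_scores = [
    [0.9, 0.8, 0.7, 0.1, 0.2, 0.3],
    [0.1, 0.2, 0.3, 0.9, 0.8, 0.7],
]
for scores in layer_scores:
    hidden, selected = route_through_layer(
        hidden, scores, keep_ratio=0.5, layer_fn=toy_layer
    )
    print(selected)  # tokens processed by this layer
```

In this toy run the second layer processes exactly the tokens the first layer skipped. Under static pruning those tokens would have been discarded after the first layer and could not be recovered, which is the limitation the abstract points to.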

Related papers