Fugu-MT Paper Translation (Abstract): HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

Paper overview: HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

arxiv url: http://arxiv.org/abs/2604.07812v1
Date: Thu, 09 Apr 2026 05:09:22 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:05.707786
Title: HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
Title (reference translation): HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
Authors: Qihui Zhu, Tao Zhang, Yuchen Wang, Zijian Wen, Mengjie Zhang, Shuangwu Chen, Xiaobin Tan, Jian Yang, Yang Liu, Zhenhua Dong, Xianzhi Yu, Yinfei Pan
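Abstract summary: In multimodal large language models (MLLMs), the surge of visual tokens significantly increases inference time and computational overhead. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. HAWK is a head importance-aware visual token pruning method that perceives the importance of attention heads in visual tasks in order to maximize the retention of crucial tokens.
Reference score (independently computed attention score): 41.41768757204328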
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads contribute equally to the visual interpretation. However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens. By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones. The proposed HAWK is entirely training-free and can be seamlessly applied to various MLLMs. Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models. The code is available at https://github.com/peppery77/HAWK.git.
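Abstract (reference translation): In multimodal large language models (MLLMs), the surge of visual tokens significantly increases inference time and computational overhead, making the models impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads contribute equally to visual interpretation. However, this study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. Based on this observation, the paper proposes HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks in order to maximize the retention of crucial tokens. By using head importance weights and text-guided attention to assess the significance of visual tokens, HAWK effectively retains task-relevant visual tokens while removing redundant ones. The proposed HAWK is entirely training-free and can be applied seamlessly to various MLLMs. Extensive experiments on multiple mainstream vision-language benchmarks show that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. It also reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models. The code is available at https://github.com/peppery77/HAWK.git.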
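
The abstract describes scoring visual tokens with head importance weights and text-guided attention, but it gives no formulas. The following is only a minimal sketch of that general idea, not the authors' implementation (see the repository linked above for that). It assumes the text-to-visual attention is available as a nested list indexed [head][text token][visual token], that head importance weights are supplied as an input, that the two are combined by a simple weighted average, and that tokens are kept by top-k selection. The function names `score_visual_tokens` and `select_tokens` are hypothetical.

```python
from typing import List, Sequence


def score_visual_tokens(
    attn: Sequence[Sequence[Sequence[float]]],
    head_weights: Sequence[float],
) -> List[float]:
    """Score each visual token by head-weighted, text-averaged attention.

    attn[h][t][v] is the attention that text token t pays to visual token v
    in head h (assumed layout). head_weights[h] is the importance of head h
    (assumed to be given; how it is estimated is outside this sketch).
    """
    if len(attn) == 0 or len(attn) != len(head_weights):
        raise ValueError("attn and head_weights must cover the same heads")
    total_weight = sum(head_weights)
    if total_weight <= 0:
        raise ValueError("head weights must sum to a positive value")

    num_visual = len(attn[0][0])
    scores = [0.0] * num_visual
    for head, weight in zip(attn, head_weights):
        num_text = len(head)
        for row in head:
            for v, a in enumerate(row):
                # Average over text tokens, then weight by normalized head importance.
                scores[v] += (weight / total_weight) * a / num_text
    return scores


def select_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)
    return sorted(ranked[:k])


# Toy input: 2 heads, 2 text tokens, 5 visual tokens (made-up numbers).
attn = [
    [[0.1, 0.6, 0.1, 0.1, 0.1], [0.2, 0.5, 0.1, 0.1, 0.1]],
    [[0.3, 0.1, 0.1, 0.4, 0.1], [0.3, 0.1, 0.1, 0.4, 0.1]],
]

weighted = select_tokens(score_visual_tokens(attn, [0.75, 0.25]), keep_ratio=0.4)
uniform = select_tokens(score_visual_tokens(attn, [0.5, 0.5]), keep_ratio=0.4)
print(weighted)  # [0, 1]
print(uniform)   # [1, 3]
```

In this toy input, treating both heads as equally important keeps tokens 1 and 3, while weighting the first head more heavily keeps tokens 0 and 1, which illustrates why unequal head importance can change which tokens survive pruning. The 80.2% pruning rate quoted in the abstract would correspond to `keep_ratio=0.198` in this sketch.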

Related papers