Fugu-MT Paper Translation (Abstract): CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

Paper summary: CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

arxiv url: http://arxiv.org/abs/2605.13178v1
Date: Wed, 13 May 2026 08:40:40 GMT
Status: Translation complete
System update date: 2026-05-14 23:30:27.919423
Title: CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models
Authors: Sangin Lee, Yukyung Choi
Abstract summary: LiteLVLM is a training-free, text-guided token pruning strategy for efficient pixel grounding inference. LiteLVLM outperforms existing methods by over 5% across diverse token budgets.
Reference score (independently computed interest level): 1.3750624267664158
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. However, these methods struggle with pixel grounding tasks, where token importance is highly contingent on the input text. Through an in-depth analysis of CLIP, we observe that visual tokens located within referent regions often exhibit low similarity to the textual representation. Motivated by this insight, we introduce LiteLVLM, a training-free, text-guided token pruning strategy for efficient pixel grounding inference. By reversing the ranking of CLIP's visual-text similarity, LiteLVLM effectively retains visual tokens covering the referent regions, while recovering context tokens to enable clear foreground-background separation. Extensive experiments demonstrate that LiteLVLM significantly outperforms existing methods by over 5% across diverse token budgets. Without any training or fine-tuning, LiteLVLM maintains 90% of the original performance with a 22% speedup and a 2.3x memory reduction. Our code is available at https://github.com/sejong-rcv/LiteLVLM.
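
The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that). It assumes the visual tokens and the text query are already embedded in a shared CLIP space; the function name `select_tokens`, the `context_ratio` parameter, and the even-stride rule used to pick context tokens are all hypothetical, since the abstract does not say how context tokens are recovered.

```python
import numpy as np


def select_tokens(visual, text, budget, context_ratio=0.25):
    """Return sorted indices of the visual tokens to keep.

    visual: (N, D) array of visual token embeddings (assumed CLIP-projected).
    text: (D,) text embedding in the same space.
    budget: total number of visual tokens to keep.
    context_ratio: hypothetical share of the budget spent on context tokens.
    """
    n_tokens = visual.shape[0]
    budget = min(budget, n_tokens)

    # Cosine similarity between every visual token and the text embedding.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text)
    sim = v @ t

    # Reversed ranking: tokens with the LOWEST similarity come first.
    order = np.argsort(sim, kind="stable")

    n_context = int(budget * context_ratio)
    n_referent = budget - n_context
    referent = order[:n_referent]

    # Assumption: context tokens are taken at an even stride from the
    # remaining tokens, in their original (spatial) index order.
    rest = np.sort(order[n_referent:])
    k = min(n_context, rest.size)
    if k > 0:
        context = rest[(np.arange(k) * rest.size) // k]
    else:
        context = np.empty(0, dtype=order.dtype)

    return np.sort(np.concatenate([referent, context]))
```

Calling `select_tokens(visual, text, budget=64)` on a `(576, D)` token array would return 64 indices that can be used to slice the visual token sequence before it is passed to the language model. Setting `context_ratio=0` keeps only the lowest-similarity tokens, which isolates the reversed-ranking idea from the assumed context rule.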

