Fugu-MT Paper Translation (Abstract): ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs

Paper summary: ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs

arxiv url: http://arxiv.org/abs/2603.24680v1
Date: Wed, 25 Mar 2026 18:01:19 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:47.923004
Title: ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs
Title (reference translation): ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs
Authors: An Yu, Ting Yu Tsai, Zhenfei Zhang, Weiheng Lu, Felix X.-F. Ye, Ming-Ching Chang
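Abstract summary: ReDiPrune is a training-free token pruning method applied before the vision-language projector. It selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. It consistently improves the accuracy-efficiency trade-off across four video and five image benchmarks.
Reference score (independently computed attention score): 16.523460406504604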
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent multimodal large language models are computationally expensive because Transformers must process a large number of visual tokens. We present ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural modifications, and can be seamlessly inserted between the encoder and projector. Across four video and five image benchmarks, it consistently improves the accuracy-efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than 6× in TFLOPs. Code is available at https://github.com/UA-CVML/ReDiPrune.
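Abstract (reference translation): Recent multimodal large language models are computationally expensive because the Transformer must process a large number of visual tokens. This paper proposes ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features are still rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from the vision encoder outputs and preserves fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, which ensures that the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, needs no retraining or architectural changes, and can be inserted seamlessly between the encoder and the projector. On four video and five image benchmarks, the accuracy-efficiency trade-off improves consistently. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of the visual tokens improves absolute accuracy by +2.0% while reducing computation by more than 6× in TFLOPs. The code is available at https://github.com/UA-CVML/ReDiPrune.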
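
The abstract names the two ingredients of the scoring rule (text-conditioned relevance and max-min diversity) but does not state the rule itself. The following is a minimal sketch of one generic way such a selection could look, not the authors' implementation (for that, see the repository linked above). Everything here is an assumption made for illustration: the function name `select_tokens`, the parameters `keep_ratio` and `alpha`, the use of cosine similarity, and the linear mix of the two terms are all hypothetical.

```python
import math


def _normalize(vec):
    """Return vec scaled to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def _dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_tokens(visual_feats, text_feat, keep_ratio=0.15, alpha=0.5):
    """Greedily pick token indices that are text-relevant and mutually diverse.

    Hypothetical illustration only; the paper's actual scoring rule may differ.
    visual_feats: list of pre-projection token vectors (one list per token)
    text_feat:    a single query vector in the same space (assumed available)
    keep_ratio:   fraction of tokens to keep (floored, at least one token)
    alpha:        weight on relevance; (1 - alpha) weights diversity
    """
    n = len(visual_feats)
    if n == 0:
        return []
    k = max(1, min(n, int(keep_ratio * n)))

    feats = [_normalize(v) for v in visual_feats]
    query = _normalize(text_feat)

    # Relevance: cosine similarity between each visual token and the text query.
    relevance = [_dot(f, query) for f in feats]

    # Seed the selection with the most relevant token.
    first = max(range(n), key=lambda i: relevance[i])
    selected = [first]
    chosen = {first}

    # min_dist[i]: cosine distance from token i to its nearest selected token.
    min_dist = [1.0 - _dot(f, feats[first]) for f in feats]

    while len(selected) < k:
        # Max-min diversity: prefer tokens far from everything already chosen,
        # traded off against relevance to the query.
        best = max(
            (i for i in range(n) if i not in chosen),
            key=lambda i: alpha * relevance[i] + (1.0 - alpha) * min_dist[i],
        )
        selected.append(best)
        chosen.add(best)
        for i in range(n):
            d = 1.0 - _dot(feats[i], feats[best])
            if d < min_dist[i]:
                min_dist[i] = d

    # Return indices in their original order so spatial/temporal order is kept.
    return sorted(selected)
```

In this sketch, setting `alpha=1.0` reduces the rule to plain top-k relevance, while lower values make the selection skip near-duplicate tokens even when they score well against the query. In a pipeline of the kind the abstract describes, the kept indices would be used to gather encoder outputs before they are passed to the projector.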


Related papers